Biological Data Analysis Using R
$\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$.
Exponential The exponential distribution has a continuous density function of
$P(x \mid \lambda) = \frac{1}{\lambda} e^{-x/\lambda}$.
Poisson The Poisson distribution is a discrete distribution whose density function is
$P(k \mid \lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!}$.
Later in the Exercises you will get to use some of these distributions.
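As a quick check on the Poisson formula, the sketch below compares a hand-written version of the density, e^(-lambda) lambda^k / k!, with R's built-in dpois(). The helper name poisson_density and the choice of lambda = 5 are assumptions made for illustration and do not come from the text.

```r
# Hand-written Poisson density: e^(-lambda) * lambda^k / k!
# (poisson_density is a hypothetical helper name used only for this sketch)
poisson_density <- function(k, lambda) {
  exp(-lambda) * lambda^k / factorial(k)
}

k <- 0:10
lambda <- 5                        # illustrative value, not taken from the text
by_hand  <- poisson_density(k, lambda)
built_in <- dpois(k, lambda)

# The two columns should agree to within floating-point error
print(cbind(k, by_hand, built_in))
print(all.equal(by_hand, built_in))
```

Because the Poisson is discrete, the densities summed over all possible k add up to one, which is another easy sanity check.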
Histograms
A histogram is a graphical display of data that has been tallied into bins (e.g., specific
buckets). How you define the bucket locations and sizes is up to you. You can specify
that there should be a specific number of buckets and R will make them equal sized, or
Figure 4.7: Examples of the densities of two normal distributions; the red one is drawn from a
random normal distribution with default values of $\mu = 0$ and $\sigma = 1$ and another in blue that has
$\mu = \sigma = 5$.
you can define ranges yourself. The function signature for the hist() function can be seen by typing
?hist in R:
hist(x, breaks = "Sturges",
     freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE,
     density = NULL, angle = 45, col = NULL, border = NULL,
     main = paste("Histogram of", xname),
     xlim = range(breaks), ylim = NULL,
     xlab = xname, ylab,
     axes = TRUE, plot = TRUE, labels = FALSE,
     nclass = NULL, ...)
There are several things we should notice about this function signature. First, this is the
first time that we've looked into a particular function and seen all the options. You can
see that several of the parameters are given what we call default values (e.g., the =VALUE
portions). That way, if we do not provide a particular value for a parameter such as main,
R will fill it in for us.
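To see default values in action, the following sketch calls hist() twice on the same data, once relying on the default breaks and once overriding it with explicit break points. The variable names, the seed, and the bin width of 0.5 are assumptions made for illustration; plot = FALSE is used only so that the bin counts can be inspected without drawing anything.

```r
set.seed(42)                  # arbitrary seed so the sketch is repeatable
x <- rnorm(100)

# Rely on the defaults: breaks = "Sturges" chooses the bins for us
h_default <- hist(x, plot = FALSE)

# Override the default by supplying explicit, equally spaced break points
my_breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.5)
h_custom  <- hist(x, breaks = my_breaks, plot = FALSE)

length(h_default$counts)      # number of bins chosen by the default rule
length(h_custom$counts)       # number of bins implied by my_breaks
sum(h_custom$counts)          # every observation lands in exactly one bin
```

Any argument that is not named in the call, such as main or xlab, keeps the default shown in the signature.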
The first thing that you typically want to change in a graphic is the default values for
the axis labels and the title of the graph. It is not commonly accepted practice to provide
titles on graphs for most publication-quality graphics, but sometimes it is helpful when
you are putting together a talk or just analyzing the data and making graphics for your
own interpretation. To change the default values of the axis labels and set an empty title
you would do the following (shown in Figure 4.8):
> hist(rnorm(100), xlab="My Defined Bin Categories", ylab="Frequency", main="")
Figure 4.8: Histogram with labels and main title changed.
Again, I am using the function rnorm() to generate the data from a random normal
distribution here. It is perfectly OK to give empty values to things like titles and such.
Density Plots
A density plot is one where the probability density is calculated and turned into a line
across the domain rather than a histogram. Here I will combine the histogram and
density plots to show how to overlay two graphs on the same values.
> data <- rpois(lambda=5, n=1000)
> den <- density(data)
> den
Call:
        density.default(x = data)
Data: y (1000 obs.);    Bandwidth 'bw' = 0.5061
       x                y
 Min.   :-1.518   Min.   :3.567e-05
 1st Qu.: 2.491   1st Qu.:8.145e-03
 Median : 6.500   Median :3.973e-02
 Mean   : 6.500   Mean   :6.229e-02
 3rd Qu.:10.509   3rd Qu.:1.219e-01
 Max.   :14.518   Max.   :1.689e-01
> yrange <- range(den$y)
> xrange <- range(den$x)
> hist(data, ylim=yrange, xlim=xrange, xlab="Value of Random Poisson",
+      ylab="Frequency", main="", probability=T, bty="n")
> par(new=T)
> plot(den, col="red", lwd=2, xlab="", ylab="", main="", bty="n")
Figure 4.9: Histogram of 1000 random numbers drawn from a Poisson distribution with the
parameter set to 5. The red line indicates the density of the values.
There are some things to point out with this plot.
1. I saved the values of data as a variable because I needed to plot the same set of
random values as a histogram and as a density plot. Had I not saved them, I
would be using a different collection of random numbers for each plot and they
wouldn't match.
2. I used the function density() to calculate the probability density function for the values
of data. The object returned by density() has two components, an x variable and a y variable.
The probability density is calculated as a probability rather than as a frequency
count (as the histogram is by default), which is why probability=T was passed to hist().
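An alternative to par(new=T) that avoids matching the axis ranges by hand is to draw the histogram first and then add the density curve to the same plot with lines(). This is a sketch of that swapped-in approach rather than the code used for Figure 4.9; the seed and the variable names are assumptions made for illustration.

```r
set.seed(1)                               # arbitrary seed for repeatability
counts <- rpois(n = 1000, lambda = 5)     # save the draws so both layers match
den <- density(counts)
h   <- hist(counts, plot = FALSE)         # computed only to size the y axis

# Draw the histogram on the probability scale, leaving room for the curve
hist(counts, probability = TRUE, main = "",
     xlim = range(den$x), ylim = c(0, max(h$density, den$y)),
     xlab = "Value of Random Poisson", ylab = "Density")

# lines() adds to the existing plot, so both layers share one set of axes
lines(den, col = "red", lwd = 2)

# The estimated density should integrate to roughly 1 (trapezoid rule)
area <- sum(diff(den$x) * (head(den$y, -1) + tail(den$y, -1)) / 2)
area
```

Because lines() draws on the current plot, there is no second set of axes to suppress, which is the main reason to prefer it when the two layers share the same scale.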
4.3 Descriptive Statistics
Descriptive statistics are valuable tools in understanding particular patterns in your
data. For the purposes of this section, we will assume that the experiments that
are producing your data yield one of two different data types. First, observations from
your data could be considered random variables: measurements that produce a real
number. Examples of random variables may be body size, dissolved oxygen, available
light, etc. A collection of random variables will be denoted as X with elements $x_i$; $i = 1 \ldots N$
(e.g., indexing across all N individual observations). The other kind of data we
will be examining here are categorical data. Your observations are grouped into distinct
categories and consist of relative counts of each category. Examples of this include
stage-dependent demographic tallies, gender of your study organisms, some types of
genetic data, disease prevalence, etc. Categorical data will be denoted as Y, consisting
of K categories, and the number of counts observed in each category will be referred to
as $y_i$; $i = 1 \ldots K$.
There are two general properties of random variables that we will spend a little time
discussing because they form the basis of how we examine our data. First, the mean
of a random variable, usually denoted by the symbol $\mu$, is a measure of the central
tendency of your variable (a center of gravity, so to speak). We are all familiar with
the concept of the mean, but in a general sense, the mean is just one of several moments of
a distribution. We now turn to this particular moment and then discuss some of the
higher moments.
4.3.1 Moments
There are several properties of random variables that we may be interested in estimating.
Notice that here I used the term estimate rather than compute; this is on purpose. We
will be making estimates of real parameters of the data, and we do so because in most
cases we do not have all the data at our disposal. Rather, we have created a sample of
our data from which we make inferences. To get all the data, we would have to sample
EVERY single instance out there, and in most cases this is not possible.
There are two common properties that you will probably recognize immediately (I hope)
and use all the time. These are the mean and variance of the data and are estimated in R
using the functions mean() and var(). Figure 4.10 shows what is being measured by these
estimators. This figure was created using the density() function from rnorm(1000000).
The mean, shown by the dashed line and the symbol $\mu$, is located at the center of gravity
of the data. The image also shows the standard deviation (which is the square root of the
variance, $\sigma = \sqrt{\sigma^2}$) as indicated by the dotted line. R has a function for both the
variance, var(), and the standard deviation, sd().
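To make the relationship between these estimators concrete, the sketch below computes the mean, variance, and standard deviation by hand and compares each with the built-in function. The seed, sample size, and variable names are assumptions made for illustration.

```r
set.seed(10)                  # arbitrary seed for repeatability
x <- rnorm(1000, mean = 0, sd = 1)

n <- length(x)
mean_by_hand <- sum(x) / n
var_by_hand  <- sum((x - mean_by_hand)^2) / (n - 1)   # sample variance divides by n - 1
sd_by_hand   <- sqrt(var_by_hand)                     # sd is the square root of the variance

c(built_in = mean(x), by_hand = mean_by_hand)
c(built_in = var(x),  by_hand = var_by_hand)
c(built_in = sd(x),   by_hand = sd_by_hand)
```

Note that var() and sd() use the n - 1 denominator, so they are sample estimates rather than population values.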
There are two more measures of distributions that we should discuss while we are here.²
These are the skew and kurtosis of the distribution. In R these functions are not loaded
into memory by default and we must load the moments library to gain access to them. To
load this library type:
> library(moments)
If R reports an error instead, the moments library is not installed.
In this case, see Appendix B for instructions on how to add libraries to your installation
of R.
² Actually, all four of these measures are known as the first four moments of the distribution. The first four
moments, $\mu_k$; $k = 1 \ldots 4$, can be calculated by $\mu_k = E[(X - \mu)^k]$.
Figure 4.10: Example locations for the first two moments of a Normal (N(0, 1)) distribution.
The skew of a distribution is a measure of how pushed-over the main lump of the
distribution is (again, not a very statistical definition here). Distributions can have either a
positive or a negative skew; compare the images in Figure 4.11.
A distribution is said to have a negative skew if the direction of the longer tail is to
the left. In these cases, typically, the mean < median < mode. Conversely, a distribution has a
positive skew if the tail is on the right, and typically the mean > median > mode. Distributions
where these measures are equal are said to not have any skew. Skew is estimated in R
using the function skewness().
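For readers without the moments library installed, the sign of the skew can be explored with a hand-rolled moment-based estimator. This is a sketch under the assumption that skew is measured as the third central moment divided by the 1.5 power of the second central moment; skew_by_hand is a hypothetical helper, and the samples are made up for illustration.

```r
# Hand-rolled moment-based skewness: third central moment over the
# 1.5 power of the second central moment (both averaged over n).
# skew_by_hand is a hypothetical helper, not a function from the moments library.
skew_by_hand <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(3)                       # arbitrary seed for repeatability
right_tailed <- rexp(10000)       # long tail to the right
left_tailed  <- -right_tailed     # mirror image, long tail to the left
symmetric    <- rnorm(10000)

skew_by_hand(right_tailed)        # positive
skew_by_hand(left_tailed)         # negative
skew_by_hand(symmetric)           # close to zero
```

Mirroring the data flips the sign of the estimate without changing its size, which matches the left and right panels described for Figure 4.11.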
The kurtosis of a distribution is a measure of the peakedness of a distribution. This
term comes from the Greek word kurtos that means bulging. A simple example of how
kurtosis looks is found in Figure 4.12 with three different distributions (the normal,
logistic, and uniform), each with a different level of kurtosis.
Figure 4.11: Negative (left) and positive (right) skewed distributions. In both of these examples the dotted
line connects the mode of the distribution (the top peak) to the mean (on the x axis). The direction
of this lean determines if the distribution has a negative (left) or positive (right) skew.
In general, the function for kurtosis is:
$$K = \frac{\mu_4}{\sigma^4} - 3$$
The correction factor (the - 3 part of the equation) is a normalizing constant that allows
the kurtosis of a normal distribution to be equal to zero. Below are the raw data and the
kurtosis estimates used in producing Figure 4.12.
> normData <- rnorm(100000)
> logisticData <- rlogis(100000)
> unifData <- runif(100000)
> kurtosis(normData) - 3
[1] 0.02320046
> kurtosis(logisticData) - 3
[1] 1.219505
> kurtosis(unifData) - 3
[1] -1.197009
The estimate for the normal distribution is not quite equal to zero because the data
were created by drawing random numbers rather than by specifying
the distribution directly. One benefit of the - 3 correction factor is that it allows you to
quickly tell the different types of kurtosis by looking at the value of the estimate. In
general, the following types of kurtosis are recognized:
Platykurtic Curves that have negative excess kurtosis (e.g., kurtosis() - 3 < 0).
Mesokurtic Curves that do not have excess kurtosis (e.g., kurtosis() - 3 = 0).
Leptokurtic Curves that have positive excess kurtosis (e.g., kurtosis() - 3 > 0).
Figure 4.12: Three distributions (normal, logistic, and uniform) showing different levels of
kurtosis.
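The three categories can be told apart programmatically from the sign of the excess kurtosis. The sketch below uses a hand-rolled estimator so that it runs without the moments library; excess_kurtosis and classify_kurtosis are hypothetical helpers, and the tolerance of 0.1 for "close enough to zero" is an arbitrary assumption, since an estimate from random data is never exactly zero.

```r
# Excess kurtosis from central moments: m4 / m2^2 - 3.
# excess_kurtosis and classify_kurtosis are hypothetical helpers for this sketch.
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# tol is an arbitrary cut-off for "close enough to zero"
classify_kurtosis <- function(x, tol = 0.1) {
  k <- excess_kurtosis(x)
  if (k < -tol) "platykurtic" else if (k > tol) "leptokurtic" else "mesokurtic"
}

set.seed(7)                          # arbitrary seed for repeatability
classify_kurtosis(runif(100000))     # flat-topped, so platykurtic
classify_kurtosis(rnorm(100000))     # mesokurtic
classify_kurtosis(rlogis(100000))    # heavy-tailed, so leptokurtic
```

With smaller samples the estimate is noisier, so a wider tolerance (or a formal test) would be needed before labelling a data set mesokurtic.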
The last summary statistic we will cover here is the range(), which returns a two-item
vector containing the minimum and maximum values. In fact, the range() function calls
min() and max() directly. There is little to discuss about this particular set of
functions...
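A short sketch of the relationship between these three functions follows; the data are made-up Poisson draws and the seed is arbitrary.

```r
set.seed(8)                # arbitrary seed for repeatability
x <- rpois(50, 5)

range(x)                   # two-item vector: minimum, then maximum
c(min(x), max(x))          # the same two numbers, built by hand
diff(range(x))             # the spread between the extremes, if one number is wanted
```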
Creating a matrix of Plots
It is often desirable to create more than one plot on a graphic, but not overlaid on
top of each other as was explained in Section 4.1.1. To do this, we need to adjust one
of the graphics properties using the function par(). The property we need to change is
mfrow=c(nr,nc). This will create a matrix of plots that has nr rows and nc columns.
An example of creating a matrix of plots is given in the code below and depicted in Figure
4.13.
Figure 4.13: Matrix of four plots created from random numbers sampled from the normal,
Poisson, exponential, and the logistic distributions.
> par(mfrow=c(2, 2))
> hist(rnorm(100000))
> hist(rpois(100000, 1))
> hist(rexp(100000))
> hist(rlogis(100000))
Subsequent calls to plotting functions will reuse this graphic figure and replot the
graphs in the nr x nc matrix. This graphic window will have the nr x nc matrix of plots
until it is either closed or you change the mfrow property to something else.
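Because the mfrow setting persists, a common habit is to save the old graphics settings when changing them and restore them afterwards. The sketch below relies on the fact that par() returns the previous values of whatever it changes; the pdf(NULL) line is an assumption made so that the example runs without opening a window or writing a file, and can be removed to see the plots on screen.

```r
pdf(NULL)                             # null device: nothing is shown or saved
old_par <- par(mfrow = c(2, 2))       # par() hands back the previous settings

hist(rnorm(1000),    main = "Normal")
hist(rpois(1000, 1), main = "Poisson")
hist(rexp(1000),     main = "Exponential")
hist(rlogis(1000),   main = "Logistic")

layout_during <- par("mfrow")         # 2 2 while the matrix is active
par(old_par)                          # restore the single-plot layout
layout_after <- par("mfrow")          # back to 1 1
dev.off()
```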
4.3.2 Non-Parametric Parameters
Non-parametric statistics are generally concerned with the analysis of data that does
not make assumptions about the underlying statistical distributions. There are several
commonly known non-parametric statistics such as the Binomial Test, Goodness of Fit,
the Mann-Whitney Test, and the Kruskal-Wallis test. In this section, we will explore
some of the methods that R can use to describe data without assuming an underlying
distribution.
The first summary statistic outlined here will be the quantile. While you have probably not
heard of this particular descriptive statistic, you most likely will have run across terms
such as a median, quartile, or percentile. All of these are particular kinds of quantiles,
which will be obvious when we consider the formal definition of a quantile.
Quantile A $p^{th}$ quantile is the value $x_p$ such that, when considering the data (X), the probability
$P(X < x_p) \le p$ and the probability $P(X > x_p) \le 1 - p$.
While this may be statsy, it generally says that the $50^{th}$ quantile is the value $x_{50}$ in
the distribution where 50% of the data is less than $x_{50}$ and 50% is greater than $x_{50}$. Thus
far, you have probably called this the median (and R has a median() function if you like to
call it that). More generally though, we can consider the $95^{th}$ quantile analogous to what
we were discussing in Section 4.1.1 when we were trying to figure out critical regions
of the $\chi^2$ distribution. The main distinction here is that in Section 4.1.1 we implicitly used
the known distributional form of the $\chi^2$ function to find the critical value, whereas in
non-parametric approaches we typically apply the approach of putting everything into
a vector, sorting it, and counting to where the quantile is located in the list. As a result, the
$50^{th}$ quantile (or median) can be considered a measure of central tendency of the sorted
data.
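The sort-and-count idea can be written in a few lines. The sketch below is one simple version of it, assuming the quantile is taken to be the value in position ceiling(p * n) of the sorted data; sorted_quantile is a hypothetical helper. R's quantile() offers several interpolation rules, and type = 1 is, as far as I can tell, the one that corresponds to this plain counting rule, so it is used for the comparison.

```r
# sorted_quantile is a hypothetical helper: sort, then count up to position ceiling(p * n)
sorted_quantile <- function(x, p) {
  sorted <- sort(x)
  sorted[ceiling(p * length(x))]
}

set.seed(11)                      # arbitrary seed for repeatability
x <- rpois(1000, 5)
p <- c(0.25, 0.50, 0.75)

by_counting <- sapply(p, function(pp) sorted_quantile(x, pp))
built_in    <- quantile(x, probs = p, type = 1, names = FALSE)

rbind(by_counting, built_in)
```

The default rule used by quantile() interpolates between neighbouring sorted values, so on other data it can differ slightly from this counting version.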
Quantiles can also be used to look at the dispersion of data. In parametric statistics
we discussed parameters such as the variance and standard deviation that define the
dispersion of values around the mean. The notion of quantiles can be used in a similar
way. The values of x that give the lower and upper quartiles (e.g., the $25^{th}$ and $75^{th}$
quantiles) provide a range of the data X where the inner 50% of the values lie. These
are often called the inner quartiles of the data. To illustrate the use of the quantile
function, consider the data in Figure 4.14 consisting of 1000 numbers drawn from a
Poisson random distribution with a centrality parameter $\lambda = 5$.
The quantile() function in R by default provides the $0^{th}$ quantile (e.g., the minimum), the
$25^{th}$ quantile, the $50^{th}$ quantile (the median), the $75^{th}$ quantile, and the $100^{th}$ quantile
(e.g., the maximum). For the data that produced the histogram in Figure 4.14, the quantiles
are:
> x <- rpois(1000, 5)
> quantile(x)
  0%  25%  50%  75% 100%
   0    3    5    6   12
showing that the center of dispersion is 5 and the inner quartile ranges from 3 to 6.
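If only the inner quartiles are wanted, quantile() can be asked for just those two probabilities, and the width of that interval is available directly from IQR(). The data below are fresh draws with an arbitrary seed, so the numbers will not necessarily match the output shown in the text.

```r
set.seed(5)                               # arbitrary seed for repeatability
x <- rpois(1000, 5)

q <- quantile(x, probs = c(0.25, 0.75))   # lower and upper quartiles only
q
unname(q[2] - q[1])                       # width of the inner 50% of the data
IQR(x)                                    # built-in shortcut for the same quantity
```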
4.4 Relationships Between Pairs of Variables
There are often times when we are interested in knowing about the simultaneous changes
in two or more variables. Individually, we can estimate the mean, variance, skew,
kurtosis, and various ranges, but this does not tell us about how the variables interact
together. For this we need to look at measures that explain the relationship between
variables.
Figure 4.14: Distribution of random numbers drawn from rpois(1000,5).
4.4.1 Covariance & Correlation
The covariance of two variables is defined as:
$$c_{ij} = E[(X - \mu_X)(Y - \mu_Y)]$$
and measures the degree to which one variable X changes as another Y changes.
Covariance estimates may be positive or negative as long as the two variables are not the
same; when they are the same, the covariance is a variance, and there is no such thing as a
negative variance. Two variables that have a covariance equal to zero are said to be uncorrelated
(although if you don't know what a correlation is this moniker is kinda sucky).
In R the covariance between two vectors of values is estimated by the function cov().
Needless to say, the length of the two variables must be the same or R will rightly
complain.
> X <- c(1,34,5,23,6,43,56,28,33,7)
> Y <- runif(10,1,100)
> Y
 [1] 90.112843 47.236585 17.148708  3.861546 54.871332 57.234582  8.072745
 [8]  6.000811 84.546069 17.960688
> plot(X, Y)
> cov(X, Y)
[1] 2231.952
Figure 4.15: Scatter plot of some semi-random points.
So here I just pounded on my numeric keypad and made up the numbers for X (not
quite random but pretty good) and then had R make some numbers for Y by drawing
from a uniform distribution with runif(), selecting 10 values in the range 1 to 100. You can see
that the values that I used produced a smattering of points (Figure 4.15).
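To connect cov() back to its definition, the sketch below computes the sample covariance by hand, using the n - 1 denominator that cov() uses, and also shows that the covariance of a variable with itself is its variance. The X values are the hand-typed ones from the text; Y is drawn afresh with an arbitrary seed, so the result will differ from the output printed above.

```r
set.seed(15)                                   # arbitrary seed for repeatability
X <- c(1, 34, 5, 23, 6, 43, 56, 28, 33, 7)    # the hand-typed values from the text
Y <- runif(10, 1, 100)                         # fresh draws, so results differ from the text

n <- length(X)
cov_by_hand <- sum((X - mean(X)) * (Y - mean(Y))) / (n - 1)

cov_by_hand
cov(X, Y)      # should match the hand calculation

cov(X, X)      # covariance of a variable with itself...
var(X)         # ...is its variance
```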
4.4.2 Tests For Correlation
There are parametric and non-parametric methods for looking at the relationship among
pairs of variables. In general, all correlations between two random variables (X, Y)
should have the following characteristics:
• The value of a correlation is strictly bound on the interval [-1, 1].
• If larger values of X tend to be associated with larger values of Y then the
correlation should approach +1 as the association becomes stronger. We call this a
positive correlation.
• If smaller values of X tend to be associated with larger values of Y then the
correlation should approach -1 as the association becomes stronger. We call this a
negative correlation.
• If there is no general relation between the variables X and Y then the correlation
statistic should approach 0. We call this a relationship where the variables are
uncorrelated.
The most commonly used measure of correlation is Pearson's product moment
correlation, r, that is calculated as:
$$r = \frac{\sum_{i=1}^{N} (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{x})^2 \sum_{i=1}^{N} (Y_i - \bar{y})^2}} \qquad (4.1)$$
where the $\bar{x}$ and $\bar{y}$ values are the means of the N sampled variables in X and Y.
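The product moment correlation can be computed directly from sums of deviations about the means and checked against R's cor(). In the sketch below, pearson_by_hand is a hypothetical helper and the data are made up with an arbitrary seed; they are not the data used elsewhere in this chapter.

```r
# pearson_by_hand is a hypothetical helper written for this sketch
pearson_by_hand <- function(x, y) {
  dx <- x - mean(x)                 # deviations from the mean of x
  dy <- y - mean(y)                 # deviations from the mean of y
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

set.seed(21)                        # arbitrary seed for repeatability
X <- 1:20
Y <- 2 * X + rnorm(20, sd = 5)      # made-up data with a positive trend

pearson_by_hand(X, Y)
cor(X, Y)                           # should match the hand calculation
```

The square root in the denominator is what keeps r on the interval from -1 to 1.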
Figure 4.16: Example plot of two variables used to test correlations.
In R the test for correlation is performed with the cor.test() function. To demonstrate, we
will use the following data shown in Figure 4.16:
> X <- 1:20
> Y <- c(17, 7, 12, 12, 4, 11, 10, 2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)
> cor.test(X, Y)
        Pearson's product-moment correlation
data:  X and Y
t = 7.3194, df = 18, p-value = 8.489e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6848344 0.9456427
sample estimates:
      cor
0.8651642
The correlation between these two variables is r = 0.865, which is both large and positive
as expected by looking at the graph. By default when you use cor.test(), it will use the
Pearson product moment approach. There are two additional approaches for estimating
correlation, developed by Spearman and Kendall, but these two are considered
non-parametric methods based upon ranks rather than on the formula shown in Eqn. 4.1
and will be left until 5.2.1 when we can fully discuss how they work. The output also
includes a significance test and a display of the 95% confidence interval, which are very
useful.
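The pieces of the printed output can also be pulled out of the object that cor.test() returns, which is handy when a later calculation needs the estimate or the P-value. The sketch below uses made-up data with an arbitrary seed rather than the data behind Figure 4.16; the component names are those of the standard "htest" object.

```r
set.seed(30)                        # arbitrary seed for repeatability
X <- 1:20
Y <- 2 * X + rnorm(20, sd = 8)      # made-up data with a positive trend

result <- cor.test(X, Y)            # Pearson by default

result$estimate     # the correlation, r
result$statistic    # the t statistic
result$parameter    # degrees of freedom, N - 2
result$p.value      # P-value for the null hypothesis of zero correlation
result$conf.int     # 95% confidence interval for r
```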
4.5 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
dchisq(x,df) Returns the density of the $\chi^2$ distribution with df degrees of freedom.
df(x,df1,df2) Returns the density of the F distribution with df1 and df2 degrees of
freedom.
dnorm(x) Returns the density of a normal distribution at x.
mean(x) Calculates the mean of the values in x.
pchisq(x,df) Returns the cumulative probability of the $\chi^2$ distribution with df degrees of
freedom at x.
pf(x,df1,df2) Returns the cumulative probability of the F distribution with df1 and df2 degrees
of freedom at x.
plot(x) This is the main wrapper function that creates a graphical display of the
variable(s) that you pass to it. Depending upon the variables passed, it will create
different types of plots.
pnorm(x) Returns the cumulative probability of a normal distribution at x.
qchisq(x,df) Returns the quantile of the $\chi^2$ distribution with df degrees of freedom for the
probability x.
qf(x,df1,df2) Returns the quantile of the F distribution with df1 and df2 degrees of
freedom for the probability x.
qnorm(x) Returns the quantile of a normal distribution for the probability x.
rchisq(x,df) Returns x random numbers from the $\chi^2$ distribution with df degrees of
freedom.
rf(x,df1,df2) Returns x random numbers from the F distribution with df1 and df2
degrees of freedom.
rnorm(x) Returns x random numbers from the normal distribution.
sd(x) Returns the sample standard deviation of data in x.
table(f) This function takes the list of levels in the factor f and makes a table from
it.
var(x) Estimates the sample variance, $s^2$, from the values in x.
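The d, p, q, and r prefixes follow the same pattern for every distribution, which the short sketch below illustrates for the normal and $\chi^2$ distributions. The particular input values and the seed are chosen only for illustration.

```r
# d*: height of the density curve at a point
dnorm(0)                    # peak of the standard normal density

# p*: cumulative probability up to a point
pnorm(1.96)                 # about 0.975

# q*: the inverse of p*, going from a probability back to a value
qnorm(0.975)                # about 1.96
qchisq(0.95, df = 1)        # about 3.84

# r*: random draws from the distribution
set.seed(2)                 # arbitrary seed for repeatability
rnorm(5)

# p* and q* undo one another
qnorm(pnorm(1.5))           # recovers 1.5
```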
4.6 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. What are the critical values for a $\chi^2$ distribution with df = 8 if you are assuming that
$\alpha = [0.2, 0.1, 0.01, 0.001]$?
2. Create a scatter plot using the variables x <- rnorm(10) and y <- rpois(10,1). Label the axes
"Jaw Size" and "Number of Kids".
3. For the probabilities p = seq(0.1,0.9,by=.1) create a graph that has a red line representing
the quantile function for the Poisson distribution (qpois with $\lambda = 1$) and a blue one
representing the quantile function for the $\chi^2$ distribution (qchisq with df = 1). Make
sure to have your axes labeled and drawn properly. Save the image and include it in
your answer.
4. In a Platykurtic distribution what is the relationship between the mean, mode, and
median?
5. Create a histogram of 1000 random numbers drawn from the F-distribution with
parameters df1 = 1 & df2 = 10. On this plot, overlay the density using the density
function. Label the axes appropriately.
6. What is the inner-quartile of the data x <- rnorm(200,3)?
7. Is the data from the command x <- rf(1000,1,10) lepto-, meso-, or platykurtic? How do you
know?
8. Explain what is happening with the command data <- LETTERS[rpois(23, 2)]. Create a new
variable that is a table of the results of this command, show me the table, and show
how you would access the B element in the table.
9. What is the range of possible values you can get for a Pearson's Product-Moment
Correlation?
10. There is a data set named HWCorrelationData.csv in the folder. Load this data into R,
plot it with an appropriate graphic, and then test the hypothesis $H_O$: Height is independent
of Weight.
Chapter 5
Contingency Tables
In this chapter we will examine non-parametric methodologies that are available for
the analysis of random variables. It is not uncommon in Biology to encounter the notion
that non-parametric approaches are only to be used with categorical (e.g., nominal) data.
However, non-parametric analyses are just as applicable to normal ordinal and interval
data that we commonly come into contact with and in this Chapter we will go over a few
examples of how you can use general non-parametric statistical approaches in your own
research.
In this Chapter you will learn the following skills:
• Non-parametric analysis of a single categorical data set ($x_1, x_2, \ldots, x_N$) using a
$\chi^2$ test.
• Non-parametric analysis of paired data ($(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$) using the Fisher
Exact test for small data sets and the general $\chi^2$ test for large data sets.
• Non-parametric analysis of several random samples using the Kruskal-Wallis test.
For most of the exercises in this chapter you will use functions from the stats library. R
normally loads this library at startup; if it is not available in your session, issue
the command: library(stats).
5.1 One Random Sample
For this section, we will assume that your data consist of N observations made on a
single variable, $X = [x_1, x_2, \ldots, x_N]$.
5.1.1 Goodness of Fit
The $\chi^2$ test for goodness of fit is the typical $\chi^2$ test that we have all had a million times
as an undergraduate and a graduate student. The data for this test consist of N
observations that can be categorized into K discrete categories. In R we will use the factor
data type (see 2.4.10 for more on the factor type).
The assumptions of this test are:
1. All the observations are selected randomly.
2. You can assign an observation to one of the K categories without error.
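The test statistic for this analysis is the calculated $\chi^2_{Calc}$, which is:
$$\chi^2_{Calc} = \sum_{i=1}^{K} \frac{(O_i - E_i)^2}{E_i} \qquad (5.1)$$
where $O_i$ and $E_i$ are the observed and expected counts in category i.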
The underlying distribution of $\chi^2_{Calc}$ will be approximated using the $\chi^2$-distribution with
$K - 1$ degrees of freedom. From the discussion of this distribution and its depiction in
Figure 4.2, it is large values of $\chi^2_{Calc}$ that will lead to the rejection of the null hypothesis,
$H_O$.
Example Problem: Assume that we have captured a sample of the Marbled Salamander,
Ambystoma opacum, from the Rice Center for Environmental Studies (a field station for
Virginia Commonwealth University). On each of these individuals we have classified
their marbling pattern as either Little White ($N_A = 24$), Moderate White Marbling ($N_B = 47$),
or Mostly White ($N_C = 29$). A separate crossing experiment has suggested that the
marbling on an individual may be under the control of a limited number of genetic loci
and has predicted that the frequency of these types would be 1 : 2 : 1 in populations at
equilibrium. Do the proposed mechanisms predict the distribution of phenotypes that you
sampled from the wild? To test the hypothesis, $H_O$: Phenotypes occur at a ratio of 1 : 2 : 1,
in R we would:
> Phenotypes <- as.factor(c(rep("Little White", 24),
+     rep("Marbled", 47), rep("Mostly White", 29)))
> p <- c(1, 2, 1)
> p <- p / sum(p) # makes p a vector of probabilities
> table(Phenotypes)
Phenotypes
Little White      Marbled Mostly White
          24           47           29
> chisq.test(table(Phenotypes), p = p)
        Chi-squared test for given probabilities
data:  table(Phenotypes)
X-squared = 0.86, df = 2, p-value = 0.6505
So here, the observed and expected values were relatively close to each other, producing
a $\chi^2_{Calc}$ (in R called X-squared) of 0.86, which with df = 2 has a P-value of 0.6505. Not
something that would be considered rare. As a result, we fail to reject $H_O$ that the ratio
of phenotypes is 1 : 2 : 1.
Here the thing that was passed to the chisq.test function was an object of class table. This
is only one way that you can pass data to the chisq.test function. See ?chisq.test for more
information on other ways to pass your data to this function.
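The same result can be reproduced by hand, which makes clear what chisq.test() is doing with the counts and the 1 : 2 : 1 expectation. The sketch below uses the salamander counts from the example; the variable names are my own.

```r
observed <- c(24, 47, 29)                        # counts from the salamander example
expected <- sum(observed) * c(1, 2, 1) / 4       # 1 : 2 : 1 ratio applied to N = 100

chi_calc <- sum((observed - expected)^2 / expected)
df       <- length(observed) - 1                 # K - 1 degrees of freedom

# Large values reject the null, so the P-value is the upper tail
p_value <- pchisq(chi_calc, df = df, lower.tail = FALSE)

chi_calc    # 0.86
p_value     # about 0.6505
```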
5.1.2 Binomial Test
The binomial test evaluates the support for the probability (p) that an observation was
categorized into one of two groups. The following assumptions are inherent in the
binomial test:
1. Each observation has the ability to be characterized as either Category A or
Category B and the probability of assigning to A is denoted as p (and B as $1 - p$).
2. Each of the N observations is mutually independent.
The binomial test tests to see if the number of items you have classified as Category A
is rare given a specified probability, p. The test itself is performed using the binom.test()
function. In the example below, I am considering the situation where a coin was flipped
20 times and was found to have shown Heads only six times. The hypothesis is $H_O$:
p = 0.5. The function itself needs a few pieces of data: the number of times Category A
was observed (as x), the total number of trials (as n), and the hypothesized probability p.
Calling it with these data would be done as:
> binom.test(x=6, n=20, p=0.5)
        Exact binomial test
data:  6 and 20
number of successes = 6, number of trials = 20, p-value = 0.1153
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.1189316 0.5427892
sample estimates:
probability of success
                   0.3
These results suggest that even with only 6 observed Heads in 20 flips, we cannot reject
$H_O$ that it is a fair coin. However, the 95% confidence interval shows that there is a large
range of values we cannot reject...
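The two-sided P-value can be rebuilt from the binomial probabilities themselves. The sketch below assumes the usual exact-test convention of adding up the probabilities of every outcome that is no more likely than the one observed; the small relative tolerance is an assumption added to guard against floating-point ties.

```r
n <- 20; x <- 6; p <- 0.5

# Probability of every possible number of Heads, 0 through 20
prob_each <- dbinom(0:n, size = n, prob = p)

# Two-sided P-value: total probability of outcomes no more likely than the observed one
p_two_sided <- sum(prob_each[prob_each <= dbinom(x, n, p) * (1 + 1e-7)])

p_two_sided                     # about 0.1153
binom.test(x, n, p)$p.value     # the value reported by binom.test()
```

For a fair coin the distribution is symmetric, so this is the same as doubling the probability of seeing six or fewer Heads.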
5.1.3 General Contingency Tables
For this next application of contingency tables we will focus on data describing the
diversity of students in the College of Humanities & Sciences at Virginia Commonwealth
University. These data are reported by all public institutions and can be found for VCU at
the webpage http://www.vcu.edu/cie/analysis/reports/sets.html and are summarized
in Table 5.1.
In general, we are going to create a contingency table that has the general form:
          Col 1   Col 2   Col 3   ...   Col c   Totals
Row 1     O_11    O_12    O_13    ...   O_1c    R_1
Row 2     O_21    O_22    O_23    ...   O_2c    R_2
...       ...     ...     ...     ...   ...     ...
Row r     O_r1    O_r2    O_r3    ...   O_rc    R_r
Totals    C_1     C_2     C_3     ...   C_c     N
with r rows of data and c columns. Each of the entries in the r x c contingency table (the
$O_{ij}$ values) is a count of the number of observations that were classified as belonging
to the category in the $i^{th}$ row and the $j^{th}$ column. Above, when we looked at the $\chi^2$
test, it was a smaller version of this table, and the test statistic for analyses of general
contingency tables is the same as above:
$$\chi^2_{Calc} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
The only distinction here is that our expected values are based upon row and column
totals such that:
$$E_{ij} = \frac{R_i C_j}{N}$$
where $R_i$ and $C_j$ are the respective row and column totals.
There are two specific assumptions that are required to conduct a general contingency
table test such as this:
1. The N observations are drawn randomly from the larger population.
2. Each observation can be classified into exactly one of the possible r and c categories
according to single and independent criteria (e.g., there is no correlation between
the row and column variables).
Table 5.1: Diversity of enrolled undergraduate students at Virginia Commonwealth University in the College of
Humanities & Sciences between the academic years 1998-2008 as reported by the Center for Institutional Effectiveness
(http://www.vcu.edu/cie/analysis/reports/sets.html).
Group                                1998   1999   2000   2001   2002   2003   2004   2005   2006   2007   2008
Non-resident Aliens                   186    158    188    208    206    235    272    375    512    577    673
Black non-Hispanic                   2985   3094   3282   3332   3387   3456   3633   3797   3983   4158   4193
American Indian or Alaskan Native      91     80     83     86     90    113    109    116    124    131    131
Asian or Pacific Islander            1103   1139   1132   1175   1231   1437   1632   1764   1970   2148   2330
Hispanic                              279    305    362    400    449    521    559    623    709    761    822
White, non-Hispanic                  8688   8586   9013   9373   9916  10077  10757  11088  11180  11170  11202
Race/ethnicity unknown                  0    188    208    279    387    665    849    928   1019   1287   1642
Total                               13332  13550  14268  14853  15666  16504  17811  18691  19497  20232  20993
To demonstrate this analysis we will analyze the 1998, 2003 and 2008 enrollment data from
Table 5.1 to see if the diversity of students at VCU has changed over the last decade.
These data are present in a text file named VCUCommonData.csv in the folder for this Chapter.
It is loaded into R with the following commands.
> data <- read.table("VCUCommonData.csv", header=T, sep=" ")
> summary(data)
     Yr1998           Yr1999         Yr2000         Yr2001
 Min.   :   0.0   Min.   :  80   Min.   :  83   Min.   :  86.0
 1st Qu.: 138.5   1st Qu.: 173   1st Qu.: 198   1st Qu.: 243.5
 Median : 279.0   Median : 305   Median : 362   Median : 400.0
 Mean   :1904.6   Mean   :1936   Mean   :2038   Mean   :2121.9
 3rd Qu.:2044.0   3rd Qu.:2116   3rd Qu.:2207   3rd Qu.:2253.5
 Max.   :8688.0   Max.   :8586   Max.   :9013   Max.   :9373.0
     Yr2002           Yr2003          Yr2004            Yr2005
 Min.   :  90.0   Min.   :  113   Min.   :  109.0   Min.   :  116
 1st Qu.: 296.5   1st Qu.:  378   1st Qu.:  415.5   1st Qu.:  499
 Median : 449.0   Median :  665   Median :  849.0   Median :  928
 Mean   :2238.0   Mean   : 2358   Mean   : 2544.4   Mean   : 2670
 3rd Qu.:2309.0   3rd Qu.: 2446   3rd Qu.: 2632.5   3rd Qu.: 2780
 Max.   :9916.0   Max.   :10077   Max.   :10757.0   Max.   :11088
     Yr2006            Yr2007          Yr2008
 Min.   :  124.0   Min.   :  131   Min.   :  131.0
 1st Qu.:  610.5   1st Qu.:  669   1st Qu.:  747.5
 Median : 1019.0   Median : 1287   Median : 1642.0
 Mean   : 2785.3   Mean   : 2890   Mean   : 2999.0
 3rd Qu.: 2976.5   3rd Qu.: 3153   3rd Qu.: 3261.5
 Max.   :11180.0   Max.   :11170   Max.   :11202.0
Once the entire data set is loaded into R , we can extract only the values that we are
going to use.
> Obs <- as.matrix(cbind(data$Yr1998, data$Yr2003, data$Yr2008))
> Obs
     [,1]  [,2]  [,3]
[1,]  186   235   673
[2,] 2985  3456  4193
[3,]   91   113   131
[4,] 1103  1437  2330
[5,]  279   521   822
[6,] 8688 10077 11202
[7,]    0   665  1642
> colnames(Obs) <- c("1998", "2003", "2008")
> rownames(Obs) <- c("Non-resident Aliens", "Black non-Hispanic",
+   "American Indian or Alaskan Native", "Asian or Pacific Islander",
+   "Hispanic", "White, non-Hispanic", "Race/ethnicity unknown")
> Obs
                                  1998  2003  2008
Non-resident Aliens                186   235   673
Black non-Hispanic                2985  3456  4193
American Indian or Alaskan Native   91   113   131
Asian or Pacific Islander         1103  1437  2330
Hispanic                           279   521   822
White, non-Hispanic               8688 10077 11202
Race/ethnicity unknown               0   665  1642
With these data we will be specifically testing the hypothesis that across years there
are no differences in the relative distributions of self-identified racial and ethnic groups.
In some texts, this (7 x 3) contingency test is called a χ² Test for Independence, and in R
it is conducted using the chisq.test() function. To begin with, we can plot the categories as a barplot
(see Section 8.2.1 for how to make these plots yourself) as represented in Figure 5.1.
Figure 5.1: Undergraduate diversity at Virginia Commonwealth University during academic years
1998, 2003, & 2008.
> test1 <- chisq.test(Obs)
> test1
        Pearson's Chi-squared test
data:  Obs
X-squared = 1704.417, df = 12, p-value < 2.2e-16
> summary(test1)
          Length Class  Mode
statistic  1     -none- numeric
parameter  1     -none- numeric
p.value    1     -none- numeric
method     1     -none- character
data.name  1     -none- character
observed  21     -none- numeric
expected  21     -none- numeric
residuals 21     -none- numeric
Notice here that I actually assigned the results of the statistical test to the variable
test1. I did this because there are many reasons why you may be interested in looking
at various aspects of the analysis. By printing the contents of the test itself, we see that
the calculated statistic χ²_Calc = 1704.417, which with (r − 1)(c − 1) = 6 × 2 = 12 df produces
a very small P-value. If you look back at Figure 4.2, our observed value is way out to
the right, with a very small likelihood that you would get a value this large if the row
and column variables were independent.
As the output of summary(test1) shows, the analysis itself returns a list that
has all the components as list items. There are a lot of different reasons why you may be
interested in using various components of the analysis. For example, you may want to
create a table of the observed or expected values, or you may need to run this test a large
number of times and store the results.
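As a minimal sketch of how those list items can be used, the code below rebuilds the observed matrix from the 1998, 2003, and 2008 columns of Table 5.1 and pulls individual components out of the result with the $ operator. The name byHand is mine and not part of the original analysis; it only illustrates that the expected counts are (row total × column total) / N.

```r
# Observed counts for 1998, 2003 and 2008 (copied from Table 5.1)
Obs <- matrix(c( 186,   235,   673,
                2985,  3456,  4193,
                  91,   113,   131,
                1103,  1437,  2330,
                 279,   521,   822,
                8688, 10077, 11202,
                   0,   665,  1642),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("1998", "2003", "2008")))

test1 <- chisq.test(Obs)

test1$statistic   # the calculated chi-squared value
test1$parameter   # degrees of freedom, (r - 1) * (c - 1)
test1$p.value     # the P-value as a plain number

# Expected counts under independence: (row total * column total) / N
# (byHand is a hypothetical helper name used only for this illustration)
byHand <- outer(rowSums(Obs), colSums(Obs)) / sum(Obs)
all.equal(unname(test1$expected), unname(byHand))
```

Storing the pieces you need (for example test1$p.value) is usually more convenient than re-reading printed output when the test has to be repeated many times.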
Caveats
There are some caveats that need to be made with respect to the general use of contingency
tables. First, they are very robust as long as you have a moderate number of samples
in each of the cells. The test statistic we have been using, χ²_Calc with (r − 1)(c − 1) df, is
actually an approximation that is good only when every cell is well represented. If the values in the
cells are small then the Type I error (the α value) found with this approximation
is poorly estimated. OK, but what is moderate? Here are some general guidelines:¹
1. If any of the E_ij estimates are less than 1 the approximation will be poor.
2. If more than 20% of the E_ij values are less than 5 then the approximation will be poor.
So what do you do if you have some small expected values? First, you can try to collapse
some of your row or column categories and recalculate. Whether this can be done without
making the analysis meaningless really depends upon your knowledge of the biology of
the system.
Second, you can try to use Fisher's Exact Test. This uses combinatorial theory to estimate
the probabilities of the test statistic rather than asymptotic assumptions. This is
an excellent choice but has the problem that, since it uses combinatorial theory, at some
point you will have to perform an operation like N!, and when N > 170 the computer
cannot calculate a number that large. There is also the restriction that the product of the
row marginals (the R_i values in the table) must be strictly less than 2^31 − 1, but the N < 170
rule is a bit easier to remember.
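The guidelines above can be checked directly from the expected counts that chisq.test returns. The sketch below uses a made-up 2 x 2 table (the object names small and E and the counts themselves are assumptions for illustration only), then falls back on fisher.test, and finally shows the factorial limit mentioned above.

```r
# A hypothetical 2 x 2 table with small counts (made-up numbers)
small <- matrix(c(3, 1,
                  2, 6), nrow = 2, byrow = TRUE)

# chisq.test() warns when the approximation is doubtful, so the warning is
# silenced here just to get at the expected counts
E <- suppressWarnings(chisq.test(small))$expected

any(E < 1)          # guideline 1: is any expected count below 1?
mean(E < 5) > 0.2   # guideline 2: are more than 20% of them below 5?

# Fisher's Exact Test does not rely on the chi-squared approximation
fisher.test(small)$p.value

# The factorial limit
factorial(170)      # still representable
factorial(171)      # overflows to Inf
```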
5.2 Paired Observations
Analyses in this section will be concerned with data that is collected in a pair-wise
fashion (e.g., for each observation, there are two values collected).
¹ These guidelines are a bit on the conservative side and you may want to see a text on non-parametric
statistics for a more complete discussion of how far you can stray from these and still not get laughed at.
5.2.1 Rank Correlation
In Section 4.4.2 we looked at how you use the cor.test function to get a parametric estimate of the
correlation between two sets of variables. This is possible as well using a non-parametric
approach by adopting a ranking methodology. Non-parametric correlation methods include
Spearman's ρ and Kendall's τ, among others, but the interface in R is identical (and
the same as we already saw for the Pearson product moment correlation), so I will only
cover the Spearman approach and leave you to look into the differences.
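Spearman's correlation statistic, ρ, is calculated as:
\[
\rho = \frac{\sum_{i=1}^{N} R[X_i]\,R[Y_i] - N\left(\frac{N+1}{2}\right)^{2}}
{\left(\sum_{i=1}^{N} R[X_i]^{2} - N\left(\frac{N+1}{2}\right)^{2}\right)^{1/2}
 \left(\sum_{i=1}^{N} R[Y_i]^{2} - N\left(\frac{N+1}{2}\right)^{2}\right)^{1/2}}
\tag{5.3}
\]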
where the term R[X_i] is the rank of the i-th element in X. These ranks are computed
in comparison to the other values in X. For example, R[X_i] = 1 marks the smallest value of X,
R[X_i] = 2 the second smallest, etc. So what is being done here is that we are
replacing the actual values of the variables by their relative ranks.
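To make Equation 5.3 concrete, here is a small sketch that computes ρ directly from the ranks and compares it with R's built-in estimate. The data values and the names RX, RY, m, and rhoByHand are made up for this illustration.

```r
# Made-up paired observations (illustration only)
X <- c(2.1, 3.4, 1.7, 5.9, 4.2, 6.8, 3.9)
Y <- c(10.5, 14.1, 9.8, 20.3, 13.0, 25.7, 15.2)

N  <- length(X)
RX <- rank(X)                 # R[X_i]
RY <- rank(Y)                 # R[Y_i]
m  <- N * ((N + 1) / 2)^2     # the N((N+1)/2)^2 term in Equation 5.3

rhoByHand <- (sum(RX * RY) - m) /
  (sqrt(sum(RX^2) - m) * sqrt(sum(RY^2) - m))

rhoByHand
cor(X, Y, method = "spearman")   # the built-in estimate
```

The two numbers agree because Equation 5.3 is simply the Pearson correlation applied to the ranks.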
Using the same data as in Section 4.4.2, you specify the use of the Spearman approach using
ranks by passing it as an additional option to the cor.test function.
> X <- 1:20
> Y <- c(17, 7, 12, 12, 4, 11, 10, 2, 35, 31, 34, 49, 27, 33, 45, 32, 36, 38, 58, 44)
> cor.test(X, Y, method="spearman")
        Spearman's rank correlation rho
data:  X and Y
S = 198, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.8511278
Notice here that the correlation is significant although the correlation statistic is a bit
smaller. There is some loss of information by putting the data into ranks rather than
using the raw values.
So why use this instead of the parametric approaches? Well, the calculation of Pearson's
r statistic depends upon the bivariate distribution of X and Y. If there is no known
joint distribution for these variables then the density function of r is undefined. What
does this mean to you? It means that if your data can be assumed to be normal then
go ahead and use the Pearson approach. However, if you cannot assume that they are
normal, or you know they are not, then a rank approach may be more appropriate.
For me, I consider the non-parametric approaches as appropriate for all data, whereas
the parametric ones are only good for a subset of the data that we encounter.
5.2.2 Wilcoxon Test
The Wilcoxon test is also known as the Mann-Whitney test and is a rank-based method
analogous to the two-sample t-test. This approach tests the null hypothesis that samples
drawn from two different populations are essentially the same (e.g., an observation is as
likely to have come from one population as from the other). Data here are drawn randomly from
two different treatments to see if the application of either produces a significant shift
in the values of one set of observations.
As was discussed for Spearman's ρ, samples will be ranked in increasing order for this
analysis. If the ranks in sample X tend to be generally larger or smaller than those of the
observations in Y then we can reject the null hypothesis H_O : X = Y. In general your
data should look like:
Treatment 1   Treatment 2
X_1           Y_1
X_2           Y_2
...           ...
X_n           Y_m
In this analysis, we do not assume that both X and Y have the same number of observations
and in general will consider X to have n observations while Y has m and denote
N = n + m. Samples are lumped together and assigned ranks based upon the combined
N observations. In the case of ties, where two or more samples have the exact same
value, it is recommended to assign the average rank to all the tied observations. Fortunately,
the internal R code takes care of this for us (and will provide warnings
when appropriate) so we can focus on our tasks and let R focus on the specifics.
Assumptions
The Wilcoxon test has the following assumptions:
1. Both sets of samples (the X and Y observations) are drawn randomly from each
population.
2. There is an expected mutual independence between the X and Y values as well.
3. The variables are at least ordinal.
The test statistic for this analysis is the sum of the ranks of the X variables:
\[
W = \sum_{i=1}^{n} R[X_i]
\]
If the observations in X and Y are drawn from a single population, as stated in the null
hypothesis, then the ranks of X should be, on average, no larger or smaller than the
ranks of Y. If the treatments are producing differences in either X or Y then
the test statistic will be unusually large or small given n and m.
To show how to conduct the Wilcoxon test, I will use the pine germination data that is in
the folder for this Chapter. These data are from my thesis and record the average germination
rates for offspring arrays of Pinus echinata families that were sampled in continuous
stands (CTRL), selectively cut stands (SEL), and stands where all the trees around P. echinata were
clear-cut (CLR). Here we will use the Wilcoxon test to see if there is a significant difference
in germination rates between the control (CTRL) and clear-cut (CLR) treatments. Here is
how to load the data into R and extract just the treatments of interest.
> pineData <- read.table("PineGerminationData.txt", header=T)
> summary(pineData)
      GERM          TRT
 Min.   :0.0000   CLR :15
 1st Qu.:0.1800   CTRL:23
 Median :0.3700   SEL :15
 Mean   :0.3625
 3rd Qu.:0.5700
 Max.   :0.9400
> X <- pineData$GERM[pineData$TRT=="CLR"]
> Y <- pineData$GERM[pineData$TRT=="CTRL"]
> length(X)
[1] 15
> length(Y)
[1] 23
> X
 [1] 0.67 0.64 0.94 0.40 0.01 0.45 0.58 0.00 0.80 0.81 0.21 0.36 0.82 0.35 0.41
> Y
 [1] 0.63 0.29 0.37 0.56 0.19 0.02 0.06 0.07 0.11 0.18 0.03 0.64 0.21 0.00 0.00
[16] 0.53 0.00 0.00 0.00 0.00 0.35 0.39 0.37
> mean(X)
[1] 0.4966667
> mean(Y)
[1] 0.2173913
> range(X)
[1] 0.00 0.94
> range(Y)
[1] 0.00 0.64
You can see that there are different numbers of samples in each treatment but that they
have overlapping ranges. To run the Wilcoxon test, use the function wilcox.test and pass it
the two variables.
> wilcox.test(X, Y)
        Wilcoxon rank sum test with continuity correction
data:  X and Y
W = 269.5, p-value = 0.003835
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(X, Y) : cannot compute exact p-value with ties
According to our test, the data in X and Y appear to be different. The test statistic,
W = 269.5, gives a P-value of 0.004. There is also a warning message that you should
be aware of. Apparently there were ties in the data, and this causes some problems
in calculating the exact significance of the statistic. These ties are for families that did not
produce any offspring. From a biological perspective, these are valid responses and you
would have to just live with the fact that ties existed, because throwing out all the 0.00
values changes the interpretation of what happened.
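The rank-sum statistic defined earlier can be checked by hand. In the sketch below (the names n, ranks, and rankSumX are mine), the CLR and CTRL values are copied from the output above. Note that the W printed by wilcox.test is the sum of the ranks of X minus its smallest possible value, n(n + 1)/2, so the two numbers differ by a constant.

```r
# Germination rates copied from the output above
X <- c(0.67, 0.64, 0.94, 0.40, 0.01, 0.45, 0.58, 0.00, 0.80, 0.81,
       0.21, 0.36, 0.82, 0.35, 0.41)                          # CLR
Y <- c(0.63, 0.29, 0.37, 0.56, 0.19, 0.02, 0.06, 0.07, 0.11, 0.18,
       0.03, 0.64, 0.21, 0.00, 0.00, 0.53, 0.00, 0.00, 0.00, 0.00,
       0.35, 0.39, 0.37)                                      # CTRL

n <- length(X)
ranks <- rank(c(X, Y))         # ranks in the combined sample; ties get the average rank
rankSumX <- sum(ranks[1:n])    # the sum of the ranks of X

rankSumX                       # the rank sum as defined in the text
rankSumX - n * (n + 1) / 2     # the value R reports as W

# The ties warning is silenced here; it is the same one shown above
suppressWarnings(wilcox.test(X, Y))$statistic
```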
In general, the Wilcoxon test is rather powerful in determining the equality of samples
drawn from two different populations. It is essentially the non-parametric version of the
normal t-test.² Situations where you may favor a Wilcoxon approach over the t-test are
when you have non-normal data or data with several outlier points.
5.3 Several Random Samples
The final section in this chapter is focused on data that are collected from multiple treatments.
In the previous discussion of the Wilcoxon test, the data had k = 2 treatments
and it was introduced as a rank-based analog of the t-test. Here we will introduce the
Kruskal-Wallis test, which allows for the analysis of k > 2 treatments, and we could again
consider it a rank-based analog of an analysis of variance (ANOVA) approach.
5.3.1 Kruskal-Wallis Tests
The Kruskal-Wallis test examines the differences among k different treatments using a
rank-based approach similar to that discussed for the Wilcoxon test. In fact, this test is
just an extension of the Wilcoxon test using the same sum of ranks approach.
Data for this test are not assumed to be of equal sizes. Each treatment may have a
different number of observations in it, with a total sample size of N = \sum_{i=1}^{k} n_i. You
should be able to make a list of your data by treatment such as:
Treatment 1   Treatment 2   ...   Treatment k
X_11          X_21          ...   X_k1
X_12          X_22          ...   X_k2
...           ...           ...   ...
X_1n_1        X_2n_2        ...   X_kn_k
The test statistic for this test is a χ² approximation with k − 1 degrees of freedom.
Assumptions
There are several assumptions associated with this test:
1. All samples are randomly drawn from their respective treatments.
2. Treatments are independent of each other.
3. The observations are at least ordinal in nature.
² Actually if you do a t-test on the ranks you will get essentially the same answer as the Wilcoxon; the
approaches differ mainly in how the data are encoded, raw or as ranks.
As an example using this analysis, we will examine the same Pinus echinata data set
that we used to demonstrate the Wilcoxon test. The default method for performing this
analysis looks like kruskal.test(x, g, ...) where the variable x is the raw data and g is
another variable that holds the groupings. In the code below I separate out the variables
and then pass them to the function with GerminationRates as the response, grouped by the
factor Treatment. I also conduct the analysis and assign it to the variable named germTest
so you can see that this analysis also returns a list of results.
> pineData <- read.table("PineGerminationData.txt", header=T)
> GerminationRates <- pineData$GERM
> Treatment <- as.factor(pineData$TRT)
> Treatment
 [1] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL
[16] CTRL CTRL CTRL CTRL CTRL CTRL CTRL CTRL SEL  SEL  SEL  SEL  SEL  SEL  SEL
[31] SEL  SEL  SEL  SEL  SEL  SEL  SEL  SEL  CLR  CLR  CLR  CLR  CLR  CLR  CLR
[46] CLR  CLR  CLR  CLR  CLR  CLR  CLR  CLR
Levels: CLR CTRL SEL
> GerminationRates
 [1] 0.630 0.290 0.370 0.560 0.190 0.020 0.060 0.070 0.110 0.180 0.030 0.640
[13] 0.210 0.000 0.000 0.530 0.000 0.000 0.000 0.000 0.350 0.390 0.370 0.580
[25] 0.490 0.450 0.380 0.510 0.570 0.240 0.290 0.620 0.520 0.200 0.240 0.615
[37] 0.760 0.300 0.670 0.640 0.940 0.400 0.010 0.450 0.580 0.000 0.800 0.810
[49] 0.210 0.360 0.820 0.350 0.410
> germTest <- kruskal.test(GerminationRates, Treatment)
> summary(germTest)
          Length Class  Mode
statistic 1      -none- numeric
parameter 1      -none- numeric
p.value   1      -none- numeric
method    1      -none- character
data.name 1      -none- character
> germTest
        Kruskal-Wallis rank sum test
data:  GerminationRates and Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893
When looking at the results of the test, we see that the estimated test statistic was
relatively large (P = 0.002), suggesting that it is unlikely that the three timber extraction
treatments have the same influence on the germination percentages.
5.4 The Formula Notation & Box Plots
If you look at the function signature for the kruskal.test (by typing ?kruskal.test into R ), you
can see several alternate ways you can pass your data to it.
kruskal.test              package:stats              R Documentation
Kruskal-Wallis Rank Sum Test
Description:
     Performs a Kruskal-Wallis rank sum test.
Usage:
     kruskal.test(x, ...)
     ## Default S3 method:
     kruskal.test(x, g, ...)
     ## S3 method for class 'formula':
     kruskal.test(formula, data, subset, na.action, ...)
When discussing the relationship between the raw germination data and the grouping
variable, I used the statement "...is a function of...". This is the formula notation
that is indicated in the last option for calling the kruskal.test function. In R you can often
use the formula notation to perform analyses and plots and here we will spend a little
bit of time on how that is done. In Chapter 6 you will use this notation quite a bit when
writing out linear models.
The formula notation in R consists of the response variable (or variables, which I'll call
Y), the predictor variable (or variables, which will be denoted as X), and the tilde sign (~)
showing the relationship. For example, a simple function would be denoted as Y ~ X,
stating that Y is a function of X. Using the formula notation for the kruskal.test would
look like:
> kruskal.test(GerminationRates ~ Treatment)
        Kruskal-Wallis rank sum test
data:  GerminationRates by Treatment
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893
Figure 5.2: Boxplot of Pinus echinata germination data partitioned by timber extraction treatment.
It is even possible (and perhaps better because we are rather lazy in our typing) to use
the formula notation with the variable names within a data.frame without having to make the
other variables (GerminationRates and Treatment). However, when you do this, you will have
to pass an additional parameter to the analysis function to tell it which data to look into
for those variable names. For example, with the pineData data set you can type:
> kruskal.test(GERM ~ TRT, data=pineData)
        Kruskal-Wallis rank sum test
data:  GERM by TRT
Kruskal-Wallis chi-squared = 12.539, df = 2, p-value = 0.001893
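Since the pine data file is not reproduced here, the following sketch uses a small made-up data frame (the name toy and its values are assumptions, with the same column names as pineData) to show that the default interface and the formula interface return the same test.

```r
# A small made-up data set with the same column names as pineData
toy <- data.frame(
  GERM = c(0.10, 0.25, 0.05, 0.30, 0.55, 0.40,
           0.62, 0.48, 0.80, 0.71, 0.90, 0.66),
  TRT  = factor(rep(c("CTRL", "SEL", "CLR"), each = 4))
)

viaDefault <- kruskal.test(toy$GERM, toy$TRT)        # the x, g interface
viaFormula <- kruskal.test(GERM ~ TRT, data = toy)   # the formula interface

viaDefault$statistic
viaFormula$statistic
viaFormula$parameter   # k - 1 degrees of freedom
```

The formula version tends to be easier to read because the response and the grouping variable are named in one place.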
Another common place to find the formula notation is in plotting. Thus far, we have
called scatter plots by the function plot(x,y). It is just as easy to call the plot as plot(y ~ x)
and you will get the same results if the variable x is a continuous variable. However, if
x is a categorical variable you will not get a normal scatter plot. What you will get is a
box plot as depicted in Figure 5.2, which was created by calling the function:³
> plot(GERM ~ TRT, data=pineData, xlab="Treatment", ylab="GerminationRate")
³ To adjust additional parameters on the box plots see the function bxp, which is the actual plotting
function that the plot function hands the data off to. You can adjust many other components of the plot
including notches, box colors, etc.
5.5 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
as.factor(x)  Coerces the data in x into a factor data type.
as.matrix(x)  Coerces the data in x into a matrix data type if possible.
binom.test(x,n,p)  Performs a binomial test to see if observing x occurrences of one category of data in n trials is consistent with the likelihood of it occurring with a frequency of p.
c(x,y)  The concatenate function that munges all the items together and returns them as a vector.
cbind(x,y,...)  Binds together the data in x, y, etc. by columns.
colnames(x)  Access the column names in the item x. This only works for matrices and data.frames.
cor.test(x,y)  Tests for a significant (e.g., ρ ≠ 0) correlation between x and y.
chisq.test(t)  Performs a χ² test on the values in the table t.
kruskal.test(x,g)  Performs the Kruskal-Wallis Rank Sum test for the data in x as partitioned into groups defined by g.
length(x)  Returns the length of x.
mean(x)  Returns the mean of the items in x.
range(x)  Returns a two-element vector containing the minimum and maximum values in x.
read.table()  Reads raw data into R.
rownames(x)  Access the row names in x. This only works for matrices or data.frames.
summary(x)  Returns a general summary of the data in x.
table(f)  This function takes the list of levels in the factor f and makes a table from it.
wilcox.test(x,y)  Performs the Wilcoxon Rank Sum Test on the variables in x and y.
5.6 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. Calculate the relative proportions of each group in the 1999 VCU data and use the
goodness of fit approach (as in Section 5.1.1) to see if the 2008 student class has the same
relative proportions as are predicted by the 1999 class.
2. Compare the freshman enrollment in the College of Humanities & Sciences
at VCU (from Table 5.1) during the 2006-2007 academic year for Degree-Seeking Undergraduates
to the three Universities listed below. Is the student diversity across
these institutions the same? These data sets are prepared each academic year by
each public institution and can be found by searching for "Common Data Sets" and
looking at "Enrollment & Persistence". Below are the places you can get this information
for three Universities in our region.
Auburn University https://oira.auburn.edu/cds/2006/sectionb.aspx
University of Virginia http://www.web.virginia.edu/IAAS/data catalog/institutional/cds/current/enrollment.htm
Virginia Tech http://www.ir.vt.edu/common ds 2006.htm
3. Use wilcox.test to see if the germination rates observed in the SEL and CLR treatments
are significantly different. Provide some interpretation of your results.
4. Load the data into R that is found in the file CornOutput.csv (Note: this data is tab-delimited
so you will have to adjust the separator you use in the read.table function).
These data represent the output in numbers of bushels per acre of corn with three different
fertilizer treatments. Create a density plot showing the distribution of bushels
yielded by each treatment.
5. Test the equality of the fertilizers in the data loaded from the last question using a
Kruskal-Wallis test. Interpret your results.
6. What are the inner-quartiles of the three fertilizer yields?
7. From a total of N = 15 students in this course, if 14 pass, is the probability of passing
this course equal to p = 0.65?
8. What does the optional parameter rescale.p change in the chisq.test function? Why would
you want to use this option?
9. Assume that you observed phenotypes in the following amounts: n_spots = 12 individuals
with spots, n_silky = 22 with silky fur, n_smooth = 15 smooth coated, and n_agouti = 8
agouti. Do these data fit the hypothesis that the probability of any one of these phenotypes
is equal?
10. Create three variables named First, Second, and Third and assign each of them
the value of runif(3). Now, create a bar plot of these data assuming that the first entry
in each data set represents Category A, the second Category B, and the third Category C.
Make it look something like Figure 5.1 with the Categories used as the partitioning
variable along the x-axis. Feel free to provide your own colors.
Chapter 6
Linear Models
This chapter focuses on the analysis of linear models in R. The term linear model is a
general one that will be used a bit loosely. In general, a linear model is one that can be
written down in the form:
y = x
Some variable, or set of variables, y, are predicted to have a particular relationship with
some predictor variable (or variables) denoted in x. In the simplest case when both x
and y are continuous variables, the analysis is called a regression analysis, if x has
more than one predictor variable then it is called a multiple regression, and if y is binary
it is a logistic regression. However, if the predictor variable is categorical the model
is called an analysis of variance with many variants depending upon the number and
relationship of categorical predictor variables in x. Finally, if predictor variables consist
of categorical and continuous variables then it is called an analysis of covariance. There
are many different ways of introducing these different kinds of analysis but we are going
to focus on the functional form and the kinds of variables that make up the predictor
x.
In this Chapter you will learn the following skills:
• Learn to analyze data using a simple regression approach.
• Be able to incrementally build a multiple regression model using Type III sums of squares.
• Perform an analysis of variance (ANOVA) for both 1-way and factorial models.
6.1 The t-test
6.1.1 One-Sample Tests
The first linear model we will deal with is the t-test. The functional form of this is:
y = μ
where we believe that the observations sampled in y have some particular mean value
and the variation around that mean value is simply the natural variation there is in the
kind of samples we are measuring. The function that performs the one-sample t-test in
R is (not surprisingly) called t.test and has the following options available to it.
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
For a one-sample test, we will pass the response variable and a value for the parameter
mu to the function. By default, it will test the null hypothesis H_O : y = μ (the mu in
the signature) using a two.sided alternate hypothesis. This means that we can reject
the null if y < μ or if y > μ, using an α/2 rejection region in each tail. If you have reason to believe
that the observations are supposed to increase or decrease over some particular value,
something along the lines of, say, the addition of fertilizer should increase yield, then
you should be using a one-tailed test instead, which only examines an α-sized region at one
end.
In the data below, we are testing the hypothesis H_O : y = 15 with the given data.
> Y <- c(19,25,14,15,24,17,19,27,29,25)
> test1 <- t.test(Y, mu=15)
> summary(test1)
            Length Class  Mode
statistic   1      -none- numeric
parameter   1      -none- numeric
p.value     1      -none- numeric
conf.int    2      -none- numeric
estimate    1      -none- numeric
null.value  1      -none- numeric
alternative 1      -none- character
method      1      -none- character
data.name   1      -none- character
> print(test1)
        One Sample t-test
data:  Y
t = 3.8523, df = 9, p-value = 0.003892
alternative hypothesis: true mean is not equal to 15
95 percent confidence interval:
 17.64182 25.15818
sample estimates:
mean of x
     21.4
You can see that I assigned the results of the analysis to the variable named test1. Just
as in the contingency tables examples (Sections 5.1.3 & 5.2.2) the results of an analysis are a
list containing all the parameters that were used to perform the analysis as well as
intermediary materials and results. Of particular mention are the parameters p.value,
conf.int, and statistic. Overall, the analysis found that we can reject the null hypothesis
H_O : y = 15 with a P-value of 0.004. This is fairly good support for the notion that the
mean of these observations is not equal to 15.
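As a quick sketch of how those components can be used (the names byHand and oneSided are mine), the code below extracts them from test1, recomputes the t statistic from its definition, and runs the one-tailed version of the same test with the alternative option from the function signature.

```r
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

test1 <- t.test(Y, mu = 15)
test1$statistic   # the t value
test1$p.value     # the two-sided P-value
test1$conf.int    # the 95% confidence interval around the mean

# The same statistic from its definition: (mean - mu) / standard error
byHand <- (mean(Y) - 15) / (sd(Y) / sqrt(length(Y)))
byHand

# A one-tailed test of the alternative that the mean is greater than 15
oneSided <- t.test(Y, mu = 15, alternative = "greater")$p.value
oneSided
```

Because the observed t is positive, the one-tailed P-value here is half of the two-sided one.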
6.1.2 Paired Tests
The t-test can also be used in a paired fashion. This analysis consists of two sets of
variables, X and Y, whose observations are taken in pairs, and the null hypothesis is
that the differences between them are negligible. For example, perhaps you think that
parasite load has influenced the development of young warblers so you measure the
lengths of the primary feathers. Overall the null hypothesis for this is H_O : X = Y.
Another way to write this hypothesis is H_O : (X − Y) = 0, in which case this becomes
identical to the one-sample test. An example of this in R (with entirely contrived data)
would be:
> X <- round(runif(10, min=12, max=20))
> Y <- round(runif(10, min=12, max=20))
> X
 [1] 12 18 18 13 14 15 15 16 17 19
> Y
 [1] 14 17 20 13 17 12 16 17 17 15
> t.test(X, Y, paired=T)
        Paired t-test
data:  X and Y
t = -0.1416, df = 9, p-value = 0.8905
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.697808  1.497808
sample estimates:
mean of the differences
                   -0.1
Notice that since these are paired, they must be taken from the same experimental unit,
which is why we added the paired=T option to the parameters we passed to t.test.
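The claim that the paired test is identical to a one-sample test on the differences is easy to check. This sketch types in the contrived X and Y values printed above (rather than drawing new random ones) and compares the two calls; the names paired and oneSample are mine.

```r
# The contrived values printed above
X <- c(12, 18, 18, 13, 14, 15, 15, 16, 17, 19)
Y <- c(14, 17, 20, 13, 17, 12, 16, 17, 17, 15)

paired    <- t.test(X, Y, paired = TRUE)   # the paired test
oneSample <- t.test(X - Y, mu = 0)         # one-sample test on the differences

paired$statistic
oneSample$statistic
paired$p.value
oneSample$p.value
```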
6.2 Regression With A Single Variable
A linear regression seeks to see if the values in the response variable y can be predicted
to change systematically with the predictor variable x. The general form of a regression
model is:
\[
y_{ij} = \beta_0 + \beta_1 x_i + e_j
\]
where the response variable y_ij is hypothesized to be a function of three independent
components:
1. The intercept, β_0.
2. A slope coefficient, β_1, that determines at what rate y changes with changes in x.
3. The error term, e_j, which is the latent variation that every observed value has around the
predicted regression line.
The methods by which the parameters β_0 and β_1 are estimated are varied. The most
common approach is the least squares approach, which tries to find estimates for these
two parameters that minimize the sum of squared error terms (e.g., \sum_{i=1}^{N} e_i^2). In R we
can use the function lm to construct the linear model. Here is an example data set with
the values plotted in Figure 6.1.
Figure 6.1: Plot of single variable regression values.
> X <- 1:10
> X
 [1]  1  2  3  4  5  6  7  8  9 10
> Y <- c(19,25,14,15,24,17,19,27,29,25)
> Y
 [1] 19 25 14 15 24 17 19 27 29 25
> plot(Y ~ X, xlab="X", ylab="Y", bty="n", col="red", pch=19, ylim=c(0,30), xlim=c(0,10))
To plot these, I used the functional form (see Section 5.4 for a discussion of how this works)
with Y ~ X, set the labels, the plot colors, the ranges of the x- and y-axes, and the
plot characters with the pch option.¹ By eye-balling the image, do you think there is a
relationship between these variables?
> fit1 <- lm(Y ~ X)
> fit1
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept)            X
    16.3333       0.9212
¹ To see all the different characters that you can use as plot symbols type plot(1:25,pch=1:25) and it will
plot each symbol along the x = y line.
I start by assigning the result of the analysis to the variable fit1. Printing the contents
of the analysis shows that the intercept term (the β_0) has been estimated to be 16.333
whereas the slope term (above we called it β_1; R labels it with the name of the predictor
variable, here X) is 0.92. So for each unit increment of X, there is almost a one-unit increase
in Y (OK since the points do kinda point upwards). But is this significant? You can have
a non-zero estimate for a non-significant relationship. To see a slightly more detailed
printout of the components in fit1 use the summary function.
> summary(fit1)
Call:
lm(formula = Y ~ X)
Residuals:
   Min     1Q Median     3Q    Max
-5.097 -4.591  0.600  3.238  6.824
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.3333     3.2258   5.063 0.000973
X             0.9212     0.5199   1.772 0.114348
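As a check on the output above, the least squares estimates can be reproduced by hand. This is only a minimal sketch using the same X and Y as in the text; the names Sxy, Sxx, b0, b1, and pSlope are introduced here purely for illustration.

```r
# Same example data as in the text
X <- 1:10
Y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Closed-form least squares estimates for a single predictor
Sxy <- sum((X - mean(X)) * (Y - mean(Y)))   # cross-products of deviations
Sxx <- sum((X - mean(X))^2)                 # sum of squares for X
b1  <- Sxy / Sxx                            # slope
b0  <- mean(Y) - b1 * mean(X)               # intercept

# Compare the hand calculation with lm()
fit1 <- lm(Y ~ X)
coef(fit1)
c(b0, b1)

# The coefficient table from summary() is a matrix, so the p-value
# for the slope can be pulled out by row and column name
pSlope <- coef(summary(fit1))["X", "Pr(>|t|)"]
pSlope
```

Pulling the p-value out of the coefficient table this way is handy when the result is needed later in a script rather than just read off the screen.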
Here you have up to k different predictor variables, each of which contributes to the
observed value in y. When approaching a multiple regression,
The null hypothesis for a multiple regression is $H_O: \beta_i = 0;\ \forall i$ and states that
all the beta regression terms are zero. To address this hypothesis, we build a linear model and
then determine how much of the observed variation can be explained by the model.
In R we can use the same lm function as for a single predictor regression but this time
we need to change how we put the function equation into it to accommodate two variables.
For this example, we can use the data shown in Table 6.3.
 i       Y     X1    X2
 1    4.26   1.00  0.89
 2   30.74   2.00  0.41
 3   14.95   3.00  0.72
 4   -5.55   4.00  0.20
 5   21.29   5.00  0.40
 6   33.49   6.00  0.37
 7   32.15   7.00  0.61
 8   45.95   8.00  0.09
 9   38.94   9.00  0.74
10   48.27  10.00  0.69
These values can be put into R as:
> Y <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
> X1 <- 1:10
> X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)
> cbind(Y, X1, X2)
          Y X1   X2
 [1,]  4.26  1 0.88
 [2,] 30.74  2 0.41
 [3,] 14.95  3 0.72
 [4,] -5.55  4 0.19
 [5,] 21.29  5 0.40
 [6,] 33.49  6 0.37
 [7,] 32.15  7 0.61
 [8,] 45.95  8 0.09
 [9,] 38.94  9 0.74
[10,] 48.27 10 0.68
And then we can create a linear model using the notation lm( Y ~ X1 + X2 ).
> fit2 <- lm(Y ~ X1 + X2)
> summary(fit2)
Call:
lm(formula = Y ~ X1 + X2)
Residuals:
     Min       1Q   Median       3Q      Max
-24.8394  -2.7430  -0.8989   4.1369  20.0461
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.170     12.801   0.091   0.9297
X1             4.460      1.422   3.137   0.0164
X2             1.473     16.763   0.088   0.9324
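The null hypothesis that all the regression terms are zero is addressed by the overall F test, which summary() computes but which is easy to miss. The following is a hedged sketch of how that test and the explained variation might be pulled out of the fitted model; the names fstat and pOverall are illustrative only.

```r
# Data from the text (note that the fourth Y value is negative)
Y  <- c(4.26, 30.74, 14.95, -5.55, 21.29, 33.49, 32.15, 45.95, 38.94, 48.27)
X1 <- 1:10
X2 <- c(0.88, 0.41, 0.72, 0.19, 0.40, 0.37, 0.61, 0.09, 0.74, 0.68)

fit2 <- lm(Y ~ X1 + X2)

# summary() stores the overall F statistic with its degrees of freedom
fstat <- summary(fit2)$fstatistic
fstat

# Upper-tail probability of the F distribution gives the p-value
# for the null hypothesis that all slope terms are zero
pOverall <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
               lower.tail = FALSE)
pOverall

# Proportion of the observed variation in Y explained by the model
summary(fit2)$r.squared
```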
$\mu_{Control} = \mu_{Selective} = \mu_{ClearCut}$).
> pineData <- read.table("PineGerminationData.txt", header=T)
> anova1 <- aov(GERM ~ TRT, data=pineData)
> anova1
Call:
   aov(formula = GERM ~ TRT, data = pineData)
Terms:
                      TRT Residuals
Sum of Squares  0.8717943 2.6520868
Deg. of Freedom         2        50
Residual standard error: 0.2303079
Estimated effects may be unbalanced
> anova(anova1)
Analysis of Variance Table
Response: GERM
          Df  Sum Sq Mean Sq F value    Pr(>F)
TRT        2 0.87179 0.43590   8.218 0.0008207
Residuals 50 2.65209 0.05304
Figure 6.5: Boxplot of germination percentages for Pinus echinata as a function of treatment. A
colored rug was added to the right side to show the actual values within treatments (see rug).
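The entries in the analysis of variance table are related by simple arithmetic, which the following sketch works through using the sums of squares and degrees of freedom printed above. The variable names are made up for this illustration and the numbers are copied from the printout, not recomputed from the data file.

```r
# Values copied from the aov() printout in the text
ssTrt   <- 0.8717943; dfTrt   <- 2
ssResid <- 2.6520868; dfResid <- 50

msTrt   <- ssTrt / dfTrt        # mean square for treatment
msResid <- ssResid / dfResid    # mean square for residuals
Fval    <- msTrt / msResid      # F ratio
pval    <- pf(Fval, dfTrt, dfResid, lower.tail = FALSE)

round(c(MeanSqTRT = msTrt, MeanSqResid = msResid, F = Fval), 5)
pval
```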
2. Load the data set ClutchSizes.csv from the file. Using a paired t-test, test the hypothesis
$H_O$: There is no difference in reproductive output between habitat types.
3. Load the data file, SingleRegresssion.RData, into R (use the load function). Fit the
regression model, Y ~ X. Is it significant? Show the regression equation and the anova table.
4. Plot the regression model from the previous example and indicate the fitted regression
line with a dotted red line in the plot.
5. From the single regression model, add the regression equation to the graph indicating
the coefficients that were estimated.
6. Does a plot of the residuals as a function of the predicted values from the estimated
regression model suggest that the model is appropriate?
7. Load the data set MultipleRegression.RData from the file; it will contain a data frame
named multReg. Use the variables in this data frame, Y, X1, X2, X3, to fit a multiple
regression model. Show the summary and the anova table in your results. What is the
predicted regression equation?
8. Fit another model to the multReg data that has all the interaction terms amongst the
X predictor variables. Use the anova procedure to see which of these models is more
appropriate.
9. Load the data file VarroaCounts.RData; it will be a data frame named BeeData. These
data represent counts of the parasite Varroa destructor, a common pest of domesticated
honey bees. Test the hypothesis using an analysis of variance that there is no
difference in mite counts between the different lines of bees.
10. Perform the TukeyHSD test on the parasite data from the previous question.
Chapter 7
Working With Images
In this chapter, you will focus on the following topics:
Gain a basic understanding of open image formats
Learn how to import image data into R
Manipulate image data at the pixel level.
7.1 Image Data
There are several different methods available to you for importing image data into R.
As I was writing this document over Winter break and updating it in the fall, the main
image processing library for R, rimage, was broken and caused a few problems
when installed. I am sure it will be fixed in the near future and recommend that you
look at that library when you next have the need to do some image manipulation because
it has a lot of functionality. However, at present, it is not going to be used. The
consequence of not having rimage is that importing jpeg, tiff, and bmp
image formats appears to be beyond our grasp. Lucky for us, there are a ton of other image
formats out there and we can easily convert the image shown in Figure 11.1 into another format
and use it just as easily. Perhaps when I update this manuscript the next time around,
I'll change this section. I think it is also important that you understand the internal
workings of images and, for right now, these simpler image formats will serve our
purposes nicely; everything you learn here will be easily transferable to those other
image formats when you need to deal with them in the future.
7.1.1 PNM Image Format
Images on computers have specific formats in which the color information and other
meta data are stored in the file. Some of the formats are relatively easy to use and can be
manipulated directly in a text editor. Others are more of a pain and some are owned by
a company that has patented the way the information is stored in the file and you
have to pay royalties to them to view it. For example, the ubiquitous GIF image format
uses an algorithm that was patented and owned by a company and if you were to write
a viewer for it in some countries you would have to pay a royalty to use it... Lame.
The PNM image format (short for portable anymap) is an open format for the exchange
of image information. Actually, there are three different formats that fall under the PNM
specification as detailed below.
Portable Bitmap Format (PBM)
This format stores bitmap images. A bitmap can be thought of as an image whose pixels
are either turned on or off (say black and white). The representation of a PBM file can be
given as a simple text file with the extension .pbm. An example text file for a bitmap
that encodes the uppercase letter R would be:
P1
# This is an example bit map file r.pbm
5 8
1 1 1 1 0
1 0 0 0 1
1 0 0 0 1
1 0 0 0 1
1 1 1 1 0
1 0 0 1 0
1 0 0 0 1
1 0 0 0 1
In this file, the first line is a special code that identifies the format and so tells the
computer how many bits per pixel to use. The second line is a comment line that you can put
anything you like into (but it has to start with the # character). The third line tells how
many columns and rows of data the image has. Note that the first number is the number of
columns and the second number is the number of rows, which is the opposite of the order we use
(rows first) in R for interacting with matrices of data. The rest of the file consists of the
actual bit matrix where 1 represents a pixel that is turned on and 0 represents a pixel that is
turned off. The image represented in this file is given in Figure 7.1.
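To make the file layout concrete, here is a minimal sketch that writes the r.pbm text shown above to a temporary file and reads it back into a matrix using only base R functions. The variable names are made up for this illustration, and the parsing assumes a plain-text file laid out exactly like the example (one header item per line).

```r
# The r.pbm file from the text, written to a temporary location
pbmLines <- c("P1",
              "# This is an example bit map file r.pbm",
              "5 8",
              "1 1 1 1 0",
              "1 0 0 0 1",
              "1 0 0 0 1",
              "1 0 0 0 1",
              "1 1 1 1 0",
              "1 0 0 1 0",
              "1 0 0 0 1",
              "1 0 0 0 1")
pbmFile <- tempfile(fileext = ".pbm")
writeLines(pbmLines, pbmFile)

# Read it back, dropping comment lines that start with '#'
txt   <- readLines(pbmFile)
txt   <- txt[!grepl("^#", txt)]
magic <- txt[1]                                   # format code, "P1"
dims  <- as.integer(strsplit(txt[2], " +")[[1]])  # columns first, then rows
bits  <- as.integer(unlist(strsplit(txt[-(1:2)], " +")))

# The file lists pixels row by row, so fill the matrix by row
r <- matrix(bits, nrow = dims[2], ncol = dims[1], byrow = TRUE)
r
```

Filling by row matters here because R fills matrices column by column unless told otherwise.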
You can make this kind of image programmatically by creating the matrix in R and using the
image function. Here is an example creating the image of the letter T.
> x <- matrix(0, nrow=8, ncol=5)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    0    0    0    0
[3,]    0    0    0    0    0
[4,]    0    0    0    0    0
[5,]    0    0    0    0    0
[6,]    0    0    0    0    0
[7,]    0    0    0    0    0
[8,]    0    0    0    0    0
> x[1,] <- 1
> x[,3] <- 1
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1    1
[2,]    0    0    1    0    0
[3,]    0    0    1    0    0
[4,]    0    0    1    0    0
[5,]    0    0    1    0    0
[6,]    0    0    1    0    0
[7,]    0    0    1    0    0
[8,]    0    0    1    0    0
> colors <- c("black","grey")
> image(x, col=colors, axes=F)
Figure 7.1: The image represented in the r.pbm file. This image has been scaled up to make it
large enough to see it on the page using the program GIMP (www.gimp.org).
Here I created a matrix that had all 0 in it and set the top row and the middle column
equal to 1. Then the image function was used to plot it. The image function takes a number
of optional arguments and here I have supplied it the colors and the option to not show
the axes. Since I have two values in the matrix, a two element vector will be sufficient to
handle all the different colors. Figure 7.2 shows this matrix. There
seems to be a small problem with it in that it is rotated 90° counter-clockwise. This is
because the origin of the plot that is created by the image function is in the lower left-hand
corner. Conversely, most images that are stored on the computer (like the desktop image
in the background) assume that the origin is at the upper left-hand corner of the image.
Obviously these two do not mesh well together.
Portable Graymap Format (PGM)
This format is for graymap images where the term graymap refers to the lack of color
in the image. In terms of complexity, there is slightly more information contained in the
data file as each pixel is not either ON or OFF; rather, there is a percentage of ONNESS... (is
that a word?).
P2
# The PGM file for dog.pgm
24 7
5
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 0 0 0
0 1 0 0 0 0 1 0 0 5 0 0 0 0 0 5 0 0 4 0 0 4 4 0
0 1 0 0 0 1 0 0 0 0 5 0 0 0 5 0 0 0 4 0 0 0 4 0
0 1 1 1 1 0 0 0 0 0 0 5 5 5 0 0 0 0 0 4 4 4 0 0
Figure 7.2: A PBM file that was programmatically created in R. The image is rotated because of the
default location of the origin.
The first three lines of the file have the same layout as for the PBM format (with the code
P2 in place of P1). The fourth line in the file gives the maximum value, which represents the
most white in the image. In this case, a black pixel will be represented by the number 0,
white will be represented by 5, and values in between are $\frac{1}{5}$ increments of
whiteness. The remaining portions of the file have the actual image represented in a
pixel-by-pixel matrix of values. You can see that the majority of the image is $\frac{0}{5}$
(black) and the letters are varying shades of gray (Figure 7.3).
The number of shades of gray you use in a PGM file is up to you. The format itself allows a
maximum value of up to 65535, although 255 is the most common choice. These are easy files to
create and you could imagine how you could create a matrix of integers from some analysis and
save it as a pgm file and view it directly.
Figure 7.3: The image represented by the dog.pgm file. This image has been scaled up to make it
large enough to see it on the page using the program GIMP (www.gimp.org).
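Saving a matrix of integers as a PGM file can be sketched in a few lines of base R. This is one possible approach under simple assumptions (a small made-up matrix m, a temporary file, and the plain-text P2 layout described above), not the only way to do it.

```r
# A small matrix of integer grey levels (0 = black, 5 = lightest)
m <- matrix(c(0, 1, 2,
              3, 4, 5), nrow = 2, byrow = TRUE)

pgmFile <- tempfile(fileext = ".pgm")

# Header: format code, comment, "columns rows", and the maximum value
writeLines(c("P2",
             "# written from R",
             paste(ncol(m), nrow(m)),
             max(m)),
           pgmFile)

# write() emits values column by column, so transpose to get row order
write(t(m), file = pgmFile, ncolumns = ncol(m), append = TRUE)

cat(readLines(pgmFile), sep = "\n")
```

The transpose is the step that is easiest to forget: without it the rows and columns of the image would be scrambled.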
Portable Pixmap Format (PPM)
The last file format, PPM, is one that handles pixmaps, which means that you have colored
pixels in the image. The file format is identical to that of the PGM with the exception that
the code on the first line is P3, which represents 24 bits per pixel: 8 of which are for red,
8 for green, and 8 for blue. An example of the PPM file shown in Figure 7.4 is:
P3
# This image contains an image of my daughter Libbie (from Libbie.ppm).
180 240
255
188
219
253
189
220
252
In this file, the pixel values are placed one per line instead of next to each other. Starting
at line number 5 with a value of 188, the following 180 x 240 x 3 = 129,600 lines each contain
an integer whose value is between 0 and 255 (the maximum value for any color, as given on line
4). The values come in groups of three: the red, green, and blue values for the first pixel,
then the red, green, and blue values for the second pixel, and so on for all 180 x 240 = 43,200
pixels. Once the image is loaded into R, the 43,200 values for each color are held in separate
matrices, so when we begin looking at manipulating images you will find that you can interact
with each color channel independently.
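As a small worked example, and assuming the standard PPM layout in which each pixel is stored as three consecutive values (red, green, blue), the six values listed in the file excerpt above describe the first two pixels of the image. The variable names below are illustrative only.

```r
# The six values shown at the top of the pixel data in Libbie.ppm
vals   <- c(188, 219, 253, 189, 220, 252)
maxVal <- 255

# One row per pixel, one column per color channel
pixels <- matrix(vals, ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("red", "green", "blue")))
pixels

# Rescale the 0-255 values onto the [0, 1] range
round(pixels / maxVal, 7)
```

The first red value, 188/255, works out to about 0.7372549, which is the value reported for photo@red[1,1] later in this chapter.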
One drawback to these image formats is that they are not very efficient. For example,
the image of my daughter in Figure 7.4 has 129,604 lines of information in it, which on
my computer makes it 465K in size. The exact same image saved as a jpeg file is only
25K in size. The compression used to make jpeg, tiff, gif, png, and other compressed file
formats is why they are used on the internet. But for our purposes, the lack of compression
and the inefficiency in storage size are relatively irrelevant.
Figure 7.4: The image represented in the Libbie.ppm file. This image has been scaled up to make
it large enough to see it on the page using the program GIMP (www.gimp.org).
7.2 Loading The Image Into R
OK, now that we have covered the basics of how one kind of image is represented in the data
files, it is time to load one into R and see what we have to work with. To load a PNM file, you
must first load the pixmap library; then you can use the function read.pnm() to load the file
into a local variable and plot it using the plot() function.
> library(pixmap)
> photo <- read.pnm(file="Libbie.ppm")
Read 129600 items
> plot(photo)
The plot () function will open a new image window and show the loaded image.
7.3 Components of A Pixmap
We can learn a little bit more about what kind of data type the variable we call photo is
by using the class() function.
> class(photo)
[1] "pixmapRGB"
attr(,"package")
[1] "pixmap"
> names(attributes(photo))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "red"      "green"
[8] "blue"     "class"
This variable is an object of the pixmapRGB class that comes from the pixmap package. A class
is a self-contained data structure that has both attributes and data. The command
names(attributes(photo)) tells us the names of the attributes that the variable has.
There are some issues that we should touch on when dealing with classes. Objects like this
one differ from what we have been using thus far, such as data frames, in that we cannot access
their contents using the $ notation. This is because lists and data frames keep their contents
as named elements, whereas formal (S4) classes such as pixmapRGB keep theirs in slots. To
access the slots of such a class we use the @ notation. For example:
> photo@size
[1] 240 180
> photo@channels
[1] "red"   "green" "blue"
> dim(photo@red)
[1] 240 180
> photo@red[1,1]
[1] 0.7372549
> range(photo@red)
[1] 0 1
Here we can get to the size, channels, and red components of the class directly. We can also
see that the red channel that determines the amount of redness in each pixel has been
standardized on the range [0, 1]. This is important to know if we are going to manipulate
the image directly.
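The @ notation is not specific to pixmaps. As a minimal sketch, the following defines a toy class named toyImage (a made-up name with made-up slots, not the real pixmap definition) so that the slot behavior can be tried without any image file at hand.

```r
library(methods)

# A toy formal class with two slots (not the real pixmap definition)
setClass("toyImage", representation(size = "integer", grey = "matrix"))

img <- new("toyImage",
           size = c(2L, 3L),
           grey = matrix(runif(6), nrow = 2, ncol = 3))

isS4(img)           # TRUE for objects of formal classes
slotNames(img)      # the names that can follow the @
img@size            # read a slot
dim(img@grey)

# Slots can be replaced with @ as well
img@grey <- img@grey / 2
range(img@grey)
```

The function slotNames() is a quick way to find out what can follow the @ for any object of this kind.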
7.4 Image Operations
7.4.1 Extracting Channels
So now we know enough to make some alterations to the image and see what happens. In
the next example, I first copy the photo to make three additional photos, named redPhoto,
bluePhoto, and greenPhoto. Then for each of the new variables I remove all the data in the
other two channels by making each of those channels contain a matrix of zeros the same
size as the original matrix.
> redPhoto <- photo
> bluePhoto <- photo
> greenPhoto <- photo
> redPhoto@size
[1] 240 180
> redPhoto@blue <- redPhoto@green <- matrix(0, nrow=240, ncol=180)
> bluePhoto@red <- bluePhoto@green <- matrix(0, nrow=240, ncol=180)
> greenPhoto@red <- greenPhoto@blue <- matrix(0, nrow=240, ncol=180)
> par(mfrow=c(1,4))
> plot(photo)
> plot(redPhoto)
> plot(greenPhoto)
> plot(bluePhoto)
Note that I used the sequential assignment A <- B <- C <- D as a shorthand here. This will
assign the value of D to the variable C, then C to B, and then B to A. This is a lazy trick but
one that you will probably use as it saves a bit of time and typing.
Then I make a 1x4 matrix of plots so that I can plot all four images in the same frame (see
?? for more on how this is done) and in each of the four slots, I plot one of the images
yielding a figure similar to what is presented in Figure 7.5.
Figure 7.5: The original image along with ones where only the red, green, and blue channel turned
on.
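The sequential assignment used in the channel example can be tried on its own. This small sketch uses two throwaway variables, a and b, and shows that each ends up with its own copy of the matrix.

```r
# The right-most value is assigned first, then passed along to the left
a <- b <- matrix(0, nrow = 2, ncol = 2)

# Each variable holds its own copy; changing one leaves the other alone
a[1, 1] <- 1
a
b
```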
In some cases, it is helpful if you can extract the color information and generalize the
image as a greyscale image (as you will in Chapter 11). Here we use the information
from each channel, weighted equally, in the creation of the image.
> gphoto <- pixmapGrey(photo@red+photo@blue+photo@green)
> plot(gphoto)
> names(attributes(gphoto))
[1] "size"     "cellres"  "bbox"     "bbcent"   "channels" "grey"     "class"
> gphoto@grey[1,1]
[1] 0.8627451
> range(gphoto@grey)
[1] 0 1
The function pixmapGrey() takes a matrix of data, of which we just use the element-wise
addition of each channel in the color photo. You can also see that in the creation of the
new grey image, the values were again standardized.
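The arithmetic behind the standardized grey values can be sketched with three pixels. The rescaling used below (subtract the minimum, divide by the range) is an assumption that happens to be consistent with the output shown above; it is not a statement of how pixmapGrey() is implemented. The three pixels are pure black, the first pixel of the photo, and pure white.

```r
# Three pixels as 0-255 values: black, the first pixel of the photo, white
redVals   <- c(0, 188, 255) / 255
greenVals <- c(0, 219, 255) / 255
blueVals  <- c(0, 253, 255) / 255

# The element-wise sum of the channels ranges over [0, 3]
total <- redVals + greenVals + blueVals
total

# Assumed min-max rescaling back onto [0, 1]
greyVals <- (total - min(total)) / (max(total) - min(total))
round(greyVals, 7)
```

The middle value comes out at about 0.8627451, matching gphoto@grey[1,1] above, which suggests that equal weighting of the three channels amounts to taking their average.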
For the moment, let's examine the contents of this grey image and play around with it a
bit. Let's make it a bit darker by shifting all the grey values down (to make it more black).
We can do this by performing operations on the matrix of grey values in the class. For
simplicity, I will make a copy of the image first and then perform operations on the copy
rather than the original one. Then we will look at the distribution of grey values that
make up the image.
> darkerGphoto <- gphoto
> darkerGphoto@grey <- darkerGphoto@grey / 2
> par(mfrow=c(1,3))
> plot(gphoto)
> hist(gphoto@grey, xlim=c(0,1), xlab="Grey", main="")
> plot(darkerGphoto)
We can see that the vast majority of values are towards the light end of the distribution.
To darken this up, we should scale these values to be closer to zero by dividing them by
2 and then replotting the image to see the result (see results in Figure 7.6).
7.5 Creating Images Programmatically
Images can be made programmatically once you understand how images are represented.
There are some helper functions that can help you in creating new images. For the
purposes of this section, we will focus on greyscale images and leave the analysis of
colored images for you to play with on your own time.
Let's start by making an image where each pixel is randomly assigned a greyscale value.
For convenience, I'll make it the same size as the photo named gphoto from 7.4.1.
> randomImageMatrix <- matrix(rnorm(240*180), nrow=240, ncol=180)
> gray <- grey(1:100/100)
> image(randomImageMatrix, col=gray)
Here I use the rnorm() function to create 240 x 180 = 43,200 random numbers in a matrix
that has 240 rows and 180 columns. I then use the grey() function to create 100 different
shades of grey ranging from nearly black to white at equal intervals. When the image is made,
the range of random numbers is used to divide the pixels into the 100 different grey
colors (e.g., the image() function scales the values in randomImageMatrix into length(gray)
distinct groups for plotting). The result is shown in Figure 7.7.
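The way values are divided among the shades can be approximated by hand. The sketch below assumes equal-width intervals across the observed range, which mirrors the description above but is an approximation rather than a copy of what image() does internally; the names z, shades, breaks, and bin are illustrative only.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable
z <- matrix(rnorm(240 * 180), nrow = 240, ncol = 180)
shades <- grey(1:100 / 100)

# Equal-width break points spanning the observed range of z
breaks <- seq(min(z), max(z), length.out = length(shades) + 1)

# Index of the shade each pixel would receive (1 = darkest, 100 = white)
bin <- findInterval(z, breaks, all.inside = TRUE)

range(bin)
shades[100]          # the shade given to the largest values
table(bin)[45:55]    # most pixels land in the middle shades
```

Because the random numbers are normally distributed, most pixels fall in the middle shades and only a handful are close to black or white.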
This image can be manipulated by changing the values in the matrix randomImageMatrix. In
the next example, I replace a block in the center (rows 100 to 140 and columns 70 to 110)
with white (which corresponds to the largest value in randomImageMatrix).
> randomImageMatrix[100:140,70:110] <- max(randomImageMatrix)
> image(randomImageMatrix, col=gray)
The result is shown in Figure 7.8 resembling a square doughnut (mmmm doughnuts...).
Figure 7.6: The greyscale translation of the PPM image, a histogram of the grey values and the
image resulting from reducing all the grey values in the image by half.
Figure 7.7: A random image.
Figure 7.8: A random image with a square doughnut hole in the middle.
7.6 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
cat() This function dumps the passed arguments out to the terminal.
grey(x) This function returns the grey color associated with the value of x. It is
assumed that $0 \le x \le 1$.
image(x) Can be used to create an image as either grey or colors for the values in the
matrix x.
max(x) Returns the maximum value contained in x.
rnorm(x) Returns x random numbers from a $N(\mu, \sigma)$ distribution.
7.7 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. Create a Portable Bitmap Format file (*.pbm) exactly like the one that is shown for the
letter R but make it represent the letter L.
2. Why is Figure 7.2 not right-side-up?
3. Make your L image correct by changing the values of the underlying matrix such that
when it is plotted using the image command it is in the correct orientation.
4. What is the purpose of the PX number on the first line of the PNM file formats?
5. Load your own copy of the image Libbie.ppm into R using the read.pnm function as
demonstrated in the Chapter. Create three copies of the image and for each copy
remove the values in one channel (e.g., make one of the color matrices a zero matrix). Plot
these images in a three-paned graphic using the par(mfrow=c(1,3)) option.
6. Replot the randomImageMatrix using a color palette instead of the grey palette shown.
(Hint: See ?rainbow for five of the stock palettes available to you.)
7. What is the default palette used in the image plot function?
8. What is the purpose of the optional argument bbox in the pixmapGrey function?
9. Create the greyscale version of the image shown in the leftmost box in Figure 7.6.
The grey channel is composed of greyscale values that must be in the interval [0, 1]. Can
you invert the colors in this image? (Hint: If you can't figure out how to do this, see
the footnote at the end of this sentence but only as a last resort.¹)
10. Why do you have to use the @ notation to access components of the pixmaps in this
chapter?
¹ Are you sure you want a hint? Take 1 minus the grey channel to make the values flipped in the
[0, 1] interval.
Chapter 8
Matrix Analysis
Matrices are used in a wide variety of biological studies. In this Chapter I will use the
example of stage-classified matrix models to introduce you to how matrix manipulation
operates in R. There are some issues that need to be addressed with respect to basic
operations on matrices that, if you haven't had a course on Matrix Algebra, you may not
fully appreciate.
In this chapter, you will focus on the following topics:
Understand matrix operations in R.
Create stage-classified matrix models.
8.1 Matrices In R
As shown in 2.4.9, a matrix is a fully recognized data type in R. In fact, R does a
wonderful job of working with matrices and is much faster at doing vector and matrix
operations directly than looping through matrices of values using a for()-loop (see 11.1
for a complete discussion of looping in R).
In specific terms for this Chapter, a matrix can be defined as a 2-dimensional object that
holds numeric values. Matrices can be created by hand using the matrix() function and
the elements within them can be accessed using the square bracket notation (e.g., X[i,j])
as:
> X <- matrix(0, nrow=4, ncol=4)
> X[1,2] <- 23
> X[1,4] <- 42
> X
     [,1] [,2] [,3] [,4]
[1,]    0   23    0   42
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0
You can also wrap the as.matrix() function around the read.table() function and read the
data from a matrix in a file into a variable directly. For a review of these two functions
see 2.4.9 and 3.1.2. In the online data sets for this chapter, there is a file called
ExampleMatrix.csv that was exported from a spreadsheet.
> A <- as.matrix(read.table("ExampleMatrix.csv", header=F, sep="\t"))
> A
V1 V2 V3 V4 V5 V6 V7 V8 V9
[ 1 , ] 0.00000 2.00000 2.00000 5.00000 4.00000 2.00000 7.00000 2.603310 2.000000
[ 2 , ] 2.00000 0.00000 4.00000 6.00000 3.00000 4.00000 7.00000 3.603310 4.000000
[ 3 , ] 2.00000 4.00000 0.00000 6.00000 4.00000 3.00000 7.00000 1.603310 1.000000
[ 4 , ] 5.00000 6.00000 6.00000 0.00000 3.00000 1.00000 1.00000 3.694210 6.000000
[ 5 , ] 4.00000 3.00000 4.00000 3.00000 0.00000 3.00000 4.00000 1.966940 4.000000
[ 6 , ] 2.00000 4.00000 3.00000 1.00000 3.00000 0.00000 2.00000 2.148760 3.000000
[ 7 , ] 7.00000 7.00000 7.00000 1.00000 4.00000 2.00000 0.00000 4.694210 7.000000
[ 8 , ] 2.60331 3.60331 1.60331 3.69421 1.96694 2.14876 4.69421 0.000000 0.603306
[ 9 , ] 2.00000 4.00000 1.00000 6.00000 4.00000 3.00000 7.00000 0.603306 0.000000
[ 10 , ] 4.00000 5.00000 4.00000 4.00000 4.00000 2.00000 3.00000 3.421490 4.000000
[ 11 , ] 3.00000 5.00000 3.00000 5.00000 6.00000 2.00000 4.00000 3.603310 3.000000
[ 12 , ] 3.00000 4.00000 3.00000 5.00000 3.00000 3.00000 6.00000 1.421490 2.000000
V10 V11 V12
[ 1 , ] 4.00000 3.00000 3.00000
[ 2 , ] 5.00000 5.00000 4.00000
[ 3 , ] 4.00000 3.00000 3.00000
[ 4 , ] 4.00000 5.00000 5.00000
[ 5 , ] 4.00000 6.00000 3.00000
[ 6 , ] 2.00000 2.00000 3.00000
[ 7 , ] 3.00000 4.00000 6.00000
[ 8 , ] 3.42149 3.60331 1.42149
[ 9 , ] 4.00000 3.00000 2.00000
[ 10 , ] 0.00000 1.00000 3.00000
[ 11 , ] 1.00000 0.00000 4.00000
[ 12 , ] 3.00000 4.00000 0.00000
There are a few things to notice here:
1. R wraps values for matrices so that only a portion of each row can be viewed at a
time.
2. The columns of data that were read from the file did not have a header row so R
assigned them the names V1 - V12. This is the default behavior.
3. If there is one value in the matrix that has a decimal portion to it, all the values will
be displayed with the same number of decimal places (e.g., compare the matrices X
and A from the two listings).
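If you do not have ExampleMatrix.csv at hand, the same steps can be tried on a small stand-in file. The sketch below writes a made-up 3 x 3 tab-delimited file to a temporary location and reads it back; the file contents and variable names are invented for this illustration.

```r
# Write a small tab-delimited file to stand in for ExampleMatrix.csv
tmp <- tempfile(fileext = ".csv")
writeLines(c("0\t2\t2.5", "2\t0\t4", "2.5\t4\t0"), tmp)

A <- as.matrix(read.table(tmp, header = FALSE, sep = "\t"))
A

dim(A)        # rows, then columns
colnames(A)   # default names V1, V2, V3

# Drop the default column names if they get in the way
colnames(A) <- NULL
A
```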
8.1.1 Matrix Arithmetic
Matrices have their own special kind of arithmetic that you may not be aware of, so here
is a very short course. For the following examples, I will be using the matrices X, Y,
and Z as defined by the R commands below. (For matrices I will use upper case bold letters for
variable names in the text to make it easier to distinguish them from non-matrix variables as
you read along. Obviously, this is not possible in R itself but for the text hopefully this
will make it easier to follow.)
> X <- matrix(1:9, nrow=3, byrow=TRUE)
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y <- matrix(9:1, nrow=3)
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> Z <- matrix(1:12, nrow=4)
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
One of the main things you have to pay attention to when dealing with matrices is the
number of rows and columns in the matrices. In these example matrices, X and Y are
square matrices (e.g., they have the same number of rows and columns) whereas Z is
not square as it has 4 rows and 3 columns of data. To access the number of rows and
columns in a matrix you can use the function dim().
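A quick sketch of checking dimensions is given here, using the same X and Z as above. The helper isSquare() is a made-up convenience function for this illustration and not part of R.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Z <- matrix(1:12, nrow = 4)

dim(X)    # rows, then columns
dim(Z)
nrow(Z)   # just the number of rows
ncol(Z)   # just the number of columns

# A matrix is square when both dimensions match
isSquare <- function(m) nrow(m) == ncol(m)
isSquare(X)
isSquare(Z)
```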
Scalar Addition & Subtraction
Matrices may be shifted by the addition or subtraction of a constant scalar value (e.g.,
2 + X). Scalar addition and subtraction take the value of the scalar and add it to every
element in the matrix.
> X
[ , 1] [ , 2] [ , 3]
[ 1 , ] 1 2 3
[ 2 , ] 4 5 6
[ 3 , ] 7 8 9
> X + 3
[ , 1] [ , 2] [ , 3]
[ 1 , ] 4 5 6
[ 2 , ] 7 8 9
[ 3 , ] 10 11 12
Matrix Addition & Subtraction
For both addition and subtraction of matrices, the numbers of rows and columns must
be identical. If they are, the addition and/or subtraction operation results in the
element-wise addition or subtraction of the matrices. In R you can use the normal addition (+)
and subtraction (-) operators as demonstrated below.
> X+Y
     [,1] [,2] [,3]
[1,]   10    8    6
[2,]   12   10    8
[3,]   14   12   10
But when they are not the same size, R will barf up an error message to you telling you
they are not amenable to this operation.
> X+Z
Error in X + Z : non-conformable arrays
Scalar Multiplication
The values within a matrix may be scaled by the multiplication of a scalar value (e.g.,
0.5 * X). Scalar multiplication results in every single element in the matrix being multiplied
by the scalar value. For example:
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X * 2
     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
[3,]   14   16   18
Element-wise Multiplication
It is possible to multiply two matrices where what you want is a new matrix that
is the element-wise product of the original matrices. This is sometimes called
the Hadamard product or the Schur product. In R this operation is conducted using
the regular multiplication character, *, between the two matrices. The result of this
operation is a new matrix with the same dimensions as the two original ones.
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X * Y
     [,1] [,2] [,3]
[1,]    9   12    9
[2,]   32   25   12
[3,]   49   32    9
Multiplication
Matrix multiplication is slightly more complicated than multiplication among scalars or
multiplying a scalar by a matrix. For example, in matrix multiplication, in general
$AB \neq BA$. This is because of the way that matrices are multiplied. Moreover, there are
several restrictions on which sets of matrices can be multiplied together.
For example, consider the operation $A = XY$ where the matrix X has $r_X$ rows and $c_X$
columns of data and the matrix Y has $r_Y$ rows and $c_Y$ columns of data. For this operation
to be defined, the number of columns in X, $c_X$, must equal the number of rows in Y (e.g.,
$c_X = r_Y$). If these are not equal, then you cannot perform the multiplication. Moreover,
the resulting matrix A will have $r_X$ rows and $c_Y$ columns. This is because the matrix
multiplication is conducted as:
$$A_{ij} = \sum_{k=1}^{N} X_{i,k} Y_{k,j}$$
where $N = c_X = r_Y$. Essentially, the elements of row i of X are multiplied against the
corresponding elements of column j of Y and the products are summed.
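The summation can be written out as a pair of loops to see exactly what is being computed. This is a teaching sketch only, using the X and Y matrices defined earlier; in practice you would use R's built-in matrix multiplication operator, %*%, which the last line uses as a check.

```r
X <- matrix(1:9, nrow = 3, byrow = TRUE)
Y <- matrix(9:1, nrow = 3)

# The result has as many rows as X and as many columns as Y
A <- matrix(0, nrow = nrow(X), ncol = ncol(Y))
for (i in 1:nrow(X)) {
  for (j in 1:ncol(Y)) {
    # Sum over k of X[i, k] * Y[k, j]
    A[i, j] <- sum(X[i, ] * Y[, j])
  }
}
A

# Matches R's built-in matrix multiplication operator
all(A == X %*% Y)
```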
In R matrix multiplication uses a unique operator that you probably haven't seen yet. To
indicate that you want two matrices to be multiplied (and not the Hadamard product as
above) you use the compound operator %*%. That is right, it is a pair of percent signs
surrounding the normal multiplication character (a.k.a. the asterisk). Two examples
using the matrices X and Y are given below. Notice how $XY \neq YX$.
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> Y
     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1
> X %*% Y
     [,1] [,2] [,3]
[1,]   46   28   10
[2,]  118   73   28
[3,]  190  118   46
> Y %*% X
     [,1] [,2] [,3]
[1,]   54   72   90
[2,]   42   57   72
[3,]   30   42   54
> X %*% I
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> X - (X %*% I)
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
> I %*% X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
Here X and Y are both square and have the same number of rows and columns
(e.g., the simplest case because we don't have to make sure the correct rows and columns
match). The identity matrix, I (see the discussion of the diagonal in 8.1.2), is shown here
with its groovy properties. Multiplication by the identity matrix is commutative and will
result in the original matrix: a kind of matrix version of the scalar multiplying by one.²
Here is an example using the matrices X and Z, which have different dimensions.
> Z
[ , 1] [ , 2] [ , 3]
[ 1 , ] 1 5 9
[ 2 , ] 2 6 10
[ 3 , ] 3 7 11
[ 4 , ] 4 8 12
> Z %% X
[ , 1] [ , 2] [ , 3]
[ 1 , ] 84 99 114
[ 2 , ] 96 114 132
[ 3 , ] 108 129 150
[ 4 , ] 120 144 168
> X %% Z
Error in X %% Z : nonconformable arguments
In the rst case, Z %%X is dened and provides a result because the number of columns
in Z match the number of rows in X. The reverse of this multiplication, X %%Z, is
undened and R tells you so.
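The dimension rule can be checked before multiplying. The following is only a minimal sketch: the helper name conformable() is hypothetical (it is not part of base R), and X and Z are rebuilt here so the block runs on its own.

```r
# Hypothetical helper (not part of base R): TRUE when X %*% Y is defined.
conformable <- function(X, Y) ncol(X) == nrow(Y)

X <- matrix(1:9, nrow = 3, byrow = TRUE)   # the 3 x 3 matrix X from the text
Z <- matrix(1:12, nrow = 4)                # the 4 x 3 matrix Z from the text

conformable(Z, X)   # Z has 3 columns and X has 3 rows
conformable(X, Z)   # X has 3 columns but Z has 4 rows

# The product, when defined, has nrow(Z) rows and ncol(X) columns.
if (conformable(Z, X)) {
  ZX <- Z %*% X
  print(dim(ZX))
}
```

Checking first avoids the error message and makes the expected dimensions of the result explicit.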
8.1.2 Matrix Operations
There are several other operations that can be conducted on matrices that you will
probably run across as you begin playing with matrices. Here is a smattering of them.
The Diagonal
It is often necessary to interact with the diagonal of a matrix, defined as the elements
whose row index is equal to their column index. For example, in a covariance matrix, the
diagonal elements are the variance estimates. In R you can get access to the diagonal of a
matrix by using the diag() function. Some examples include:
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> diag(X)
[1] 1 5 9
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> diag(Z)
[1]  1  6 11
[2] There are other matrices that have this property that are not as simple as this one and, if
you take some multivariate statistics, it will blow your mind how cool they are...
Notice how even for non-square matrices the diagonal is defined. You can also extract
and insert particular values for the diagonal as demonstrated below:
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> origDiag <- diag(X)
> origDiag
[1] 1 5 9
> diag(X) <- c(42, 23, 4)
> X
     [,1] [,2] [,3]
[1,]   42    2    3
[2,]    4   23    6
[3,]    7    8    4
> diag(X) <- origDiag
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
A commonly used matrix that can easily be constructed using the diag() function is the
Identity Matrix, whose symbol is I. This matrix has zeros everywhere except on the
diagonal, where every element is one:
> I <- matrix(0, nrow=3, ncol=3)
> diag(I) <- 1
> I
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
Finally, there is an operator called the trace of a matrix that is typically written as tr(A),
which is the sum of the diagonal elements. If A is a variance-covariance matrix, as is
commonly found in multivariate statistics, then its trace is the overall variance. In R we
can find the trace using a combination of the sum() and diag() functions as:
> X
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> sum(diag(X))
[1] 15
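The two diagonal idioms above can be wrapped up for reuse. This is a small sketch, not something the chapter defines: tr() is a hypothetical convenience wrapper (base R has no trace function), and diag(3) is the base R shortcut for building a 3 x 3 identity matrix in one call.

```r
# diag() with a single number n builds an n x n identity matrix directly.
I <- diag(3)

# Hypothetical convenience wrapper; base R has no tr() function.
tr <- function(A) sum(diag(A))

X <- matrix(1:9, nrow = 3, byrow = TRUE)
tr(X)              # 1 + 5 + 9
tr(I)              # an n x n identity matrix has trace n
all(X %*% I == X)  # multiplying by I leaves X unchanged
```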
Matrix Determinant
The determinant is a single scalar value calculated from a square matrix. The calculation of
the determinant is somewhat complicated when we get to matrices that have more than two rows
and columns and I'll let you go find a linear algebra book to look into it if you so desire.
For a 2 x 2 matrix, the determinant, denoted as |A|, is given as:

$$|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11} a_{22} - a_{12} a_{21}$$

In R the function det() is used to calculate the determinant of a matrix.
> X <- matrix(c(1, 6, 3, 4), nrow=2)
> X
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> det(X)
[1] -14
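As a check on the 2 x 2 formula, the determinant can be worked out by hand and compared with det(). The helper det2() below is hypothetical and only handles the 2 x 2 case; it is a sketch of the formula, not a replacement for det().

```r
# Hypothetical helper for the 2 x 2 case only: a11*a22 - a12*a21.
det2 <- function(A) {
  stopifnot(all(dim(A) == c(2, 2)))
  A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]
}

X <- matrix(c(1, 6, 3, 4), nrow = 2)
det2(X)                        # 1*4 - 3*6
all.equal(det2(X), det(X))     # agrees with det() up to floating point error
```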
Matrix Transpose
The transpose of a matrix is an operation that exchanges the row and column indices
of the elements. This will change the dimensions of the matrix if it is not square.
Notationally, you will see several different ways to represent a transpose such as $A'$ or
$A^T$.
In R the transpose operation is performed with the t() function.
> Z
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> t(Z)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> t(t(Z))
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
Notice that the transpose of a transpose is equal to the original matrix.
Matrix Inversion
For scalars, the inverse is defined as $x^{-1} = \frac{1}{x}$, but for matrices it is slightly
more complicated. There are even large groups of matrices that cannot be inverted. One property
that prevents inversion is if the matrix is singular (think black hole of mathematics, or
matrices that have a zero determinant).
A common use for matrix inversion is in the estimation of regression coefficients by least
squares. In Section 6.2, we used the lm() function to estimate the intercept and slope
coefficients. This can be done using matrix algebra and the inversion function ginv() found in
the MASS library. A one-column matrix of regression coefficients B is estimated from the formula:

$$B = (X'X)^{-1} X'Y$$

where the Y matrix is the normal matrix of response variables and the X matrix has a first
column of all ones (1) for the intercept and the remaining columns as the predictor variables.
> X <- matrix(c(rep(1, 10), 1:10), ncol=2)
> X
      [,1] [,2]
 [1,]    1    1
 [2,]    1    2
 [3,]    1    3
 [4,]    1    4
 [5,]    1    5
 [6,]    1    6
 [7,]    1    7
 [8,]    1    8
 [9,]    1    9
[10,]    1   10
> Y <- matrix(c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25))
> Y
      [,1]
 [1,]   19
 [2,]   25
 [3,]   14
 [4,]   15
 [5,]   24
 [6,]   17
 [7,]   19
 [8,]   27
 [9,]   29
[10,]   25
> library(MASS)
> ginv(t(X) %*% X) %*% (t(X) %*% Y)
           [,1]
[1,] 16.3333333
[2,]  0.9212121
> lm(Y ~ c(1:10))

Call:
lm(formula = Y ~ c(1:10))

Coefficients:
(Intercept)      c(1:10)
    16.3333       0.9212
You can see from the comparison that both lm() and the matrix multiplication/inversion
method produce the same estimates for the intercept and the slope coefficient. If you
were to center both variables, making Z <- Y - mean(Y) and also subtracting the mean of the
predictor from the predictor column, you could have the X matrix without the column for the
intercept ($\beta_0 = 0$) and you would get the same estimate for the slope coefficient,
$\beta_1$.
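The same least squares estimate can be sketched with base R alone. Note the substitution: this sketch uses solve() with two arguments instead of ginv() from MASS, which works here because X'X is not singular. The variable names x, y, xc, and yc are introduced only for illustration.

```r
# Data from the text: ten observations of a response on the predictor 1:10.
x <- 1:10
y <- c(19, 25, 14, 15, 24, 17, 19, 27, 29, 25)

# Design matrix with a leading column of ones for the intercept.
X <- cbind(1, x)

# solve(a, b) solves a %*% B = b, so no explicit inverse is formed.
B <- solve(t(X) %*% X, t(X) %*% y)
B

# Centering both variables removes the intercept; the slope is unchanged.
xc <- x - mean(x)
yc <- y - mean(y)
slope <- sum(xc * yc) / sum(xc^2)
slope
```

Centering only the response and leaving the predictor as 1:10 would give a different slope, which is why both variables are centered here.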
Eigen Decompositions
An eigenvalue/eigenvector decomposition is a magical property of matrices that can
only be appreciated with some experience in matrix algebra. However, we will be using
them in the next section so it seems there is a need to introduce them here. Start by
considering the square (k x k) matrix A and the identity matrix (I) in the characteristic
equation $|A - \lambda I| = 0$.
Using the matrix:
> A <- matrix(c(1, 6, 3, 4), nrow=2)
> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4
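The eigenvalues for the matrix are given by solving the characteristic formula:

$$
\begin{aligned}
0 &= |A - \lambda I| \qquad (8.1) \\
  &= \left| \begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} - \lambda \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right| \\
  &= \begin{vmatrix} 1 - \lambda & 3 \\ 6 & 4 - \lambda \end{vmatrix} \\
  &= (1 - \lambda)(4 - \lambda) - 18 \\
  &= \lambda^2 - 5\lambda - 14
\end{aligned}
$$

If we solve for $\lambda$ we see that the possible values are 7 and -2. These are called the
eigenvalues of the matrix A.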
Each eigenvalue has an associated eigenvector such that:

$$Ax = \lambda x$$

where x is a vector (i.e., a matrix with only one column) that is matched to each of
the k eigenvalues. The equation above is called the characteristic equation for the right
eigenvector; a left eigenvector also exists and has the form $x'A = \lambda x'$. From both of
these, we need to solve for x. Starting with the largest eigenvalue, $\lambda_1 = 7$, we have:

$$\begin{bmatrix} 1 & 3 \\ 6 & 4 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} = \lambda_1 \begin{bmatrix} e_1 \\ e_2 \end{bmatrix} \qquad (8.2)$$

If we multiply these out, we get the following equations:

$$
\begin{aligned}
1e_1 + 3e_2 &= 7e_1 \\
6e_1 + 4e_2 &= 7e_2
\end{aligned}
$$

And here we have two equations in two variables and can easily solve for the values
of $e_1$ and $e_2$; these values define the eigenvector $v_1 = [e_1, e_2]$ that is linked to the
eigenvalue $\lambda_1$. We can do the same for the second vector (which I will let you play with in
those boring weekend hours where you are wishing that you had some really cool math
problem to solve).
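The hand calculation above can be checked numerically. This is a sketch under two assumptions: the base R function polyroot() is used to find the roots of the characteristic polynomial, and the eigenvector is built with the arbitrary choice $e_1 = 1$.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)

# Roots of lambda^2 - 5*lambda - 14; polyroot() takes coefficients in
# increasing order of power and returns complex numbers.
lambda <- sort(Re(polyroot(c(-14, -5, 1))), decreasing = TRUE)
lambda

# For lambda = 7 the first equation gives 3*e2 = 6*e1, so e2 = 2*e1.
# Any nonzero choice of e1 works; here e1 = 1.
v1 <- c(1, 2)
A %*% v1      # should equal 7 * v1
7 * v1
```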
It is important to point out here that the values for $v_1$ can be scaled. As you look at the
equations above we can solve for the components and find that $e_1 = \frac{e_2}{2}$. There are a
lot of values for $e_1$ and $e_2$ that make this statement true. And if we think about the vector
$v_1 = [e_1, e_2]$ as a projection away from the origin a distance of $e_1$ on one axis and $e_2$
on a second orthogonal axis it may make a bit more sense. There are several vectors that will
point in a direction that will intersect the point $(e_1, e_2)$, all of which are the same except
for a scaling factor. This is graphically shown in Figure 8.1 with two vectors pointing
in the same direction but with different lengths.
Figure 8.1: Image depicting two vectors $v_{red} = [4, 2]$ and $v_{blue} = [2, 1]$ that are
projecting in the same direction but have different magnitudes.
The reason I bring this up is that it is common for routines that calculate vectors, such
as we are doing here for the eigenvector decomposition, to scale the vectors such that
their lengths are set to some normalizing constant such as 1. As a result, if you solve for
$v_1$ and then check it below with the eigen() function you may not get the same values, but
if you were to plot the vectors, the lines away from the origin would be pointing in the
same direction.
There are some interesting properties of eigenvalues and eigenvectors.
• If the original matrix is symmetric (actually non-negative semi-definite, but who's
  watching), the original matrix $A = \sum_{i=1}^{k} \lambda_i e_i e_i'$. This is called the
  spectral decomposition of the matrix A.
• The product of the eigenvalues is equal to the determinant of the original matrix
  (i.e., $\prod_{i=1}^{k} \lambda_i = |A|$).
• The sum of the eigenvalues is equal to the trace of the matrix (i.e.,
  $\sum_{i=1}^{k} n_i \lambda_i = \mathrm{tr}(A)$, where $n_i$ is the multiplicity of $\lambda_i$).
• If it is possible to invert A then the eigenvalues of $A^{-1}$ will be the inverse of the
  eigenvalues of A (i.e., they will be $\frac{1}{\lambda_i}$).
• The eigenvectors of A and $A^{-1}$ are identical.
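Several of these properties are easy to confirm numerically for the small A matrix. The following is a sketch only; it uses solve() with a single argument to invert A, and the name ev is introduced just for this example.

```r
A <- matrix(c(1, 6, 3, 4), nrow = 2)
ev <- eigen(A)$values

prod(ev)                 # compare with det(A)
det(A)
sum(ev)                  # compare with the trace, sum(diag(A))
sum(diag(A))

# solve(A) with a single argument returns the inverse of A.
sort(eigen(solve(A))$values)
sort(1 / ev)
```

Each pair of lines should print the same number (or vector) up to floating point error.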
R has an eigen() function that takes a square matrix and returns the eigenvalues and
eigenvectors as a list. Here is an example using our little friend the A matrix we touched
on above.
> A
     [,1] [,2]
[1,]    1    3
[2,]    6    4
> rootsOfA <- eigen(A)
> rootsOfA
$values
[1]  7 -2

$vectors
           [,1]       [,2]
[1,] -0.4472136 -0.7071068
[2,] -0.8944272  0.7071068
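Barring the possibility that I actually just copied and pasted the results from eigen() into
the discussion above on $v_1 = [e_1, e_2]$, the answer looks like it should: in the first
eigenvector the second element is twice the first, just as $e_1 = \frac{e_2}{2}$ requires. (The
overall sign of an eigenvector is arbitrary, so your output may differ from a hand calculation
by a factor of -1.)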
8.2 Stage-Classified Matrix Models
Stage-classified matrix models are concerned with understanding the processes that
influence the persistence of populations. These models tacitly assume that the continuum
of life histories for a species can be partitioned into discrete stages and that a census
of individuals in a population can be performed wherein we can tally the number of
individuals in each of these discrete stages. Some species lend themselves to
stage-classification better than others, and the question of how to go about defining stages
is best left to another course. Here we are going to introduce the notation of a matrix
model in R and then perform some analyses on these models. This Chapter is intended
only to whet your appetite a bit on matrix models; those who are interested should seek out
another course or at least read a good text such as Caswell (2001).
8.2.1 Transition Matrices & Census Vectors
For the sake of discussion, let's assume that we are working with a plant, Grenus growii,
that has the following four distinct life stages. Moreover, from our vast knowledge of this
organism, we have the accompanying information about the way this species proceeds through
life stages.
Seed: The seed stage lasts a single time step (i.e., there is no persistent seed bank) and
  only 50% of the seeds actually germinate; the others are either eaten or rot.
Seedling: The seedling stage is a non-reproductive stage. Herbivory removes 20% of the
  individuals that get into this stage and the remaining individuals become juveniles.
Juvenile: The juvenile stage is the first reproductive stage and on average each juvenile
  produces 1.3 offspring. Depending upon the habitat the juvenile is located in, half
  move on to the next stage and a quarter stay as juveniles. The remaining ones are eaten.
Adult: The final adult stage is where most of the reproduction happens, with each individual
  producing an average of 3.1 offspring. Half of the adults persist in the adult stage from
  one time step to the next.
A diagram of this fictitious species is shown in Figure 8.2.
Figure 8.2: A graphical depiction of the life history stages in the fictitious plant Grenus growii.
Here each of the spheres in this image represents a stage. The arrows between the stages
depict either fertility estimates (labeled $f_X$) when they point back to the seed stage, or
transitions (labeled $p_{XY}$, signifying the probability that an individual proceeds to stage
X from stage Y). From the description we have above, we can associate values with the
parameters in this life history diagram. In Table 8.1 I show the values for each of the
parameters listed.
These parameters can now be put into a transition matrix[3], A, that has a particularly
strict form.

$$A = \begin{bmatrix} f_1 & f_2 & f_3 & f_4 \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} \qquad (8.3)$$

[3] Actually this is not a transition matrix, as it does not sum to 1; rather it is a Leslie
matrix, but I think I can get away with generalizing the term a bit here.
Table 8.1: Table of life history values separated into (A) fertility estimates (the $f_X$ items)
and (B) transition probabilities depicting the movement between stages and within stages.

A. Fertility Estimates
Stage      Parameter   Value
Seed       $f_1$       0
Seedling   $f_2$       0
Juvenile   $f_3$       1.3
Adult      $f_4$       3.1

B. Transition Probabilities
Transition             Parameter   Value
Seed -> Seedling       $p_{21}$    0.5
Seedling -> Juvenile   $p_{32}$    0.8
Juvenile -> Adult      $p_{43}$    0.5
Juvenile -> Juvenile   $p_{33}$    0.25
Adult -> Adult         $p_{44}$    0.5
The items in the matrix are partitioned into two components: the top row records the
fecundity values, $f_X$, and the second and remaining rows depict the probabilities of
transition, $p_{XY}$. Inserting the observed values into this matrix gives us:

$$A = \begin{bmatrix} 0 & 0 & 1.3 & 3.1 \\ 0.5 & 0 & 0 & 0 \\ 0 & 0.8 & 0.25 & 0 \\ 0 & 0 & 0.5 & 0.5 \end{bmatrix} \qquad (8.4)$$
In R we can create this matrix using the following code:
> A <- matrix(0, nrow=4, ncol=4)
> A[1, 3] <- 1.3
> A[1, 4] <- 3.1
> A[2, 1] <- 0.5
> A[3, 2] <- 0.8
> A[3, 3] <- 0.25
> A[4, 3] <- 0.5
> A[4, 4] <- 0.5
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
The entries in this matrix have some rather special properties if we put the values into
it as directed.
Intrinsic Growth Rate
The Euler-Lotka integral equation for the instantaneous growth rate, r, is well known to
most biologists (...) and has the form:

$$1 = \int_0^{\infty} l(x)\, m(x)\, e^{-rx}\, dx$$

where the term l(x) is the fraction of reproductive individuals surviving to x, m(x) is the
fertility rate of individuals at x, and r is the growth rate. The r component here is the part
that we are interested in looking at because:

$$r = \begin{cases} < 1 : & \text{Population size decaying exponentially} \\ = 1 : & \text{Stable size through time} \\ > 1 : & \text{Population size increasing exponentially} \end{cases}$$
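We can provide an estimate of r using an eigenvalue decomposition of the transition
matrix A. Due to the way the matrix is set up, the largest non-imaginary eigenvalue of
the matrix ($\lambda_1$ as defined in Section 8.1.2) is equal to r. (Strictly speaking,
$\lambda_1$ is the per-time-step multiplication rate, which is why the thresholds above sit at 1;
many texts reserve r for $\ln \lambda_1$, whose threshold is 0.) So, once the matrix A is entered
into R, we can find the growth parameter as: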
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> eigen(A)
$values
[1]  1.2075472+0.0000000i -0.0067844+0.8194141i -0.0067844-0.8194141i
[4] -0.4439783+0.0000000i

$vectors
             [,1]                  [,2]                  [,3]          [,4]
[1,] 0.8603823+0i  0.7490103+0.0000000i  0.7490103+0.0000000i  0.4753001+0i
[2,] 0.3562521+0i -0.0037839-0.4570089i -0.0037839+0.4570089i -0.5352740+0i
[3,] 0.2976372+0i -0.4052283+0.1306829i -0.4052283-0.1306829i  0.6170499+0i
[4,] 0.2103303+0i  0.1682952+0.1431813i  0.1682952-0.1431813i -0.3268348+0i
Here we can see that $\lambda_1$ is not a complex number (the +0.0000000i part tells us that) even
though there are some complex eigenvalues (roots) of this matrix. Moreover, it suggests
that the overall behavior of this transition matrix is to increase overall population size
with an instantaneous rate of $r \approx 1.2$.
The particular value of $\lambda$ will determine the overall long-term behavior of the population.
Essentially, as time increases ($t : 0 \to \infty$), the impact of $\lambda$ is determined by
raising it to higher and higher powers. Figure 8.3 shows the projected impact on population
growth as a function of time for two values, $\lambda_{red} = 0.8$ and $\lambda_{blue} = 1.2$.
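The dominant eigenvalue can be pulled out as an ordinary number, and the effect of raising a value to higher and higher powers can be tabulated directly. This is a sketch: the names lambda1, steps, and growth are introduced here for illustration, and the matrix is re-entered so the block is self-contained.

```r
A <- matrix(c(0,   0,   1.30, 3.1,
              0.5, 0,   0,    0,
              0,   0.8, 0.25, 0,
              0,   0,   0.50, 0.5), nrow = 4, byrow = TRUE)

# eigen() sorts by decreasing modulus, so the dominant eigenvalue is first.
# Its imaginary part is zero here, so Re() just drops the "+0i".
lambda1 <- Re(eigen(A)$values[1])
lambda1

# lambda^t for t = 0..10 with the two values used in Figure 8.3.
steps <- 0:10
growth <- cbind(decay = 0.8^steps, growth = 1.2^steps)
round(growth[c(1, 6, 11), ], 3)   # t = 0, 5, and 10
```

Passing the two columns of growth to plot() and lines() would give curves of the kind the figure describes.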
Figure 8.3: Effects of the instantaneous growth rate as a function of time for both exponential
growth ($\lambda_{blue} = 1.2$) and exponential decay ($\lambda_{red} = 0.8$).
Stable Stage Distribution
The values in A also contain information on the relative proportion of individuals that
will be in each stage class as the population stabilizes into a steady state (either growing,
stable, or declining). This information is contained in the eigenvector that is associated
with $\lambda_1$. From the output above we see that:
> ssd <- as.numeric(eigen(A)$vectors[, 1])
> ssd
[1] 0.8603823 0.3562521 0.2976372 0.2103303
> sum(ssd)
[1] 1.724602
> ssd <- ssd / sum(ssd)
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> sum(ssd)
[1] 1
Here you see that the eigenvectors are scaled to unit length (i.e., t(e_i) %*% e_i = 1) as
mentioned above, which results in a total sum of the vector of sum(ssd) = 1.724602. If we are
interested in finding the proportion of the population that is in each stage then we need to
standardize the vector so that sum(ssd) = 1, and this is done by dividing every element by the
total. As a result, ssd suggests that at equilibrium there should be about 50% of the
individuals as seeds, 21% as seedlings, 17% as juveniles and 12% as adults.
We will return to these numbers and the estimate for r in the next subsection when we
iterate the data manually.
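The standardization steps can be gathered into one function. This is only a sketch: stable_stage() is a hypothetical helper name, and it uses Re() rather than as.numeric() to drop the zero imaginary parts.

```r
# Hypothetical helper: stable stage distribution of a projection matrix.
stable_stage <- function(A) {
  e <- eigen(A)
  # Pick the eigenvalue of largest modulus rather than relying on position.
  i <- which.max(Mod(e$values))
  v <- Re(e$vectors[, i])
  v / sum(v)          # dividing by the sum also cancels an overall sign flip
}

A <- matrix(c(0,   0,   1.30, 3.1,
              0.5, 0,   0,    0,
              0,   0.8, 0.25, 0,
              0,   0,   0.50, 0.5), nrow = 4, byrow = TRUE)
ssd <- stable_stage(A)
round(ssd, 4)
```

Dividing by the sum means the result is a set of proportions whether eigen() happens to return the eigenvector with all positive or all negative entries.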
Bar Plots
In the previous example, we determined the stable stage distribution to estimate the
proportion of the total population that is in each group. Graphically, this material could
be depicted as a bar graph and since we haven't covered how to make bar graphs yet,
this is as good a time as any...
There is an option in the normal plot() function, type="h", that will kind of plot bars of your
data to a figure. Actually, these are high-density lines and not real bar plots. This is
what I used to make Figure 4.2 and at that time it got the job done correctly, but a true
bar plot is something that looks a bit different than those lines.
R provides the function barplot() that takes a vector of heights and produces a general
barplot for you. Without modifications, the function barplot() does not produce a very
interesting plot in my opinion. However, there are several optional arguments that can
be used to create a more informative graphic. They include:
names.arg: a vector of names that you can have placed on the x-axis below the bars.
width: controls the width of the bars.
space: controls the amount of area between the bars, with a value of zero having the
  bars touch and positive numbers equal to that number of bar widths (e.g., space=2
  plots a bar and then 2 bar widths before the next bar shows up).
horiz: a logical flag that will plot the bars horizontally instead of vertically.
col: can be passed a single color or a vector of colors which are used to color the bars.
ylim: can adjust the limits of the y-axis as in normal plotting routines.
xlab & ylab: labels for the x- and y-axes.
Using the stable stage distribution associated with $\lambda_1$ in the previous section, we can
plot the data as shown in Figure 8.4.
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
> barplot(ssd)
> barplot(ssd, ylim=c(0, 1), xlab="Stage", ylab="Proportion of Individuals",
+   names.arg=c("Seed", "Seedling", "Juvenile", "Adult"), col=c("red", "blue", "green", "yellow"))
Figure 8.4: Examples of two different calls to the plotting function barplot(). The parameters
used to create these plots are given in the R code.
The barplot() function can also be used to create stacked graphs (Figure 8.5). To create
this example, I used the following code:
> x <- matrix(runif(9), nrow=3)
> x
          [,1]        [,2]      [,3]
[1,] 0.2355922 0.396869276 0.5674993
[2,] 0.7247734 0.001881527 0.9215767
[3,] 0.4625868 0.767329832 0.6408461
> barplot(x, names.arg=c("Control", "A", "B"), xlab="Treatments", ylab="Value",
+   legend=c("Category A", "Category B", "Category C"))
These stacked plots treat every column of data as a single bar and the order in which the
rows are presented is the order in which the stacking occurs. You can standardize the
plot so that all bars have the same height by dividing each column by that column's sum,
providing a proportional barplot.
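One way to do the column standardization is sketched below. It relies on the base R function prop.table(); the seed value and the name x_prop are arbitrary choices made for this illustration, and because runif() is random the matrix will not match the one printed above.

```r
set.seed(42)                      # arbitrary seed so the sketch is repeatable
x <- matrix(runif(9), nrow = 3)

# Divide every column by its own sum; margin = 2 means "by column".
x_prop <- prop.table(x, margin = 2)
colSums(x_prop)                   # each column now sums to 1

barplot(x_prop, names.arg = c("Control", "A", "B"),
        xlab = "Treatments", ylab = "Proportion",
        legend.text = c("Category A", "Category B", "Category C"))
```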
Figure 8.5: Example of a stacked bar plot with multiple categories represented in each Treatment.
8.2.2 Projecting Stage Sizes
In this matrix model we have been playing with, the census count of individuals in
each of the four stages can be represented by the vector n and in R as a matrix whose
dimensions are (4x1). Assuming that I start with 12 seeds, 34 seedlings, 21 juveniles, and
12 adults, the vector can be depicted as:
> n <- matrix(c(12, 34, 21, 12))
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12
Using this notation, we can predict what the number of individuals in the next time slice
will be given A and n as:

$$n_{t+1} = A n_t$$
> A
     [,1] [,2] [,3] [,4]
[1,]  0.0  0.0 1.30  3.1
[2,]  0.5  0.0 0.00  0.0
[3,]  0.0  0.8 0.25  0.0
[4,]  0.0  0.0 0.50  0.5
> n
     [,1]
[1,]   12
[2,]   34
[3,]   21
[4,]   12
> A %*% n
      [,1]
[1,] 64.50
[2,]  6.00
[3,] 32.45
[4,] 16.50
So after one generation, we can see that the number of seeds, juveniles, and adults all
increased but the number of seedlings decreased. If we look at the next time step, we
see that:

$$n_{t+2} = A n_{t+1} = A A n_t = A^2 n_t$$

And in general the vector of stage sizes at any arbitrary time step can be written as:

$$n_t = A^t n_0 \qquad (8.5)$$

Let's make a matrix of n values with columns 1-11 in R and calculate the number of
individuals in each stage for each time step. I use 11 here because the matrix starts counting
at column 1, which will correspond to our time t = 0, so when t = 10 the column will be
11. Let's also set the first column (our t = 0) equal to the census population size we were
using above.
> N <- matrix(0, nrow=4, ncol=11)
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     0
[3,]    0    0    0    0    0    0    0    0    0     0     0
[4,]    0    0    0    0    0    0    0    0    0     0     0
> N[, 1] <- n
> N
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12    0    0    0    0    0    0    0    0     0     0
[2,]   34    0    0    0    0    0    0    0    0     0     0
[3,]   21    0    0    0    0    0    0    0    0     0     0
[4,]   12    0    0    0    0    0    0    0    0     0     0
Now, for time steps 1-10 (and in the matrix N columns 2-11) we will use Equation 8.5 to
calculate the number of individuals in each group.
> t <- 1
> N[, (t+1)] <- A %*% N[, t]
> t <- t + 1
> N
     [,1]  [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50    0    0    0    0    0    0    0     0     0
[2,]   34  6.00    0    0    0    0    0    0    0     0     0
[3,]   21 32.45    0    0    0    0    0    0    0     0     0
[4,]   12 16.50    0    0    0    0    0    0    0     0     0
> t
[1] 2
OK, here I am going to do something that saves some typing (you can use the up cursor
key to repeat the last entry you typed in the R interpreter and I will use this to make my
life a bit easier). I have defined the variable t such that it indicates which column of the
matrix to read from, with the result stored in the next column (the (t+1) part). Multiplying
the previous column by A each time is equivalent to raising A to the power t in Equation 8.5.
Then I increment the variable t by one and redo it again and again until I've filled up the
columns of N.
In the following code examples, I show that you can use a semicolon (;) to put more than
one command on a line. Again, I combine the assignment of counts to the appropriate
column of N and then update the counter variable t each time through until all eleven
columns are full. In Chapter 11 you will learn how to use a loop to do this much more easily,
but until then using the up cursor key in the R interpreter is good enough.
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N
     [,1]  [,2]    [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]   12 64.50 93.3350    0    0    0    0    0    0     0     0
[2,]   34  6.00 32.2500    0    0    0    0    0    0     0     0
[3,]   21 32.45 12.9125    0    0    0    0    0    0     0     0
[4,]   12 16.50 24.4750    0    0    0    0    0    0     0     0
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N[, (t+1)] <- A %*% N[, t]; t <- t + 1
> N
     [,1]  [,2]    [,3]     [,4]     [,5]      [,6]      [,7]      [,8]
[1,]   12 64.50 93.3350 92.65875 95.68719 131.93725 168.77519 193.20372
[2,]   34  6.00 32.2500 46.66750 46.32937  47.84359  65.96862  84.38759
[3,]   21 32.45 12.9125 29.02813 44.59103  48.21126  50.32769  65.35682
[4,]   12 16.50 24.4750 18.69375 23.86094  34.22598  41.21862  45.77316
          [,9]     [,10]     [,11]
[1,] 226.86065 281.25553 343.80907
[2,]  96.60186 113.43032 140.62776
[3,]  83.84928  98.24381 115.30521
[4,]  55.56499  69.70713  83.97547
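For readers who already know loops, the repeated command can be written once. This is a preview sketch only, since the text defers loops to Chapter 11; the counter name step is chosen here to avoid reusing t, and A and N are rebuilt so the block runs on its own.

```r
A <- matrix(c(0,   0,   1.30, 3.1,
              0.5, 0,   0,    0,
              0,   0.8, 0.25, 0,
              0,   0,   0.50, 0.5), nrow = 4, byrow = TRUE)

N <- matrix(0, nrow = 4, ncol = 11)
N[, 1] <- c(12, 34, 21, 12)        # census at t = 0

# Column step + 1 holds the stage sizes one time step after column step.
for (step in 1:10) {
  N[, step + 1] <- A %*% N[, step]
}
round(N[, c(2, 11)], 2)            # t = 1 and t = 10
```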
That is a large number of values, so let's plot them to see what the stages do as we go
through 10 time steps. The code used to produce the image in Figure 8.6 is:
> ylim <- c(0, 350)   # shared y-axis range for all four lines (assumed value; not shown in the original)
> plot(1:11, N[1,], xlab="", ylab="", axes=F, bty="n", col="red", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[2,], xlab="", ylab="", axes=F, bty="n", col="blue", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[3,], xlab="", ylab="", axes=F, bty="n", col="green", ylim=ylim, type="l", lwd=2)
> par(new=T)
> plot(1:11, N[4,], xlab="t", ylab="Number of Individuals", axes=T, bty="n", col="pink",
+   ylim=ylim, type="l", lwd=2)
> legend(2, 350, c("Seed", "Seedling", "Juvenile", "Adult"), col=c("red", "blue", "green", "pink"),
+   lwd=2, bty="n")
I use par(new=T) to overlay the lines on a single graph (see Section 4.1.1 for more on this). I
also turn off the labels and axes for the first three plots because if you plot them over
and over again, they look too dark on the graphic (think printing the same line on top of
itself numerous times). On the last one, I set the labels for the axes and turn on the
axes. Also included is the code I used to add the legend to the image. See ?legend for a
complete discussion of the options that you can provide to this function.
Figure 8.6: Size of the four stage classes through time.
We can check some of the values that we estimated directly from A using the eigen
decomposition by looking at the numbers in the matrix N. First, the growth rate we
estimated from the first eigenvalue, $\lambda_1 \approx 1.2$, looks pretty close to that estimated
from the raw counts.
> eigen(A)$values[1]
[1] 1.207547+0i
> sum(N[, 11]) / sum(N[, 10])
[1] 1.215202
And the proportion of individuals in each class, which was estimated by standardizing the
first eigenvector, $v_1 = v_1 / \sum_{i=1}^{4} v_{1i}$, is pretty close to what we see in N (and I
throw in the first census so that you don't think I put values in there that were already
pretty close).
> N[, 1] / sum(N[, 1])
[1] 0.1518987 0.4303797 0.2658228 0.1518987
> N[, 11] / sum(N[, 11])
[1] 0.5028525 0.2056811 0.1686445 0.1228219
> ssd
[1] 0.4988875 0.2065706 0.1725831 0.1219587
If we were to iterate this a bit longer you would see that the brute force method of
getting the population growth rate and the stable stage distribution converges towards
what was estimated. In fact, Figure 8.7 shows the mean absolute deviation (MAD)
representing the differences between the distribution of individuals in each stage and
the predicted stable stage distribution (ssd) we calculated earlier. As you can see, it
approaches the expected values pretty quickly.
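A deviation curve of this kind can be sketched as follows. The exact calculation behind the figure is not given in the text, so this assumes MAD means the average of the four absolute differences between the observed stage proportions and ssd at each time step; the horizon of 30 steps and the names steps, mad, and n are arbitrary choices for illustration.

```r
A <- matrix(c(0,   0,   1.30, 3.1,
              0.5, 0,   0,    0,
              0,   0.8, 0.25, 0,
              0,   0,   0.50, 0.5), nrow = 4, byrow = TRUE)

# Stable stage distribution from the dominant eigenvector.
ssd <- Re(eigen(A)$vectors[, 1])
ssd <- ssd / sum(ssd)

steps <- 30                                   # arbitrary horizon
n <- c(12, 34, 21, 12)                        # census at t = 0
mad <- numeric(steps + 1)
for (i in 0:steps) {
  mad[i + 1] <- mean(abs(n / sum(n) - ssd))   # deviation at time i
  n <- as.numeric(A %*% n)                    # project one step ahead
}
round(mad[c(1, 11, 31)], 5)                   # t = 0, 10, and 30
```

Plotting mad against 0:steps with type="l" gives a curve that can be compared with the figure.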
Figure 8.7: Differences in estimated proportions of individuals in each stage from what was
expected through time.
8.3 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
%*%: Binary operator to perform matrix multiplication. An example would be X %*% Y.
as.matrix(x): Coerces the variable x into the data type matrix.
barplot(x): Creates a barplot of the values in x.
det(x): Calculates, if possible, the determinant of the matrix in x.
diag(x): Returns the diagonal (i.e., those entries whose row and column indices are
  equal) of the matrix in x.
dim(x): Returns the dimensions of the matrix x (i.e., the number of rows and columns).
eigen(x): Returns the eigenvalue/eigenvector pairs for the matrix in x as a list. Values
  are sorted in decreasing order (by modulus when they are complex) and vectors are scaled
  to unit length.
ginv(x): Attempts to calculate the generalized inverse of x.
legend(x,y,c): Creates a legend for the plot at the coordinates (x, y) with the entries
  in c.
matrix(x): Creates a new instance of the matrix data type from the values in x. You will
  probably need to specify nrow and ncol to set the proper size for the matrices.
read.table(x): Reads the file x into memory. See ?read.table for the copious amounts of
  additional parameters that may be needed as well as Chapter 3.
t(x): Returns the transpose of the matrix in x (i.e., reverses the row and column
  indices).
8.4 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. In considering the instantaneous growth rate r, it was mentioned that $\lambda_1 > 0$ and
this is what you will find in most cases. However, it is possible to get values of
$\lambda < 0$. For the following values of $\lambda$ make a graph of t vs. $\lambda^t$ as shown in
Figure 8.3 and describe the behavior of the population if these were the real values of r.
(a) $-1 < \lambda_1 < 0$.
(b) $\lambda_1 < -1$.
2. Create a matrix of random numbers using the runif() function and make a barplot of
the values. What happens when you pass the optional argument beside=T?
3. Standardize the columns of data in the matrix from the previous exercise so that the
sum of each column is equal to 1. Replot this using the function barplot() as done
for Figure 8.5 with the beside=F option. How does standardizing each column influence
the display of the plot?
Chapter 9
Working With Strings
While the majority of biological data is numeric in nature there are still several important
reasons to be able to manipulate character-based information. For example, you may be
downloading all the references from an online database such as WebOfScience and want
to mine the abstracts for metadata. You may also be interested in working with sequence
data, which consists of mostly text information. In this relatively short chapter we will
learn about how we can work with string data in R and look at a few examples using
genetic sequences.
In this chapter, you will focus on the following topics:
• Learn how to work with string data to perform tasks such as parsing, searching,
  and replacement.
• Learn how to access sequence-based data and pre-process it for importation into R.
• Learn how to create genetic distance matrices.
• Construct Neighbor-Joining trees and display them in R.
9.1 Parsing Text Data
At the most basic level you need to understand that a piece of character data in R is treated as a
single token, in the same way that an integer or numeric value is. For example,
consider the following code:
> x <- c("bob", "mary", "johnathan")
> length(x)
[1] 3
> x <- "George Stephen Sr."
> length(x)
[1] 1
> x <- c(1, 2, 3)
> length(x)
[1] 3
> x <- 3
> length(x)
[1] 1
9.1.1 Finding Lengths of Character Sequences
So R treats a value of the character data type as a single entry, regardless of how many letters it
contains. Once we understand this, the rest of this Chapter really begins to
take shape and make sense.
So, if R thinks that everything between a pair of quotes is a single instance of a
character data type, then how do we figure out how many letters are contained between
the quotes? The answer here is the function nchar().
> x <- "George Stephen Sr."
> nchar(x)
[1] 18
Another commonly used function for dealing with strings is the strsplit() function. This
function takes the string of characters that you are interested in splitting as well as the
character you want to split it on and returns the chunks as a list. This returning-as-a-list
behavior is kind of a pain in the butt, so at the same time I introduce this function I
will also show the unlist() function (see footnote 1).
> partsOfName <- unlist(strsplit(x, " "))
> partsOfName
[1] "George" "Stephen" "Sr."
> nchar(partsOfName)
[1] 6 7 3
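To see why unlist() earns its keep here, the following is a minimal sketch that compares what strsplit() hands back with and without it. The variable name pieces is only for illustration.

```r
# strsplit() returns a list with one element per input string
x <- "George Stephen Sr."
pieces <- strsplit(x, " ")
class(pieces)          # a list, not a character vector
length(pieces)         # 1, because x held a single string

# unlist() flattens that list into an ordinary character vector
partsOfName <- unlist(pieces)
class(partsOfName)
length(partsOfName)    # 3, one entry per word

# nchar() is vectorized, so it reports the length of every piece at once
nchar(partsOfName)
```

If you pass strsplit() a vector of several strings you get one list element per string, which is the reason it returns a list in the first place.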
Here is another example of how we may go about cycling through a set of words in
a phrase and doing some operation on them. The first sentence from the first chapter
of Darwin's The Origin Of Species is, "WHEN we look to the individuals of the same
variety or sub-variety of our older cultivated plants and animals, one of the first points
which strikes us, is, that they generally differ much more from each other, than do
the individuals of any one species or variety in a state of nature." While this is a very
interesting sentence, we are going to use it to show you how to break down the sentence
into an array of words and then tally the number of times each word is used.
We begin by making the sentence all lowercase and without punctuation because the
simple matching procedure would consider "When" different from "when", and the strsplit()
function will cut up the string on the spaces (that is what I will tell it to do).
> phrase <- paste("when we look to the individuals of the same variety or sub-variety of our older ",
+ "cultivated plants and animals one of the first points which strikes us is that they ",
+ "generally differ much more from each other than do the individuals of any one species or ",
+ "variety in a state of nature", sep="")
> wordList <- unlist(strsplit(phrase, " "))
> table(wordList)
wordList
           a         and     animals         any  cultivated      differ
           1           1           1           1           1           1
          do        each       first        from   generally          in
           1           1           1           1           1           1
 individuals          is        look        more        much      nature
           2           1           1           1           1           1
          of       older         one          or       other         our
           5           1           2           2           1           1
      plants      points        same     species       state     strikes
           1           1           1           1           1           1
 sub-variety        than        that         the        they          to
           1           1           1           4           1           1
          us     variety          we        when       which
           1           2           1           1           1
Footnote 1: This function takes a list and turns the items in it into a vector, which is easier to work with.
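Rather than retyping the sentence in lowercase by hand, the cleaning can be done in R as well. The following is a sketch only; it assumes that commas and periods are the only punctuation to remove, and the names raw, clean, and counts are mine.

```r
# The sentence with its original capitals and punctuation
raw <- paste("WHEN we look to the individuals of the same variety or sub-variety",
             "of our older cultivated plants and animals, one of the first points",
             "which strikes us, is, that they generally differ much more from each",
             "other, than do the individuals of any one species or variety in a",
             "state of nature.")

clean <- tolower(raw)                 # "WHEN" and "when" now match
clean <- gsub("[,.]", "", clean)      # drop commas and periods only
wordList <- unlist(strsplit(clean, " "))

# Sort the tally so the most frequently used words come first
counts <- sort(table(wordList), decreasing = TRUE)
head(counts, 4)
```

Sorting the table is a small design choice, but it saves you from scanning an alphabetical listing to find the common words.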
9.1.2 Extracting Substrings
It is not possible to use the normal subscripting approaches to access the individual
characters within strings because R treats the entire sequence of characters between
the quotation marks as a single item. However, you can extract internal components of
a string by using the substring() function.
> phrase <- paste("A Goat, that was sitting next to the gentleman in white, shut his eyes and said ",
+ "in a loud voice, 'She ought to know her way to the ticket-office, even if she doesn't know ",
+ "her alphabet!'", sep="")
> substring(phrase, 34, 70)
[1] "the gentleman in white, shut his eyes"
> substring(phrase, 98)
[1] "'She ought to know her way to the ticket-office, even if she doesn't know her alphabet!'"
The function takes the string to be searched and the starting and ending locations in
the string and returns the characters in between. If you do not provide an ending
number, it will return all the characters up to the end. This is a shorthand way of saying
substring( phrase, x, nchar(phrase) ).
It is also possible to use vector notation in pulling out substrings by passing vectors to
the start and end arguments.
> startPositions <- c(34, 3, 58, 172, 67)
> endPositions <- c(36, 6, 61, 174, 70)
> substring(phrase, startPositions, endPositions)
[1] "the" "Goat" "shut" "her" "eyes"
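The vector form of substring() is handy for sequence data. As a small sketch (the 12-base sequence below is made up for illustration), here is one way to chop a string into consecutive three-letter codons:

```r
dna <- "ATGGCCTTTAAG"                      # made-up sequence, length divisible by 3
starts <- seq(1, nchar(dna), by = 3)       # start position of every codon
codons <- substring(dna, starts, starts + 2)
codons
```

Because both the start and end arguments are vectors of the same length, substring() returns one piece per pair of positions.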
9.1.3 Concatenating Strings
Vectors of character data can be concatenated to form a single long string. This is very
helpful in creating labels for graphs that have to include the value of a variable and
for times when you need to open a lot of data files that have a predictable file naming
scheme. In R, string concatenation is accomplished using the paste() function.
> stringVector <- substring(phrase, startPositions, endPositions)
> stringVector
[1] "the" "Goat" "shut" "her" "eyes"
> paste(stringVector, collapse=" ")
[1] "the Goat shut her eyes"
> paste(stringVector, collapse="|")
[1] "the|Goat|shut|her|eyes"
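The two uses mentioned above, file names and graph labels, might look something like the following sketch. The site names and the data_ prefix are hypothetical and only there to show the pattern.

```r
sites <- c("Richmond", "Petersburg", "Varina")   # hypothetical site names

# sep controls what goes between the pieces pasted together element by element
fileNames <- paste("data_", sites, ".csv", sep = "")
fileNames

# collapse then joins the resulting vector into one single string
paste(fileNames, collapse = ", ")

# A label that includes the value of a variable (default sep is one space)
n <- 25
plotTitle <- paste("Sample size: N =", n)
plotTitle
```

The distinction to keep in mind is that sep works across the arguments while collapse works down the resulting vector.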
9.1.4 Matching & Substitution
The final tasks we will look into in this section on string operations are matching and
substitution. There are a lot of times when you need to see if a particular set of
strings has a specific substring within it. This is the realm of matching and is primarily
accomplished by the functions grep() and regexpr(). This last function allows you to use
what are called Regular Expressions (RE) to scan through a string. While this is a very
powerful method for pattern matching and is something that you should know if you are going to do any
extensive work with strings, I am not going to cover it in this Chapter. In
fact, it probably needs its own chapter and perhaps in a future version of this text I will
include it. For those of you who work with string data on a regular basis, look up the
regexpr function and have at it; it will make your life easier. For the rest of us, let's dig into
grep for a little light matching exercise.
The grep function takes a pattern that you are looking for and a string that you want to
look into. A simple example would be:
> x <- "The quick brown fox jumped over the candle stick"
> grep("fox", x)
[1] 1
> any(grep("fox", x))
[1] TRUE
> any(grep("o", x))
[1] TRUE
> any(grep("dog", x))
[1] FALSE
In general, the grep function returns the integer positions of the elements of its input that contain
the pattern. Here x holds a single string, so the result is 1 when the pattern is found and an empty
vector when it is not. I wrapped the grep function here inside the
any() function to turn that result into a single logical value: TRUE if there was at least one match
and FALSE otherwise.
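The behavior of grep() is easier to see when it is given a vector with more than one string. The sketch below uses a few species names as illustration; the value=TRUE argument is a standard grep() option that returns the matching strings rather than their positions.

```r
species <- c("Pinus taeda", "Abies alba", "Pinus echinata", "Larix decidua")

grep("Pinus", species)                 # positions of the matching elements
grep("Pinus", species, value = TRUE)   # the matching strings themselves
grep("Quercus", species)               # an empty integer vector: nothing matched

# Checking the length is a warning-free way to ask "did anything match?"
foundOak <- length(grep("Quercus", species)) > 0
foundOak
```

I use length() here instead of any() only because any() expects logical values and will grumble when handed integers.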
It is also possible to substitute values in a string with new items. There are two functions
that perform string substitutions, sub and gsub. Both of these functions take at least three
arguments:
1. A pattern to match,
2. The string to replace the matched pattern with, and
3. The string to search within.
The sub function replaces the first occurrence of the pattern whereas gsub replaces all of
them (the g stands for global).
> x <- "The quick brown fox jumped over the candle stick with all the king's men."
> sub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all the king's men."
> gsub("the", "THE", x)
[1] "The quick brown fox jumped over THE candle stick with all THE king's men."
> gsub("the", "THE", x, ignore.case=T)
[1] "THE quick brown fox jumped over THE candle stick with all THE king's men."
Both of these functions have optional arguments, the most common of which is the
ignore.case option, which controls whether the searching and replacing takes
the case of the letters into consideration when matching.
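Since we are headed toward genetic sequences, here is a small sketch of gsub() doing some biological work. The sequence is made up, and counting G and C by deleting every other letter is just one of several ways you could do it.

```r
dna <- "ATGCGCTTAA"                 # made-up sequence of 10 bases

# Transcription: every T becomes a U
rna <- gsub("T", "U", dna)
rna

# GC content: remove everything that is not G or C and measure what is left
gcOnly <- gsub("[^GC]", "", dna)
gcContent <- nchar(gcOnly) / nchar(dna)
gcContent
```

The pattern "[^GC]" is a tiny regular expression meaning "any single character other than G or C", which is about as much RE as we need for now.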
9.1.5 Slightly More In Depth Examples: Genetic Sequence Analyses
Genetic sequences are essentially long character strings and R has a few different libraries
available to you for the analysis of sequence data. I am not going to get into what
a genetic sequence is; if you do not already know about it then you probably should not
be calling yourself a biologist... In this section, we will:
1. Briefly discuss how we go about getting DNA sequence data
2. Learn how to align sequences
3. Import aligned sequence data into R
4. Create a distance matrix from the sequences
5. Use R to estimate a Neighbor-Joining tree from the sequence data
Getting DNA Sequence Data
The mother of all sequence repositories that you can access (without actually doing the
sequencing yourself) is the NCBI web database located at http://www.ncbi.nlm.nih.gov/
Here you can run database queries based upon taxa, genes, groups, or whatever. The
basic results of a search are given as an annotation (just below). This annotation has
three parts,
1. The meta data in the top section that contains the locus definition, size, who found
it, references and the taxonomy of the organism.
2. The FEATURES of the record that describe what is in the sequences (coding and
non-coding regions if known), some geographical and taxonomic information that
has been standardized (good for data mining and putting on a map) as well as the
translation of genetic sequence into amino acids if appropriate.
3. The ORIGIN which contains the raw sequence information.
An example of a record is given below
LOCUS       FJ347583                 278 bp    DNA     linear   INV 01-JUL-2009
DEFINITION  Araptus attenuatus haplotype 5 muscle protein 20 (MP20) gene,
            partial sequence.
ACCESSION   FJ347583
VERSION     FJ347583.1  GI:227345175
KEYWORDS    .
SOURCE      Araptus attenuatus
  ORGANISM  Araptus attenuatus
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Coleoptera; Polyphaga; Cucujiformia;
            Curculionidae; Scolytinae; Araptus.
REFERENCE   1  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Variable nuclear markers for a Sonoran Desert bark beetle, Araptus
            attenuatus Wood (Curculionidae: Scolytinae), with applications to
            related genera
  JOURNAL   Conserv. Genet. 10 (4), 1177-1179 (2009)
REFERENCE   2  (bases 1 to 278)
  AUTHORS   Garrick,R.C., Meadows,C.A., Nason,J.D., Cognato,A.I. and Dyer,R.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2008) Department of Biology, Virginia
            Commonwealth University, 1000 West Cary Street, Richmond, VA 23284,
            USA
FEATURES             Location/Qualifiers
     source          1..278
                     /organism="Araptus attenuatus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:634056"
                     /haplotype="5"
     gene            <1..>278
                     /gene="MP20"
                     /note="muscle protein 20; coding region not determined"
ORIGIN
        1 ctaaaatcaa cacttccgga ggacaattta aattcatgga aaacatcaac aagtaagaaa
       61 aaaataattt gacatgtaaa taatgtagag aaaattcata aacattccta ttttttattg
      121 atttgtcaat atttagtttg gaactaaact ctgacaatca attatacagg gtgacaattc
      181 taattacatt tccattcaat gccaactaga aatttcgtga aaaaaaaatt gtttctatgc
      241 caaacatact gttttataag atttaattcc agaaattt
//
Sequence Formats & Aligning Genetic Sequences
The format of sequence data like this is a bit verbose but very informative. When
we work with sequence data we will use an abbreviated file format, the FASTA format.
This format is very compact and, as a result, it is rather easy to
use. In general, FASTA files are simple text files that have blocks of information for each
sequence. Each block contains a summary line that must begin with the greater than
character (>) and can otherwise be anything you like. It is common to put the accession numbers,
locus identifier, taxonomy and other information into this line. The lines following the
summary line are the raw sequence. If you want to have more than a single taxon in a
file, you just put the next taxon block below the previous one and continue. In general
they look like this (this is an excerpt from an example data set that you have in the class
folder):
>Pinus caribaea var. hondurensis
GGTTCAAGTCCCTCTATCCCCACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATTCCATTG
GTTCGAATCCATTCTAATTTCTCGATTCTTTTACCTCGCTATTTTTTTTTTTTCATGAAGAGAAGAAATT
AGAACATGAATCTTTTCATCCATCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCA
ATTTATTTTGTGATATATGATCTACATAGAATAGATTAGATCNTTTTTAAATTATTCAATTGCAGTCCAT
TTTTATCATATTAGTGACTTCCAGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTT
TTACTTCTTTTTAGTTGACACAAGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGGATAG
CTCATTTGGTAAACCAAAGGACTGAAAATCCTCGTGTCACCAGTTCAAAT
>Pinus echinata
ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATTCCATTGGTTCGAATCCATTCTAATTTC
TCGATTCTTTTACCTCGCTATTTTTTTTTTTCATGAAGAGAAGAAATTAGAACATGAATCTTTTCATCCA
TCTTATGACAAGTTGAGTTGATCTGTTAATAAGTTGATCATATGATCAATTTATTTTGTGATATATGATC
TACATAGAATAGATTAGATCATTTTTAAATTATTCAATTGCAGTCCATTTTTATCATATTAGTGACTTCC
AGATCGAAAATAATAAAGATCATTCTAAAAACTAGTAAAAATACCTTTTTACTTCTTTTTAGTTGACACA
AGTTAAAACCCTGTACCAGGATGATCCACAGGGAAGAGCCGGGATAGCTCAGTTGGTAGAGCAGAGGACT
GAAAATC
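Because a FASTA file is just text, the string tools from earlier in this chapter are enough to pull one apart. The following is a rough sketch and not how the ape library does it: the two short records are made up, and with a real file you would fill fastaLines using readLines() on the file name instead of typing the lines in.

```r
# Two short made-up records standing in for the lines of a FASTA file
fastaLines <- c(">Taxon one", "ACGTAC", "GGTA",
                ">Taxon two", "ACGTTC", "GG")

isHeader <- substring(fastaLines, 1, 1) == ">"   # summary lines start with ">"
block <- cumsum(isHeader)                        # which record each line belongs to

# Join the sequence lines of each record into one long string
seqs <- tapply(fastaLines[!isHeader], block[!isHeader], paste, collapse = "")
seqs <- as.vector(seqs)
names(seqs) <- substring(fastaLines[isHeader], 2)  # drop the leading ">"

nchar(seqs)    # length of each sequence, labeled by taxon
```

This assumes every record has at least one sequence line; a more careful reader would check for that.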
When conducting analyses of genetic sequence data, it is important that you are confident
that all the sequences you have are of homologous portions of the genome. For the
example I used here, I downloaded some genetic sequence data for a handful of conifers
in the family Pinaceae from the NCBI website. The sequences I was looking for are from a common
inter-genic spacer region between the genes encoding tRNA-trnL and tRNA-trnF.
These sequences were between 390 and 470 base pairs in length and are in the file named
conifers.fasta in the folder for this chapter. I cleaned up the summary lines in this file so
it only has the genus and species names rather than all the other stuff. This makes it a
bit easier for you in the future when you interact with the data.
Before I played with these sequences, I ran an alignment on them to make sure we were
dealing with the matching sequences across taxa. There are many ways to do this and I
just used the online ClustalW server at http://align.genome.jp to align the sequences for
me. This is not something you want to do by hand and it is much better to let a computer
do some of the work for you. This algorithm aligns all the sequences and returns the
file in a clustal format. This is another text file but this time all the species have been
displayed in blocks with homologous sequence locations in the same text column. An
example of this is shown below with gaps (insertions/deletions) indicated as the dash
character (-).
Pinus caribaea var. hondurensi  CCCACCCAGGTTCGGTCCCGAAAGGAYTGATCTATCTTCTCCAATT
Pinus taeda                     ---ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus ponderosa                 ---ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
Pinus echinata                  ---ACCCAGGTTCGTTCCCGAACGGATTGATCTATCTTCTCCAATT
This file is also located in the folder for this chapter and is called conifers.aln, and this is
the file we will be working with.
Getting Aligned Sequences Into R
R does not by default recognize sequence data as anything more elegant than a sequence
of characters. As a result, several people have developed libraries for you to use that have
a lot of general functionality to them. In this section, I am going to use the library ape. If
you do not have this library installed on your machine, see Appendix B for an overview
of the process.
I am assuming that you currently have the data file in a location that you can reach
easily from within R. To load the aligned sequences into R type the following:
> library(ape)
> seqs <- read.dna("conifers.aln", format="clustal")
> class(seqs)
[1] "DNAbin"
> summary(seqs)
23 DNA sequences in binary format stored in a matrix.
All sequences of same length: 526
Labels: Abies alba Abies kawakamii Abies veitchii Abies homolepis Larix potaninii Cedrus atlantica ...
Base composition:
    a     c     g     t
0.310 0.187 0.160 0.343
There are several things that you can do with these aligned sequences. You can look for
motifs, examine CG content, etc. I will leave these options for you to play with later in
the exercises.
Constructing A Neighbor Joining Tree
To construct a Neighbor Joining (NJ) tree, we first need to create a distance matrix that
estimates the distances between pairs of sequences that we have in our file. There are
several different kinds of distance metrics that you can use in the calculation of this
distance matrix (see ?dist.dna for more information on these). We will use the default,
which is Kimura's 2-parameter model, called K80.
> D <- dist.dna(seqs)
> class(D)
[1] "dist"
> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00000 0.07252 0.09310 0.26890 0.15720 1.45700
The function dist.dna() takes as an argument a set of sequences that you have read in (they
must be of class DNAbin as shown above) and spits out the distance matrix. The distance
matrix, D, is a particular kind of matrix that holds only the lower triangle of the pair-wise
distance calculations. If you print it out, you will get a whole lot of output as it prints
the taxa names for row and column headers.
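To take some of the mystery out of what a single entry in the distance matrix is, here is a hand-rolled sketch of the Kimura 2-parameter distance for one pair of aligned sequences, using only the string functions from this chapter. The function name k80 is my own, and dropping every site that is not a plain A, C, G, or T in both sequences is an assumption of the sketch; dist.dna() may treat gaps and ambiguity codes differently, so do not expect identical numbers on real data. The formula is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P is the proportion of sites that differ by a transition and Q the proportion that differ by a transversion.

```r
# Kimura 2-parameter distance between two aligned sequences of equal length
k80 <- function(seq1, seq2) {
  a <- unlist(strsplit(toupper(seq1), ""))   # one letter per element
  b <- unlist(strsplit(toupper(seq2), ""))
  stopifnot(length(a) == length(b))

  # Keep only sites where both sequences have an unambiguous base
  bases <- c("A", "C", "G", "T")
  ok <- a %in% bases & b %in% bases
  a <- a[ok]
  b <- b[ok]

  purine <- c("A", "G")
  differ <- a != b
  # A transition swaps purine for purine or pyrimidine for pyrimidine
  transition <- differ & ((a %in% purine) == (b %in% purine))
  P <- mean(transition)              # proportion of transitions
  Q <- mean(differ & !transition)    # proportion of transversions

  -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
}

k80("ACGTACGTAC", "ACGTACGTAC")   # identical sequences have zero distance
k80("AAAAAAAAAA", "GAAAAAAAAA")   # one transition out of ten sites
k80("AAAAAAAAAA", "CAAAAAAAAA")   # one transversion out of ten sites
```

Notice that the two one-difference comparisons do not give the same distance, which is the whole point of a two-parameter model.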
Since D is a general distance matrix, we can look at the values in it. Figure 9.1 shows
a histogram of the distance values that have been estimated in D. From this we see that
there are many low values, meaning that those sequences are very similar to each
other, and then two or three peaks at larger values, suggesting some degree
of sequence divergence.
To create a NJ tree from these distances, we use the function nj().
> njTree <- nj(D)
> class(njTree)
[1] "phylo"
> summary(njTree)
Phylogenetic tree: njTree
  Number of tips: 23
  Number of nodes: 21
  Branch lengths:
    mean: 0.03838704
    variance: 0.01999758
    distribution summary:
      Min.    1st Qu.     Median    3rd Qu.       Max.
-0.0009736  0.0000000  0.0004898  0.0150700  0.8610000
  No root edge.
  First ten tip labels: Abies alba
                        Abies kawakamii
                        Abies veitchii
                        Abies homolepis
                        Larix potaninii
                        Cedrus atlantica
                        Larix decidua
                        Cedrus deodara
                        Larix laricina
                        Pinus roxburghii
  No node labels.
Figure 9.1: Histogram of distance estimates among all sequences using the K80 model of substitutions.
This function takes a distance matrix and returns a tree that is of the class phylo. We
can see that the variable njTree has some internal information that may be of
interest (e.g., branch lengths, etc.) but the real way we can understand it is by looking at
a graphic of the tree that is produced. To do this, we use the plot() command and pass it
the njTree variable as plot(njTree) (see footnote 2).
The topology of the tree (Figure 9.2) is easy to interpret and it is quite obvious where
those very large distances shown in Figure 9.1 come from. From this topology we can
see that:
1. The Pinus species are generally together forming a polytomy that connects to the
other genera in the family.
2. The Larix, Abies, and Cedrus form generally self contained groups.
3. The most divergent groups are the Picea and Keteleeria samples.
Footnote 2: You may be surprised by the utility of the plot function as it seems to know how to plot everything. Well,
in actuality this function is simply a wrapper that takes whatever you pass to it and determines if the class
of the object you passed has its own plot command. For the tree, the native command is plot.phylo() and you
have to look up that command to see the available options for it.
Figure 9.2: Neighbor joining tree based upon the trnL-trnF intergenic spacer sequences and the
K80 model of sequence evolution.
There is quite a bit more that can be done here but I think that is enough to get you on
the right track if you are interested in using R for some basic sequence analysis.
9.2 Producing Formatted Output
Often in the use of R there is a need to produce a particular kind of output from an
analysis or to display the contents of a particular variable. R does a pretty good job
itself, but it has some limitations. For example, you may want to print out a matrix of
values but only have 2 decimal places printed for each entry. Or you may want to export
a table of values as HTML so that you can copy and paste it into another program.
9.2.1 Formatting Strings For Printing
format(x, trim = FALSE, digits = NULL, nsmall = 0,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark = "", big.interval = 3,
       small.mark = "", small.interval = 5,
       decimal.mark = ".", zero.print = NULL, drop0trailing = FALSE, ...)
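That is a long signature, so here is a short sketch of the arguments I find myself reaching for most often. The matrix values are arbitrary, and pairing round() with nsmall is just one way to get a fixed number of decimal places.

```r
# A small matrix with more decimal places than we want to show
m <- matrix(c(0.1678067, -1.0302831, 0.8856766, 0.7392326), nrow = 2)

# round() trims the values; nsmall = 2 keeps trailing zeros such as 0.50
m2 <- format(round(m, 2), nsmall = 2)
m2                      # a character matrix with the same shape as m

# width pads a string out to a fixed number of characters
padded <- format("left", width = 10)
nchar(padded)

# big.mark inserts a separator every three digits
format(1234567, big.mark = ",")
```

Keep in mind that format() returns character data, so do your arithmetic first and your formatting last.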
9.2.2 Formatting Tables
Tabular data is a common kind of output to move into another format. Tables are
common features of statistical analysis and as such you will find it necessary to cut
a table out of R and paste it into a document in the same way that graphics can be
exported from R to be used in your manuscripts and reports.
For these examples, I will just create a matrix of values and add row and column names
using the functions rownames and colnames.
> x <- matrix(rnorm(12), nrow=3)
> x
           [,1]      [,2]       [,3]       [,4]
[1,]  0.1678067 0.8856766 -0.3955881  0.7677516
[2,] -1.0302831 0.7392326 -0.8333904 -0.3235135
[3,]  0.4396607 1.7622323 -0.8763023  0.6091688
> colnames(x) <- c("Header A", "Header B", "Header C", "Header D")
> rownames(x) <- c("Row 1", "Row 2", "Row 3")
> x
        Header A  Header B   Header C   Header D
Row 1  0.1678067 0.8856766 -0.3955881  0.7677516
Row 2 -1.0302831 0.7392326 -0.8333904 -0.3235135
Row 3  0.4396607 1.7622323 -0.8763023  0.6091688
> library(xtable)
> theMatrixTable <- xtable(x, caption="Caption For Table", align="l|cccc")
The variable theMatrixTable is now an xtable object. What we do with it at this point depends
upon how you want to interact with it.
Getting LaTeX Output
If you print it out as is, it will display the contents in LaTeX, a typesetting language that
is used to create very nice looking manuscripts and books (this entire book has been
written in it). If you use LaTeX to write your manuscripts then you are set. The listing
that follows shows the formatting that results, and Table 9.1 is what it
looks like when it is inserted into a LaTeX document.
% latex table generated in R 2.8.0 by xtable 1.5-4 package
% Wed Dec 31 14:22:46 2008
\begin{table}[ht]
\begin{center}
\caption{Caption For Table}
\begin{tabular}{l|cccc}
\hline
 & Header A & Header B & Header C & Header D \\
\hline
Row 1 & 0.17 & 0.89 & -0.40 & 0.77 \\
Row 2 & -1.03 & 0.74 & -0.83 & -0.32 \\
Row 3 & 0.44 & 1.76 & -0.88 & 0.61 \\
\hline
\end{tabular}
\end{center}
\end{table}
Table 9.1: Caption For Table
Header A Header B Header C Header D
Row 1 0.17 0.89 -0.40 0.77
Row 2 -1.03 0.74 -0.83 -0.32
Row 3 0.44 1.76 -0.88 0.61
You can also print the table to a file by calling the function print(theMatrixTable, file="theFileName.tex").
There are several other options available to you with the print function; see ?print.xtable for
more information.
Exporting In HTML for Web or Word
If you do not use LaTeX and are a biologist that does a lot of mathematical, programming,
or scientific work then you should be. That being said, there are many people for whom a
general overpriced and under powered word processor (which shall remain nameless but
is buggy and prone to viruses and screwing up your manuscripts, you know which one I
mean) is the best you can expect to master. The xtable can be exported into a format you
can open up in said program by first exporting the file as type="html". To export it as such,
call the command print(theMatrixTable, type="html", file="MyHTMLizedTable.html") and the table will be
saved. You can then open it up in your favorite word processor and it will turn the html
table into a normal table that you can manipulate in your documents. An example of the
html markup that this function produces is given below and an image of it is presented
in Figure 9.3.
<!-- html table generated in R 2.8.0 by xtable 1.5-4 package -->
<!-- Wed Dec 31 14:22:51 2008 -->
<TABLE border=1>
<CAPTION ALIGN="bottom"> Caption For Table </CAPTION>
<TR>
<TH> </TH>
<TH> Header A </TH>
<TH> Header B </TH>
<TH> Header C </TH>
<TH> Header D </TH>
</TR>
<TR>
<TD> Row 1 </TD>
<TD align="center"> 0.17 </TD>
<TD align="center"> 0.89 </TD>
<TD align="center"> -0.40 </TD>
<TD align="center"> 0.77 </TD>
</TR>
<TR>
<TD> Row 2 </TD>
<TD align="center"> -1.03 </TD>
<TD align="center"> 0.74 </TD>
<TD align="center"> -0.83 </TD>
<TD align="center"> -0.32 </TD>
</TR>
<TR>
<TD> Row 3 </TD>
<TD align="center"> 0.44 </TD>
<TD align="center"> 1.76 </TD>
<TD align="center"> -0.88 </TD>
<TD align="center"> 0.61 </TD>
</TR>
</TABLE>
The HTML above produces a table that, when opened in Firefox, looks like that presented
in Figure 9.3.
Figure 9.3: The html printout of a xtable as interpreted in Firefox. You can also import tables
saved as html into popular word processors and use them as normal table items in the creation of
your documents.
9.3 Plotting Special Characters
There are some special characters that you should be aware of when trying to get your
data output into a readable format. These characters are not necessarily ones that you
specifically type as letters on the keyboard; rather, they are ones that are available as their own
buttons on the keyboard, namely the tab character, the newline character, and the bell
character.
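Inside an R string these characters are written with a backslash: "\t" for tab and "\n" for newline. The sketch below shows them at work with cat(); the site names and counts are made up.

```r
# Each escape is typed as two keystrokes but stored as one character
nchar("\t")
nchar("\n")

# cat() prints the string as-is, so the tab and newline take effect
cat("Site\tCount\n")
cat("Richmond\t12\nVarina\t7\n")

# The same escapes work as the separator in paste()
row <- paste("Petersburg", 9, sep = "\t")
cat(row, "\n")
```

If you print() one of these strings instead of using cat(), R shows you the escape sequence rather than acting on it, which is useful when you are trying to figure out what is actually in a string.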
All the characters on your keyboard (assuming that you are using an en_US keyboard)
are specified as single values in ASCII (ASCII stands for the American Standard Code
for Information Interchange). Obviously, since the first A stands for American, there are
a lot of characters that you see on a computer screen that you cannot type directly on
a keyboard such as letters with accents, Greek and Latin characters, and then
there are all those non-US English characters and hieroglyphs. Your terminal that you
are running R from cannot handle these characters but you can get them into plots that
you make.
R has the nice ability to produce slightly complicated output for the axes of your plots
as well as for putting into most graphics you produce. Items such as subscripts, superscripts,
and mathematical symbols are easily produced using just a few different
functions.
The primary way of producing formatted text for graphics output is through the use
of the expression function. And the best method for looking at the ability of R to provide
nice math-like output is to look at its own demo. So, start R and type:
> demo(plotmath)
This command will show you a short series of tables in a figure window that have
examples of the different kinds of math plotting that R handles. When
R sources the demo script it passes the optional echo=TRUE parameter so that
all the commands that are used to produce the output are also shown in the R command
interface. This way you can see how each of the cells in the displayed tables is being
encoded. An example of some of the copious output is:
> draw.plotmath.cell(expression(italic(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(bold(x)), i, nr); i <- i + 1
> draw.plotmath.cell(expression(bolditalic(x)), i, nr); i <- i + 1
The demo script itself defines the function draw.plotmath.cell(), so don't worry about that part.
The parts you should focus on are the calls such as expression(bold(x)). There are several options that
you can pass to the expression function and it is not quite worth listing them all here since
you can see them in the R demo itself. However, I will show some of the more common
methods in the plot shown in Figure 9.4.
> x <- rnorm(100)
> y <- 23 + 1.4*x + 2*rnorm(100)
> plot(x, y, bty="n", ylab=expression(X[stuff]), xlab=expression(chi^2), col="red")
For both the x and y-axes, I use the expression function to create labels with subscripts
and superscripts. If you like, you can define these labels as individual variables prior to
plotting to keep the plot command a bit cleaner; there is really no difference in
the speed at which R would evaluate them. Here is another example:
> xlabel <- expression(bold(x[i]))
> ylabel <- expression(italic(x[i]^2))
> plot(x, y, bty="n", xlim=c(0, 20), type="l", lwd=2, col="blue", xlab=xlabel, ylab=ylabel)
Look at the demo(plotmath) output to see the diversity of plotting approaches.
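A few more plotmath constructs that come up often are sketched below: a bar over a symbol, a square root, a Greek letter, and ordinary text mixed with a symbol. The labels themselves are arbitrary, and the call to pdf(NULL) is only there so that the sketch runs without opening a window or writing a file; leave it (and dev.off()) out if you want to see the plot on screen.

```r
pdf(NULL)    # draw to a null device; remove this line to plot to the screen

x <- rnorm(100)
y <- 23 + 1.4 * x + 2 * rnorm(100)

xLab <- expression(bar(x)[i])                        # x-bar with subscript i
yLab <- expression(sqrt(y) + alpha^2)                # square root plus alpha squared
mainLab <- expression(paste("Slope ", beta == 1.4))  # plain text next to a symbol

plot(x, y, bty = "n", xlab = xLab, ylab = yLab, main = mainLab)

dev.off()
```

Inside expression(), paste() is what lets you put a quoted piece of normal text beside a mathematical symbol.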
Figure 9.4: Example of using the expression function to annotate a graphic.
9.4 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
any(x) Returns TRUE if at least one of the values in x is TRUE (for example, whether a call to grep() found any match).
cat(x) Concatenates the objects in x and dumps them out to the interface.
format(x) Formats the object x for rigid (some say pretty) printing.
substring(x,s,f) Takes the string in x and returns the substring starting
at position s and finishing at position f.
strsplit(x,c) Splits the string x on the character (or characters) in c.
nchar(x) Returns the number of characters in the string x.
expression(x) Takes the items in x and turns them into an expression
that can be plotted by a graphics function.
nj(x) Performs the neighbor joining algorithm on the distance matrix
x.
unlist(x) Takes the list x and returns it as a vector.
9.5 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. Create a table from the data data <- matrix(rnorm(9), nrow=3) and label the rows as c("Richmond",
"Petersburg", "Varina") and the columns as c("PPM(A)", "PPM(B)", "PPM(C)"). Use the xtable
library to export this table as HTML and then import it into your answers. (This
is a very helpful methodology for getting formatted data out of R and into your
manuscripts).
2. Create a table of the different words found on the first page of the Chapter entitled
"Preface" in this text.
3. Use the strsplit function to break apart the raw text of the first four paragraphs of
the Chapter entitled "Preliminaries" into sentences (HINT: use the
"." as the character to break apart the string on and you can copy and paste it from
the pdf; since strsplit treats its split argument as a regular expression, pass fixed=TRUE
to split on a literal period). Then use the grep command to find the sentences that have the word "are" in
them.
4. Show how you would use the sub command to fix the sentence, "Dr. Dyer is a loser"?
(And when I say fix I mean make it say that I am not...)
5. How many characters are in the first paragraph of this Chapter?
6. Create a density plot of the $\chi^2$ distribution and make the main label say $\chi^2$ using the
expression() function (hint: this character is called chi).
7. In the previous graph, plot a dotted vertical line to indicate where the mean value of
the distribution is and put the character symbol next to it.
8. Use the aligned sequences to create a few different distance matrices by changing
the model type that you pass to the function dist.dna(). Do alternate distance models
have different densities of values? (Hint: plot a density plot for each distance matrix
on the same graph similar to what is shown in Figure 9.1).
9. Do these different distance models produce different tree topologies when using the
nj() function? If so, show the trees and describe the differences you see in the trees.
10. Do the functions nj(), fastme.bal(), and bionj() produce the same looking topologies? You
should read the help for these functions to see what they are as you probably haven't worked with
them yet. Explain.
Part III
Extending R
Chapter 10
Basic Scripts
When I use the term script here, what I am referring to is a set of R commands that you
put into a text file and have R evaluate. Learning how to write scripts will help you out
in the following ways:
1. In general we are all lazy. It seems to be a monumental task to type the same thing
into R over and over again. Scripts allow you to put your commands into a text file
and have R run them for you.
2. Keeping your analyses and data sets together is a great way for you to not lose
a record of what you have done. At a later date, you can come back and pick up
where you left off. If you have more data or another angle at the analysis, having a
record of how the previous analyses were performed is a huge benefit.
3. There are times when you have to do the same thing over and over again, say make
graphs of a large number of variables or transform a lot of different data sets using
the same algorithm. If you put the commands in a script (and, later, combine them with
the programming tools of Chapter 11 and the functions of Chapter 12), you can run it
over and over again with ease (remember the lazy thing?).
So in essence, scripts are enablers for our laziness.
In this chapter, you will focus on the following topics:
Learn about basic script writing
Understand differences between code evaluated from a script and that same code
typed into the interactive R command line
Execute scripts in R
10.1 Writing Scripts
A script is nothing more than a series of commands that R recognizes and evaluates.
Within a script, you can define data (Chapter 2), functions (Chapter 12), or other
operations. It is convenient to have a record of the commands that you use in R to
produce output.
10.1.1 Knowing Directories
A script must be a plain text file and it must reside in a location that you can point R to.
When you start an interactive session in R, it notes the current directory that you are
using. This is what is called the cwd or current working directory. If you are using R
from a GUI-ish installation such as on Windows, you have to tell R which directory to
use as a starting place. You can change the cwd from the Change dir... command in the
File menu. If you are starting R from a terminal (in OSX or some Unix variant), then
the directory where you started will be the cwd.
Here are a few tips that I find helpful when I work with R:
It is a pretty good idea to keep your data sets and the scripts that you use to analyze
these data in the same directory. Use your descriptive skills in naming your data
and scripts such that you know what is contained in the file without looking at
it (e.g., perhaps a data set named DogwoodGerminationRates27.csv and the R
script as AnalysisOfDogwoodGermination.R; it just makes things easier).
It is also a good idea to make sure that you separate your directories of data and
associated scripts such that it is easy for you to find the right directory. Keeping
it all mashed together into a single directory can cause problems with data sets
having the same name (e.g., the infamous data.txt).
Always provide labels for each column of data. At some time in the future you will
need to look at the data set and figure out what each column of data represents.
In your scripts, provide a lot of comments. Lines that start with the hash character
(#) are ignored by R and you can use them for adding comments about what the
script, program, functions, or variables actually mean. I cannot emphasize this
enough. You will leave this class and at some point in the future look back on
some script you wrote and want to figure out how it works, and without copious
comments you will fail and have a small sense of being a genuine loser. You have
been warned.
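To make the directory business a bit more concrete, here is a minimal sketch of checking and changing the cwd from inside R itself rather than from a menu. The project directory name, the file name, and the germination numbers are all made up for illustration; the calls themselves (getwd(), setwd(), dir.create(), list.files()) are standard base R.

```r
# Where is R looking right now?
startDir <- getwd()
print( startDir )

# Make a scratch project directory (hypothetical name) and move into it
projectDir <- file.path( tempdir(), "DogwoodProject" )
dir.create( projectDir, showWarnings=FALSE )
setwd( projectDir )

# Save a small, made-up data set with labelled columns in the cwd
germination <- data.frame( Population=c( "A", "B" ), Rate=c( 0.42, 0.57 ) )
write.csv( germination, "DogwoodGerminationRates.csv", row.names=FALSE )

# list.files() confirms what R can see in the cwd
print( list.files() )

# Return to the original directory when finished
setwd( startDir )
```

If a script cannot find its data file, printing getwd() and list.files() is usually the quickest way to see whether R is simply sitting in the wrong directory.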
10.1.2 The Editor
You can write a script in any basic text editor. For some installations of R, there is a
pseudo-GUI associated with it (e.g., Windows) because there is no real command line
terminal in the OS. This interface to R often has an integrated editor built into it and
if it is there you should probably use it unless you have another editor of choice that
you feel more comfortable with (see footnote 1). If you do not want to use the supplied
editor or do not have one available, you may want to check out TEXTMATE or TEXTWRANGLER
on OSX, E or CRIMSON EDITOR (or the million others that are on this pedestrian platform)
on Windows, and for Unix/Linux you can use GEDIT, KATE, EMACS or VI (n.b. If you learn
one of these last two you will never need another editor on any platform).
Footnote 1: There have literally been decades of wars fought over the choice of the real
editor. If you are interested in cultural aspects of programming and programmers (e.g.,
nerds like myself) fire up a google search for vi vs. emacs and sit back and enjoy.
The important component of the editor that you are looking for is one that understands
R (or SPlus) and can provide you with syntax highlighting, parenthesis matching, and
automatic indentation. These are things that just make your life easier. After all, if you
are going to be spending a lot of time in front of your computer, you may as well have
tools that help instead of get in the way. Speaking of getting in the way, you should
never, under any circumstance, even think of using Word to do any of this.
OK, so open your editor and we will make a very small script that does something entirely
useless. There is a data set named ScriptExampleData1.txt in the class folder. Make sure
your script is saved in the same directory as the data file. In R type the following code
and see what happens.
> theData <- read.table( "ScriptExampleData1.txt", header=TRUE, sep="," )
> summary( theData )
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
> range( theData$Height )
[1] 23.4 38.2
> levels( theData$Population )
[1] "A" "B"
It should have loaded theData and provided a summary of it as shown. If not, you are
probably not in the correct directory. Change to the right directory and redo.
Now, take the same code and put it into your script file. Obviously, you do not want to
copy the responses that the R engine had provided to you, just the commands that you
typed. Save the script as AnalysisOfScriptData.R (note the .R suffix on the script file;
R will source a file with any name, but the suffix tells you and your editor that it is
R code). Congratulations, you have written your first script. In the next section we
will evaluate the script and note a few differences.
10.2 Evaluating Scripts
The R engine can load and evaluate scripts relatively easily. Take a look at the
documentation for the source() command by typing ?source into R and give it a read. OK,
ready? In R type source("AnalysisOfScriptData.R") and see what happens... Nothing. Why is
this? The same commands produced lots of output when typed directly into R...
The issue is that when you are typing commands into R you are doing so in an interactive
mode. You say "do this" and it says "OK". However, when you are executing the contents
of a script, it is not entirely clear where output should go: to another file, to the
screen, or to some other place. As a result, if you want to get a response from stuff in
a script you need to tell R to print the results. So for example, if you change your
script to look like:
theData <- read.table( "ScriptExampleData1.txt", header=TRUE, sep="," )
print( summary( theData ) )
print( range( theData$Height ) )
print( levels( theData$Population ) )
and source it from R, you'll get:
> source( "AnalysisOfScriptData.R" )
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
[1] 23.4 38.2
[1] "A" "B"
Again, notice that here the output was only the response of the commands; the
commands themselves were not echoed to the R environment. You can get R to echo each
command and then provide the results when it is in a script by adding the optional
echo=TRUE option to the source() function as shown in the output below:
> source( "AnalysisOfScriptData.R", echo=TRUE )
> theData <- read.table( "ScriptExampleData1.txt", header=TRUE, sep="," )
> print( summary( theData ) )
 Population     Height          Sex
 A:5        Min.   :23.40   Female:5
 B:4        1st Qu.:27.70   Male  :4
            Median :29.70
            Mean   :30.04
            3rd Qu.:32.70
            Max.   :38.20
> print( range( theData$Height ) )
[1] 23.4 38.2
> print( levels( theData$Population ) )
[1] "A" "B"
This is helpful if you are debugging a script (e.g., figuring out why it is crashing or
giving you the wrong answers).
So, in a script, things won't be printed out to the R terminal unless you tell it to. And
it is quite appropriate to ask yourself why you want particular things printed out as the
script is executing. The variables in a script are available in the main R memory so if
you define a new variable in the script, after the first time you source() it, you will have
access to it. However, because you can add variables to the main memory of R from a
script, I typically erase all variables from memory at the beginning of each script using
the command rm( list=ls() ). This way it is easy to see that the variable x you are working
with is the real one and not another x you had used two hours ago. This is a very
important point. Again, we are thinking about the future here and we need to make sure
that the things that we do in our analyses are reproducible at some point in the future.
Relying on variables that are outside our script and are only in memory because we did
something before running our scripts will lead to frustration (bet on it!).
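If you want to check these behaviours without touching your own files, the following is a minimal sketch that builds a throw-away script in a temporary file and sources it twice. The script contents and the variable name heights are invented for the example; capture.output() is used here only so that we can look at what would have gone to the console.

```r
# Write a tiny three-line script to a temporary file (hypothetical contents)
scriptFile <- tempfile( fileext=".R" )
writeLines( c( "heights <- c( 23.4, 29.7, 38.2 )",
               "range( heights )",
               "print( max( heights ) )" ),
            scriptFile )

# capture.output() collects whatever the script sends to the console
quietOutput <- capture.output( source( scriptFile ) )
echoOutput  <- capture.output( source( scriptFile, echo=TRUE ) )

# Only the explicit print() call produced output; range() ran silently
print( quietOutput )

# With echo=TRUE the commands and every result are shown, so there are more lines
print( length( echoOutput ) )

# Variables defined in the script remain in the workspace afterwards
print( exists( "heights" ) )
```

The last line is the reason for being careful about leftovers: anything a sourced script defines stays in memory until you remove it.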
In Chapter 9 there was a more complete discussion of how you can format your data for
printing. As you begin writing scripts right now, just focus on writing the routines that
you need to use to get an answer and later you can focus on making it look pretty.
10.3 Adding Comments To Your Code
Speaking of looking pretty, you must add comments to your code so that you remember
what is going on inside that file. To comment code in R you put a hash character at
the beginning of the text that you want to be treated as a comment. Everything from the
hash character to the end of the line is ignored. Everything to the left of the hash
character is considered code that will be evaluated.
x <- 20 # this comment will let the assignment happen
# this is a comment that spans multiple lines and won't
# be evaluated even if it has logical R code in it
# x <- 21
print( x )
Empty lines are also a nice feature to sprinkle through your scripts so that logical
partitions can be identified. The R interpreter ignores all commented material and all
lines that do not have anything on them, so you are not penalized for having them there.
10.4 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
# Indicates the start of a comment. The R interpreter ignores everything to the right
of this symbol.
rm(x) This function removes the variable x (or if x is a list of variable names all of
them) from memory.
source(x) This function causes R to look for the script named x and evaluate its
contents from start to finish. This works just as if you had typed in the lines of the
script with the exception of how variables are printed out to the terminal.
cat(x) This function dumps the contents of x to the GUI output as a single entity.
print(x) Send the contents of the variable x to the terminal output.
summary(x) Provides a summary of the variable x.
10.5 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. How do you remove all variables from memory in the current workspace?
2. What happens when you set the optional argument verbose=TRUE when calling source?
3. Are you lazy?
4. Can R evaluate scripts that are written in Word or Excel?
5. How do you change the current working directory in R ?
6. How does the optional argument echo=TRUE change the output of sourcing a script in
R?
7. How would you print the summary of a data frame from within a script?
8. What character is used to indicate a comment?
9. How would you comment out several lines of code in a script?
10. Why is it important to comment your code?
Chapter 11
Programming
Programming is the art of getting a computer, which does only and exactly what you tell
it to do, to do the things you want it to do. The language that R uses for programming
is derived from S (the language behind S-Plus) and will look familiar to anyone who
has programmed in another language or seen other programming languages before.
In general, the majority of programming in R will be very linear. While it is possible to
program in an object-oriented fashion (and indeed it is not that bad of an
implementation in my opinion), I won't be covering that in this book. The programs that
I'll help you build will have a start, proceed through a set of operations, have some
conditional statement, perhaps some loops, print out some stuff or save it to a file, and
then exit. If you have never programmed before you need to think about programming as a
kind of recipe, a very precise one. You need to think about the problem that you are
going to solve by writing a program. And then you need to think about the exact steps
that you will need to take to accomplish what you are attempting to do.
In this chapter, we will tackle a rather easy problem as a test case to show off how to
construct a very simple program. The problem that we are going to deal with is how to
measure canopy light from a hemispheric photo. An example of a hemispheric photo is
given in Figure 11.1. This photo was taken by S.B. Weiss from the winter roosting habitat
of the monarch butterfly in the Monarch Biosphere Reserve, Mexico. In this image, it is
easy to see the amount of canopy closure when taken from the hemispherical lens.
What we are going to do in this chapter is determine how much of that image is open
sky as a surrogate to measure available light in these forests. In the next few sections,
I will show some basic programming tools that we will use to write this program. Then
we will walk through the loading of an image and discuss how we get information from
and manipulate image data. Finally we will set out to write the program in a step-wise
fashion and finish with the completed program.
In this chapter, you will focus on the following topics:
Be introduced to some basic programming logic and the corresponding R grammar.
Develop a detailed pseudocode for a given program.
In a step-wise fashion, develop and test the program.
Figure 11.1: Hemispherical photograph of winter roosting habitat at Monarch Biosphere Reserve,
Mexico. Photo by S.B. Weiss, made available under the Creative Commons Attribution 2.5 license.
11.1 Looping
As mentioned in Chapter 2, R is primarily a vector language. The consequence of this
is that if you are looking for a language to do fast loops through a data set, R is not it.
In fact, Perl or Python would actually be faster to do looping-like algorithms. That being
said, there are reasons we occasionally need to use loops in R and here is a general
overview.
A loop, when referred to in a programming language, is a sequence of statements that
are repeated over and over again until some condition is reached. The items inside the
loop are typically contained within curly brackets (e.g.,{}).
11.1.1 The While
The while loop is a good construct to use if you have a particular condition which you
want to check over and over again, performing some operations as long as the condition
holds. The while loop has the form while(COND){ <code goes here> }. The COND term
in the parentheses is evaluated as a logical statement each time you go through the loop
and the loop will continue as long as COND is TRUE. When COND is FALSE, the loop exits
and R starts to evaluate statements after the closing curly bracket. There can be a lot
of code between the brackets.
The following example loops as long as x < 10 and prints out the value of x each time
through the loop.
> x <- 0
> while( x < 10 ) {
+   x <- x + 1
+   cat( x, " " )
+ }
1 2 3 4 5 6 7 8 9 10
When you start looping here, x = 0 and at each time through the loop, the variable x is
incremented and printed out on the console.
11.1.2 The For
Another common loop is one that actually focuses on the value of a counting variable
(e.g., the index in the loop). What this looping construct does is combine the
initialization of the counter, the stepping of the counter to its next value each time
through the loop, and the exit from the loop once the counter has taken every value it
was given. The general form of the for statement is for( COND ){ <code goes here> }. The
COND can be one of many different constructs that sets up a counting variable. Here are
some examples using the counting variable i.
> for( i in seq( 0, 9 ) ) {
+   cat( i )
+ }
0123456789
> for( i in 0:9 ) {
+   cat( i )
+ }
0123456789
> x <- seq( 0, 9 )
> for( i in seq( length( x ) ) ) {
+   cat( x[i] )
+ }
0123456789
> for( i in x ) {
+   cat( i )
+ }
0123456789
For the COND the variable i is used as the counting variable along with the keyword in.
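Since R is primarily a vector language, it is worth seeing a loop next to the vectorized call that does the same job. The following is a small sketch with made-up height values; it also uses seq_along(), a base R function not otherwise covered in this chapter, which I am bringing in because it behaves sensibly when the vector is empty.

```r
heights <- c( 23.4, 27.7, 29.7, 32.7, 38.2 )   # hypothetical values

# Loop version: accumulate the sum one element at a time
total <- 0
for( i in seq_along( heights ) ) {
  total <- total + heights[i]
}
loopMean <- total / length( heights )

# Vectorized version: one call does the same work
vectorMean <- mean( heights )
cat( loopMean, vectorMean, "\n" )

# seq_along() is safer than 1:length(x) when x might be empty
empty <- numeric( 0 )
for( i in seq_along( empty ) ) cat( "never printed\n" )
length( 1:length( empty ) )   # 2, because 1:0 counts down from 1 to 0
```

The loop is fine for learning and for small jobs; when a built-in vectorized function exists it is usually both shorter and faster.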
11.2 Conditional Statements
The next tool in your R programming toolbox is the conditional statement. Conditional
statements control the flow of logic through a script or program. There are many
cases where you would like to run some command or sets of commands if some condition
is true. For example,
if( CONDITION ) then RESPONSE
else if( OTHER_CONDITION ) then OTHER_RESPONSE
else FINAL_RESPONSE
Here the logic asks about the state of CONDITION and OTHER_CONDITION. If CONDITION is
TRUE then RESPONSE is done and none of the other conditions are evaluated nor are their
responses performed. The R interpreter just skips everything until the end of the set of
conditionals. If CONDITION is not TRUE but OTHER_CONDITION is, then the only response to
be performed is OTHER_RESPONSE. If neither CONDITION nor OTHER_CONDITION is true then
FINAL_RESPONSE is performed. Note, only one response is ever performed each time. (The
word then is only part of this outline; in actual R code the response follows the closing
parenthesis of the if directly.)
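As a rough illustration of that outline in real R syntax, here is a small sketch that sorts a few made-up heights into three categories. The cut-offs of 25 and 35 and the category names are arbitrary choices for the example.

```r
heights <- c( 23.4, 29.7, 38.2 )   # hypothetical measurements
labels  <- character( length( heights ) )

for( i in seq_along( heights ) ) {
  if( heights[i] < 25 ) {
    labels[i] <- "short"
  } else if( heights[i] < 35 ) {
    # only reached when the first condition was FALSE
    labels[i] <- "medium"
  } else {
    # reached when none of the conditions above were TRUE
    labels[i] <- "tall"
  }
}
print( labels )
```

Keeping each else on the same line as the closing bracket before it is a safe habit: at the top level of a script or the console, R treats an if as finished at the end of its line or block, and an else that starts a new line there is an error.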
In the example below, I set up a vector of boolean (TRUE|FALSE) variables and then loop
through them one at a time and see what they are.
> observations <- as.logical( c( TRUE, FALSE, FALSE, TRUE, TRUE ) )
> observations
[1] TRUE FALSE FALSE TRUE TRUE
> for( obs in observations )
+   print( obs )
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
> for( obs in observations ) {
+   if( obs == TRUE )
+     cat( obs, "it is true \n" )
+   else
+     cat( "not\n" )
+ }
TRUE it is true
not
not
TRUE it is true
TRUE it is true
We can also use the result of an operator as the CONDITION in an if statement. In the
example below, we cycle through the numbers 1 through 10. For each of them, we determine
if it is odd or even using the modulus operator %%. This operator returns the remainder
after a division.
> for( i in 1:10 ) {
+   if( i %% 2 )
+     cat( i, " is odd\n" )
+   else
+     cat( i, " is even\n" )
+ }
1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
10 is even
Each time through, the remainder i %% 2 is evaluated. Possible values for this are
1 and 0 which, when coerced as by as.logical(), turn out to be TRUE and FALSE
respectively, printing the appropriate message.
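The coercion that makes this work can be looked at on its own. This is a short sketch, not part of the canopy program, showing the remainders, their logical equivalents, and a more explicit way of writing the same test that some people find easier to read.

```r
i <- 1:10

# Remainders after dividing by 2: 1 for odd numbers, 0 for even ones
remainders <- i %% 2
print( remainders )

# Non-zero values become TRUE and zero becomes FALSE
isOdd <- as.logical( remainders )
print( i[ isOdd ] )

# Spelling out the comparison avoids relying on the coercion
for( value in 1:4 ) {
  if( value %% 2 == 1 ) cat( value, "is odd\n" ) else cat( value, "is even\n" )
}
```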
11.2.1 Bracketing
There is a little bit of bracket magic going on here and I should take the time to make
a few comments. Notice in the previous listing, there were brackets {} surrounding the
content inside the for loop. These brackets are essential because there is more than
one line of code inside the for loop. If there were only one line (see previous code listing
where print(obs) is the only code inside the for loop) then the enclosing brackets are
optional.
As a general rule, after any conditional (e.g., the if/else if/else) or loop (e.g., while/for) if
there is only one line of code then you do not need to use brackets if you do not want to.
Examples include:
> if( rnorm( 1 ) > 0.5 )
+   print( "greater" )
> while( TRUE )
+   print( "this will last forever" )
This rule is recursive in that the one line of code is any line that is not a conditional
or a loop. In the next example, I loop through the numbers 1-10 and look for those
even numbers that are not divisible by 4 (n.b. I could have used a compound
conditional statement such as if( !(i%%2) && (i%%4) ) but that would have really screwed
up my example).
> for( i in 1:10 )
+   if( !( i %% 2 ) )
+     if( i %% 4 )
+       cat( "the value=", i, "\n" )
the value= 2
the value= 6
the value= 10
In some sense, you can think of these kinds of one-liners as just extensions of
one-offs. There is nothing wrong with using brackets even in these cases. In fact, it may
open up your code a bit and make it a bit easier to read in the future. You just do not
have to use them.
However, where you want more than one statement to be executed after a loop or
conditional statement then you must use brackets.
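The classic way to get bitten by this rule is to indent two lines under an if and forget the brackets. Here is a minimal sketch with throw-away counter variables (the names are mine) showing that only the first statement belongs to the if when the brackets are missing.

```r
x <- -1
countA <- 0
countB <- 0

# Without brackets only the first statement belongs to the if
if( x > 0 )
  countA <- countA + 1
  countB <- countB + 1   # indented, but this line always runs

# With brackets both statements are controlled by the condition
countC <- 0
countD <- 0
if( x > 0 ) {
  countC <- countC + 1
  countD <- countD + 1
}

cat( countA, countB, countC, countD, "\n" )
```

Since x is negative, the only counter that changes is countB, which is exactly the one the indentation suggested was protected by the if. R pays no attention to indentation.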
11.3 Outlining A Program
The most difficult part of programming is understanding where to start. Writing a
program, on the surface, appears to be a daunting task in itself. However, when I write
programs I tend to think of them not as a single large program but as a series of smaller
steps. The key to doing this is to understand the sequence of steps that we need to
accomplish so that the program can do what is required.
So, first things first. State what you want the program to do in specific terms. For
this Chapter we will be working on developing a program that calculates the amount of
canopy openness from a hemispheric image (Figure 11.1). If you haven't already done
so, I recommend that you look at Chapter 7 to refresh yourself on how we work with the
internals of an image.
Next, we need to get out a sheet of paper and write down, exactly, how the program is
going to work. It is important that we include all the steps necessary and in the order in
which they are to be performed. An example of this would be:
1. Load image into memory
2. Determine what parts of image are open canopy
3. Determine total area of image
4. Print out the proportion of canopy that is open.
So, each of these steps is a relatively easy one by itself and we will create the overall
program by breaking it up into manageable parts.
11.4 Creating A Program
It is often necessary to incrementally build a program. Using the outline in the previous
section, we can open a new file and create a script that does each of these items in
succession. Typically, I find it helpful to work on the R command line to test out
particular sets of commands and when I have it exactly like I like it then I move it to a
script.
11.4.1 Step 1: Loading An Image Into Memory
In Chapter 7, we examined how to load images into memory, translate them into
various formats, and get into their knickers, so to speak. So to begin with, the image as
I retrieved it from Wikipedia is a JPEG image. I will begin by turning it into a PPM
formatted image as discussed in Chapter 7 using the program GIMP (http://www.gimp.org),
although you could use any image manipulation program and there are several free ones
available for you on the internets. The PPM file is what you have access to in the class
folder for Chapter 11.
> library( pixmap )
> img <- read.pnm( file="Hemiphoto monarch habitat1.ppm" )
Read 637563 items
> plot( img )
Figure 11.2: The blue channel of the canopy picture displayed as a greyscale image.
Figure 11.3: A histogram of values in the blue channel (Figure 11.2).
Now we have the image loaded and a plot that is identical to that displayed in Figure
11.1, and we must figure out how it is represented.
11.4.2 Step 2: What Is Open Canopy
The variable img has the following components and here we need to figure out what parts
of the image are the sky parts.
> names( attributes( img ) )
[1] "size" "cellres" "bbox" "bbcent" "channels" "red" "green"
[8] "blue" "class"
Remembering that there are three different channels in a PPM file, one for red, one for
green, and one for blue, perhaps we should look there first. You can plot each of the
channels as an image by creating a pixmapGrey() image and see the intensity of each color
channel.
> plot( pixmapGrey( img@blue ) )
> plot( pixmapGrey( img@red ) )
> plot( pixmapGrey( img@green ) )
And from this you will see that the different channels look pretty much the same when
evaluating the area that is considered the sky in this image. For our purposes, we will
only use the blue channel as displayed in Figure 11.2.
So if that is the component of the image that we are going to use, we now need to
determine which values to look for. To do this, you can easily make a histogram composed
of the values in the blue channel of the image using the command hist( img@blue ). We can
see from Figure 11.3 that there is a tremendous number of values in this channel at the
low end, a peak at around 0.2 and another at the top end close to 1.0.
We can get a bit more specific with this image and plot the intensity of a particular row
of values in the blue channel to double check that we think values close to 1.0 should
represent light values and those near 0.0 are the dark regions. The following commands
create the image displayed in Figure 11.4 where the raw values along the 230th row of
pixels (indicated by the red dashed line) are shown in blue. It is easy to see that the
value in the blue channel gets larger as the dashed line crosses the image.
Figure 11.4: Intensity of blue channel values in the image as taken through a slice of the image
(at pixel row 230 as indicated by red dashed line).
> plot( img, axes=TRUE, bty="n", xlab="Image Width", ylab="Image Height" )
> par( new=TRUE )
> abline( 230, 0, col="red", lwd=2, lty=2 )
> par( new=TRUE )
> plot( img@blue[230,], bty="n", type="l", xlab="", ylab="", col="blue",
+       lwd=3, axes=FALSE, ylim=c( -10, 10 ) )
So, at this point, we need to make a value judgement. We are fairly confident that values
close to one in the blue channel (and others you can go check yourself) represent areas
in the image where it is pretty light. But, we need to make a cut-off such that if we look
at a pixel, we can put it into the light or not-light category. For the purposes of this
exercise, I will assume that values that are at least 0.98 are to be considered as sky and
I will also make the restriction that I need the pixels in each channel to meet or exceed
this cut-off.
Now, to find out how much of the image is sky (using this definition), we must:
1. Loop through every matrix and the items in each matrix.
2. Evaluate if the value should be considered as sky or not.
3. Use a variable to keep track of all the pixels that meet the criteria
So to our script, we will add the following lines of code
> numRows <- img@size[1]
> numCols <- img@size[2]
> numSky <- 0    # the counter must exist before we add to it
> for( row in 1:numRows ) {
+   for( col in 1:numCols ) {
+     if( img@red[row,col] >= 0.98 &
+         img@green[row,col] >= 0.98 &
+         img@blue[row,col] >= 0.98 )
+       numSky <- numSky + 1
+   }
+ }
> numSky
[1] 9624
So, in the image across all three color channels, we find a total of 9,624 pixels that can
be considered to represent the sky (see footnote 1).
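One way to convince yourself that the nested loops count what you expect is to run them on a small made-up image where nothing depends on the photo file. The sketch below fills three 20 by 20 matrices with random numbers to stand in for the red, green, and blue channels; the seed, the size, and the loose cut-off of 0.5 are all arbitrary assumptions so that some of the fake pixels qualify. It then compares the loop count against a single vectorized sum() over the same three conditions.

```r
set.seed( 42 )   # arbitrary seed so the fake image is repeatable
n <- 20
red   <- matrix( runif( n * n ), nrow=n )
green <- matrix( runif( n * n ), nrow=n )
blue  <- matrix( runif( n * n ), nrow=n )
cutoff <- 0.5    # a loose cut-off chosen only for this toy example

# Loop version, with the counter initialized before the loops start
numSky <- 0
for( row in 1:n ) {
  for( col in 1:n ) {
    if( red[row,col] >= cutoff &&
        green[row,col] >= cutoff &&
        blue[row,col] >= cutoff )
      numSky <- numSky + 1
  }
}

# Vectorized version: TRUE counts as 1 and FALSE as 0 when summed
numSkyVector <- sum( red >= cutoff & green >= cutoff & blue >= cutoff )

cat( numSky, numSkyVector, "\n" )
```

Inside the if I used && because each comparison there is a single value, while the vectorized line needs & because it compares whole matrices element by element.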
11.4.3 Step 3: Determine The Total Area Of The Image
OK, finally we are almost finished. We need to now determine what the total number of
pixels there are in the image so that we can get a standardized percent of open canopy.
We could use the total number of pixels 461^2 = 212,521 but the image taken with the
fish-eye lens is not square, rather it is a circle that fits in a square whose side has 461
pixels. So, we need to figure out the area of this circle as:
> r <- 461/2
> totalArea <- pi * r^2
> totalArea
[1] 166913.6
> ( 461^2 - totalArea ) / totalArea
[1] 0.2732395
As a side note, the last expression in the code listing shows what percentage of area that
we would bias our estimation by if we just used the total number of pixels in the image,
27.3% is a reasonable sized bias!
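It is worth noticing that this bias does not depend on the size of the image. For a square of side s the circle has area pi*(s/2)^2, so the ratio ( s^2 - pi*s^2/4 ) / ( pi*s^2/4 ) simplifies to 4/pi - 1 no matter what s is. A quick sketch to check this, using a few image widths that I picked arbitrarily:

```r
sides <- c( 100, 461, 2000 )        # hypothetical image widths in pixels

squareArea <- sides^2
circleArea <- pi * ( sides / 2 )^2

# Relative overestimate if the whole square is used instead of the circle
bias <- ( squareArea - circleArea ) / circleArea
print( bias )

# The same quantity from the simplified formula
print( 4 / pi - 1 )
```

So any fish-eye image that fills its square frame carries the same overestimate if you forget about the corners.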
11.4.4 Step 4: Print Out The Proportion Of Canopy That Is Sky
This part is fairly easy and doesn't require much.
> numSky / totalArea
[1] 0.05765857
Footnote 1: While this part of the exercise was excellent at showing some of the
programming paradigms and how they can be combined to give an answer, it is also true that
Step 2 can be accomplished in R using the one-liner
sum( img@blue >= 0.98 & img@green >= 0.98 & img@red >= 0.98 ). Here the three conditionals
return a vector of logical variables, which the function sum() coerces into integers.
While it would have been much shorter to do it this way, it would have negated all the
quality teaching experiences that I was laying on you...
11.4.5 The Complete Program
The complete program is listed below with comments. There are a few changes in the
program that I made to make it a bit easier to work with. Comments should be self
explanatory and are indicated by lines that start with the hash character (#).
# removes all variables from memory at start of script
rm( list=ls() )

# load the pixmap library to open the image
library( pixmap )

# I put the file name into a variable so
# it could be changed easily at the top
# of the file if necessary
fileName <- "Hemiphoto monarch habitat1.ppm"

# I also put the criteria into a variable
# so we can change it in one place to see
# how the results differ
skyCriteria <- 0.98

# Read in the image and find the number of
# rows and columns in it
img <- read.pnm( file=fileName )
numRows <- img@size[1]
numCols <- img@size[2]

# Start the counter of sky pixels at zero
numSky <- 0

# Loop through each row
for( row in 1:numRows ) {
  # Loop through each column
  for( col in 1:numCols ) {
    # Evaluate the cell in each channel for
    # the sky criteria
    if( img@red[row,col] >= skyCriteria &
        img@green[row,col] >= skyCriteria &
        img@blue[row,col] >= skyCriteria )
      numSky <- numSky + 1
  }
}

# Find total area of fisheye circle
r <- numRows/2
totalArea <- pi * r^2

# Print out the percent canopy opening
percentCanopyOpen <- numSky/totalArea
cat( "Canopy Opening:", percentCanopyOpen, "\n" )
11.5 Synopsis
This has been a very simple little program that we made. Despite being simplistic, it
does show you how to go about creating a simple analysis program. R is not a general
programming language and you are not going to make large programs with it. The key
to R is knowing how to get something put together, take it a step at a time, and break
the components into reasonably sized, easy to accomplish pieces. This is where you
start.
In Chapter 12 we will build upon what has been done here when we discuss Functions.
We can encapsulate code into functions and make our lives much easier. For now, play
around with the program and the exercises and get comfortable with typing code.
11.6 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
x %% y  The modulus operator. This returns the remainder of the division x/y.
as.logical(x)  Coerces x into a logical variable if possible. See 2.4.7 for more
information on logical variables.
rm(x)  This function removes the variable x from memory.
abline(a,b)  This function plots a line with an intercept of a and a slope of b in the
current graphics window.
for(INDEX in SEQUENCE)  The main looping construct, which uses the counter INDEX to step
through each of the values contained in SEQUENCE.
while(COND)  A looping construct that continues to loop until some condition is met.
As long as COND==TRUE the loop will continue.
if(COND)  The evaluation of the condition COND. If it is TRUE then the statement following
the if statement is executed. If it is FALSE then that statement is skipped. You can
include several lines to be evaluated after this and other evaluation statements by
enclosing the code in curly brackets {}.
else if(OTHER_COND)  The second evaluation of a condition. This must not be the first
conditional (i.e., the else here implies a previous if or else if statement
that this is following).
else  The last of a conditional. If none of the previous conditions turned out to be true,
then whatever follows the else will be evaluated. It is not necessary to have
one of these at the end; you may want to do nothing unless some specific
conditions occur.
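As a brief illustration of how these pieces fit together, the sketch below uses a for() loop, the modulus operator, and an if / else if / else chain to label a few numbers, followed by a small while() loop. The variable names (values, kinds, count) and the labels are invented for this example and are not part of the canopy program.

```r
# Illustrative only: the variable names and labels here are invented
values <- c(3, 8, 0, 15)
kinds <- character(length(values))

# Visit each position in the vector in turn
for (i in 1:length(values)) {
  if (values[i] == 0) {
    kinds[i] <- "zero"
  } else if (values[i] %% 2 == 0) {
    # a remainder of 0 after dividing by 2 means the number is even
    kinds[i] <- "even"
  } else {
    kinds[i] <- "odd"
  }
}

# A while loop keeps going as long as its condition is TRUE
count <- 0
while (count < 3) {
  count <- count + 1
}

print(kinds)
print(count)
```

Note that the curly brackets let each branch hold more than one line if you need it to.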
11.7 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. Write a short program that lists all numbers from 1 to 100 and determines if they are
divisible by 2 and 3.
2. Write a program using the for () loop that prints the numbers from 42 down to 27, one
on each line.
3. List some of the assumptions that are included in how the variable numSky is deter-
mined.
4. Using the program we created in this Chapter, make a graph of percent canopy with
different cut-off values. In your opinion what would be the most biologically mean-
ingful cutoff?
5. Change the program to use a cutoff value based on the sum of the individual color
channel values rather than the current requirement that they all be simultaneously
over some threshold.
6. How many lines of output do you expect to get from the following code? HINT: Think
before you try to run this program.
while (1) {
  cat("All work and no play makes Dr. D a dull boy.\n")
}
7. Create an outline of the steps that would find the number of values in a matrix that
are equal to or greater than 20.
8. Implement the program you outlined using the matrix M <- matrix(runif(25,10,30), nrow=5) as
your input. Make sure to comment your code appropriately.
9. What is the proper syntax for the condition passed to an if statement that requires x to be
greater than 23 and y to be equal to or less than 4?
10. How many else statements can you have after an if statement?
Chapter 12
Functions
Throughout this book, we've used both built-in functions such as sqrt() and sum() as well as
some that are located in external libraries that we had to load (such as skewness() in 4.3.1
and read.pnm() in 7.2). These functions have been really helpful in making your scripts
look clean and readable and have made your life rather easy as you performed some
basic statistical analysis. Think what a pain it would have been if you had to write code
every time you wanted to calculate the sqrt() of a number... (I'm not even sure how it is
done).
Writing your own functions in R is a very useful way to save a lot of typing. You can
consider a function a small self-contained bundle of instructions that you can call
whenever you need to. Say you are picky about the way your graphics look, or that there
is a particular set of routines that you use to make translations of your data from one
format to another. Putting this code into a function and putting that function in a location
where you can get access to it whenever you need it is a real treat.
In this Chapter you will learn the following skills:
Learn the syntax required to write your own functions.
Understand the scope of a variable and why you should care.
Create a basic library of routines that you can use in the future.
12.1 Function Syntax
The format of a function basically has the following three parts:
1. The name of the function. The creation of a name for a function is just as important
as for a variable. I find it helpful to try to make the name tell me what the function
does (I'm funny that way), which means it typically starts with a verb such as
convertMissingData(), removeLameExcuses(), or makeTheGraphTheWayILikeIt().
2. The assignment to the name. Right after the name you will have the assignment of
the generic function() function to the variable (see the syntax below). This tells R that
the name is not a variable but will actually be the name of a function.
3. The function contents. This is the part that you get to write. Here is where you put
all the stuff together to do whatever it needs to do.
In general, these three parts are put together to look like:
doMyBidding <- function() {
  # Function Contents
}
Now this is a fairly boring function: it takes no arguments and doesn't return anything
to you. It is kind of like saying, "R, go to your special place and do something
but don't tell me what it is." As you write functions, they will be considerably more
complicated (and hopefully useful).
In this Chapter I will post in the raw code for the function itself followed by the output
of R from the command line. The straight posting of the function syntax allows you to
cut-and-paste them into the R interpreter (even though you will learn it better by typing
it).
Also, functions that you have defined are available in the local memory of the interpreter
in the same way as local variables are. If you use the ls() command to list the items in
memory it will show your function names alongside your variable names.
> ls()
[1] "doMyBidding" "x"
12.1.1 Returning Values From A Function
Most likely you are calling some function because you are interested in getting a
response from it. It is not common to write functions that do not give you something back in
return.
To return a value from a function, R has you put the name of the variable on the last
line of the function. An example of this is the following function that returns a single
number.
gimmeANumber <- function() {
  42
}
> gimmeANumber()
[1] 42
> gimmeANumber()
[1] 42
And a slightly better function here that actually returns a random number:
gimmeAnotherNumber <- function() {
  x <- runif(1, 1, 100)
  x
}
> gimmeAnotherNumber()
[1] 87.3278
> gimmeAnotherNumber()
[1] 64.97312
You can also use the return() function to exit the function and potentially return a value.
Here is an example that checks to see if the passed argument is the right kind. If it is
not, it prints an error and returns; otherwise it performs a calculation and then returns
the result.
gimmeHalf <- function(theValue) {
  # check to see if it is a numeric value
  # if it is then return half
  if (is.numeric(theValue)) {
    return(theValue / 2.0)
  }
  # if it isn't then complain
  else {
    cat("The value", theValue, "is not a number, try again.\n")
    return()
  }
}
> gimmeHalf(12)
[1] 6
> gimmeHalf("Hello partner!")
The value Hello partner! is not a number, try again.
NULL
Notice here that when the function left the else section by calling return() without any
arguments, the function actually returned the NULL value. If you are not interested in
having the function print NULL, something that signals to you that the value passed to the
function may be incorrect, then you can remove the last return() statement. The function
then ends with the call to cat(), which returns NULL invisibly, so nothing extra is printed.
Here is what that function would look like.
gimmeHalf <- function(theValue) {
  # check to see if it is a numeric value
  # if it is then return half
  if (is.numeric(theValue))
    return(theValue / 2.0)
  # if it isn't then complain
  else
    cat("The value", theValue, "is not a number, try again.\n")
}
> gimmeHalf(14)
[1] 7
> gimmeHalf("bob")
The value bob is not a number, try again.
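The second version of gimmeHalf() still hands something back when the argument is not numeric; it just does so quietly. A minimal sketch of how you could check this is to capture the result in a variable (the name result is made up here) and test it with is.null(). The function is repeated so the example runs on its own.

```r
# Same logic as the second gimmeHalf(), repeated so this runs on its own
gimmeHalf <- function(theValue) {
  if (is.numeric(theValue))
    return(theValue / 2.0)
  else
    cat("The value", theValue, "is not a number, try again.\n")
}

# Capture the result instead of letting R print it
result <- gimmeHalf("bob")

# cat() returns NULL invisibly, so that is what the function passes along
is.null(result)
```

This kind of check is handy when another piece of code needs to know whether the calculation actually happened.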
Vector Arguments
By default your function above can work on vectors of values just as easily as single
numbers. This is because a vector of numbers will return TRUE when asked if it is.numeric()
(see 2.4.8 for more on this). Here is an example,
> x <- seq(2, 20, by=3)
> is.numeric(x)
[1] TRUE
> is.vector(x)
[1] TRUE
> x
[1]  2  5  8 11 14 17 20
> gimmeHalf(x)
[1]  1.0  2.5  4.0  5.5  7.0  8.5 10.0
So by default, you can work with vectors of your values just as easily as single numbers.
This is pretty cool and you should try to remember the love that R has for vector
operations, because it is much faster to call your gimmeHalf() function by passing it a vector of
values than to use a loop to go through the vector and call gimmeHalf() for each individual
value...
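To make the comparison concrete, here is a small sketch that computes the halves both ways and confirms that the answers match. The gimmeHalf() used here is a stripped-down stand-in (no type checking) so that the example is self-contained, and the names halfVector and halfLoop are invented for illustration.

```r
# Simplified stand-in for gimmeHalf() so that this example runs on its own
gimmeHalf <- function(theValue) {
  theValue / 2.0
}

x <- seq(2, 20, by=3)

# Vectorized: one call handles every element
halfVector <- gimmeHalf(x)

# Loop: one call per element, storing each result by hand
halfLoop <- numeric(length(x))
for (i in 1:length(x)) {
  halfLoop[i] <- gimmeHalf(x[i])
}

# Both approaches give the same values
identical(halfVector, halfLoop)
```

The vectorized call is a single line and leaves the looping to R, which is why it is usually the better habit.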
Here is a slightly longer example of a function. Notice that inside the function, I have
added some comments. This is a very good idea because it allows you to document what
you are doing inside the function. In fact, I typically write functions by:
1. Write the signature of the function, the funcName <- function(){ } part.
2. Using comments, write the sequence of events that have to occur inside the function
so I can see what needs to be done (breaking large problems into small ones here)
3. Fill in the code to allow R to do my bidding.
So let's walk through these steps and make a function. The purpose of this function is
to get a little encouragement for my programming endeavors by having R return some
nice praise for me.
Step 1: Create signature: The signature for this function will be:
giveMeSomeMomLove <- function() {}
Step 2: Using comments create logic of function: The overall goal of this function is to
return a random statement from my mother, so I will have to set up some statements,
find a random one, and then return it.
giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings
  # pick a random number to use as index for responses
  # put the vector name and the index on the last line to return the saying
}
Step 3: Fill in the R logic: Now that I have the comments set out, it is fairly easy for me
to use them as a guide in laying out the logic of the function. You do not have to document
every line of code in your functions, but if you put in enough so that it is obvious what is
going to happen next, you will find yourself being happy with your past self more often
than hating what you had forgotten to do.
giveMeSomeMomLove <- function() {
  # set up a vector of loving mother sayings
  momSayings <- c("Honey, your dad and I think you are doing just fine.",
                  "Come home this weekend, I made your favorite dessert.",
                  "We think you are the BEST student at VCU.",
                  "You know I took calculus back in college, maybe I can help.",
                  "I just know you'll be able to find a good job after college.")
  # pick a random number to use as index for responses
  resp = round(runif(1, 1, length(momSayings)))
  # put the vector name and the index on the last line to return the saying
  momSayings[resp]
}
> giveMeSomeMomLove()
[1] "We think you are the BEST student at VCU."
> giveMeSomeMomLove()
[1] "Honey, your dad and I think you are doing just fine."
> giveMeSomeMomLove()
[1] "You know I took calculus back in college, maybe I can help."
Feel free to add some of your own mother sayings here.
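One caveat about the index used above: round(runif(1, 1, n)) does not pick every saying equally often, because only half as wide a range of random values rounds to the first and last positions as to the ones in between. A possible alternative, which is a swap-in and not the approach used in the text, is the built-in sample() function. The function name pickOne is made up for this sketch.

```r
# Hypothetical helper: return one element of a vector, each equally likely
pickOne <- function(sayings) {
  # sample(n, 1) draws a single whole number between 1 and n
  idx <- sample(length(sayings), 1)
  sayings[idx]
}

momSayings <- c("Honey, your dad and I think you are doing just fine.",
                "Come home this weekend, I made your favorite dessert.",
                "We think you are the BEST student at VCU.")

pickOne(momSayings)
```

Because the result is random, the saying you see will differ from run to run.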
12.1.2 Passing Values To A Function
The most common way you will interact with a function is probably by giving it some
variables and expecting to get something back.
getIdentityMatrix <- function(numRows) {
  # make a square matrix with all zeros
  I <- matrix(0, nrow=numRows, ncol=numRows)
  # make the diagonal all ones
  diag(I) <- 1
  # return it to the caller
  I
}
> getIdentityMatrix(2)
     [,1] [,2]
[1,]    1    0
[2,]    0    1
> getIdentityMatrix(5)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
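As a quick sanity check on a function like this, you can compare its output against something you already trust. The sketch below assumes the built-in diag() function, which returns an identity matrix when it is given a single whole number, and confirms that the two agree. The function is repeated so the example runs on its own.

```r
# Repeated from the text so that this example is self-contained
getIdentityMatrix <- function(numRows) {
  I <- matrix(0, nrow=numRows, ncol=numRows)
  diag(I) <- 1
  I
}

# diag(4) builds a 4 x 4 identity matrix directly
ours <- getIdentityMatrix(4)
builtIn <- diag(4)

# TRUE if every element matches
all(ours == builtIn)
```

Writing your own version is still a good exercise in passing a value in and getting a result back.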
Default Values
Functions can have default values associated with variables that are passed to them.
We've seen this many times so far as you've looked up the function signatures
of built-in functions. This is a very convenient feature for you and your users. In general,
when you think of writing functions you should not try to make them so specific that
you have a lot of different functions that do almost the same thing. Rather, you should
make them robust, and if you can combine a few functions into a single one whose
behavior changes depending upon a parameter you pass to it, it is better overall form. For
example, the function getIdentityMatrix() returns a square matrix with ones down the
diagonal. This matrix is a pretty special one (see ??) in matrix analysis and probably
should have its own function just because of its status. However, there are a number of
reasons why you may need a square matrix with a single value down the diagonal and
perhaps it would be more robust to create a function such as:
getDiagonalMatrix <- function(size, value=1) {
  theMat <- matrix(0, nrow=size, ncol=size)
  diag(theMat) <- value
  theMat
}
> getDiagonalMatrix(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> getDiagonalMatrix(3, 42)
     [,1] [,2] [,3]
[1,]   42    0    0
[2,]    0   42    0
[3,]    0    0   42
Now this function has a default value for the diagonal (i.e., 1), producing the
identity matrix I by default; however, it can also produce any diagonal matrix when you
pass an additional parameter to the function. If you do not pass it to the function, it
is assigned in the signature for you by default. This makes the function perhaps more
robust and useful. Of course, this is all up to you: you are the programmer here and you
get to make the decisions. After all, there are several different ways to get the correct
result when programming, and as biologists we should focus on the biology and use
tools like R as simple tools.
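Arguments can be matched by position, as in getDiagonalMatrix(3, 42), or by name, in which case the order no longer matters. The short sketch below shows both styles giving the same matrix. The variable names byPosition, byName, and byDefault are invented for this illustration.

```r
# Repeated from the text so that this example is self-contained
getDiagonalMatrix <- function(size, value=1) {
  theMat <- matrix(0, nrow=size, ncol=size)
  diag(theMat) <- value
  theMat
}

# Matched by position: the first is size, the second is value
byPosition <- getDiagonalMatrix(3, 42)

# Matched by name: the order can be swapped
byName <- getDiagonalMatrix(value=42, size=3)

# Leaving value out falls back to the default of 1
byDefault <- getDiagonalMatrix(size=3)

identical(byPosition, byName)
```

Naming the arguments makes a call easier to read later, especially when a function has several optional parameters.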
12.2 Scope
The scope of a variable determines which value a name refers to, depending upon where in
the code it is used. This topic is a pretty important one and can be a bit tricky at times.
myFunc <- function(x) {
  x <- 42
  cat("x inside function is", x, "\n")
}
> x <- 21
> x
[1] 21
> myFunc(x)
x inside function is 42
> x
[1] 21
myFunc <- function(a) {
  x <- 42
  cat("other x inside function is", x, "\n")
}
> x <- 23
> myFunc(x)
other x inside function is 42
> x
[1] 23
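In both cases the x assigned inside the function is a separate, local variable, which is why the x at the command line keeps its value. The sketch below restates that and adds one further case that the examples above do not show: when a function has no local variable of a given name, R looks for it where the function was defined. The function names changeLocal and readOuter are made up for this illustration.

```r
x <- 21

# Assigning to x inside the function makes a local x; the outer x is untouched
changeLocal <- function() {
  x <- 42
  x
}

# With no local x, R looks for x where the function was defined
readOuter <- function() {
  x + 1
}

insideValue <- changeLocal()
outerValue <- readOuter()

cat("local x was", insideValue, "\n")
cat("outer x plus one is", outerValue, "\n")
cat("x at the command line is still", x, "\n")
```

Relying on outside variables makes a function harder to reuse, so passing values in as arguments is usually the safer design.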
12.3 Useful Functions
The following functions were introduced in this chapter and you will be required to use
them for the exercises. To get more information on any of these functions, use the R
help system.
function(args) code  Creates a function that runs the code inside code and requires the
arguments args.
return(x)  Returns the value x from the function, which means the function is immediately
exited and no more of its code is executed.
12.4 Exercises
The following exercises are meant to help you understand the items presented in this
Chapter.
1. Create a function that allows you to pass it a regression model and it will return a
string that contains the formula for the model as you would like to have it displayed
on a graph.
2. Create a function that takes a single vector of values and creates a histogram and
density line from that data in a new graphics window.
3. Explain scope and how it pertains to the values assigned to variables.
4. Create a function that takes an ANOVA or Regression model and saves the ANOVA
table to a file. You should probably allow the user to pass a file name to the function.
5. How do you set default values for a function when you write it?
6. Explain how you get your functions to accept vector arguments.
7. Create a function that returns random numbers but allow the user to set an optional
argument that will only return even numbers.
8. How would you remove a function from the memory of R ?
9. Let's assume that you have a folder full of data files named Data1, Data2, Data3, . . .,
Data40. Write a function that creates these file names dynamically. You will want to
allow the user to specify the base name of the files (e.g., Data) as well as the starting
and ending numbers (e.g., 1 and 40) but set the starting number to default to 0.
10. How do you make sure that the arguments that are passed to your functions are the
right kind of variables? For example, what if I passed the variable x <- "this is the end"
to a function that expects a number?
Appendix A
Answers to Exercises
In this section you will find answers to the odd-numbered Exercises presented in each
Chapter. These answers are meant to help you start on the exercises, facilitating your
completion of the remaining questions. It is my recommendation that you look at the
answers only after you have completed the exercises, just to make sure that what you thought
you were doing is the correct thing. Don't look ahead....
Answers to Chapter 2.
Appendix B
Installing Additional Libraries
The R statistical computing environment is made more robust by the addition of external
libraries. Libraries can be written in R , C, or FORTRAN by you or other people who want to
expand the functionality and utility of R .
B.1 Library Availability
There is a list of libraries available at http://cran.r-project.org. As of the time of this
writing, there are 1621 different packages in the repository. All are available
for you to install and use at your discretion. Each should also come with a set of
documentation covering all the functions that are included in the library, descriptions of
the data sets, and some overall discussion of the library.
B.2 Installing Libraries
B.2.1 Using install.packages() As A GUI
The easiest way for you to install a library is to do so from within R itself. To do this,
your machine must be connected to the internet. R knows how to find, download, and
install binary versions of packages using a Tcl/Tk GUI interface.
If you conduct the installation as a normal user that does not have administrative
privileges on your computer, the libraries will be installed in a location that is in your own
home directory. Depending upon which operating system you have, this will be in
different places. The main thing to worry about here is that when you install libraries into
your own directory they will only be available to that user and will not be available for
any other users on that machine. If two people use the same machine then they will have
to install it twice, once in each home directory. Conversely, if you have administrative
privileges on the machine you are using, you can install the libraries into a location that
everyone that uses that machine can access.
To start the installation process, issue the command:
> install.packages()
This will bring up a window (using Tcl/Tk, so it won't look quite like the normal
window on your operating system) that allows you to select which mirror you would
like to use for downloading. An example window is shown in Figure B.1. In general,
you should select a location that is geographically close to your current location.
All of these mirror servers are kept up-to-date pretty well and you shouldn't find any
differences among the packages on any of them.
Once you have selected your preferred mirror server, another window will be presented
(resembling Figure B.2) that lists all the packages that are available to be installed.
Be careful here: this simple interface does not check to see which packages you already
have installed, it only lists all the packages that are at your disposal. So just because
there is a package on that list doesn't mean that you do not already have it installed on
your machine.
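Since the selection window does not tell you what is already installed, one way to check from the command line is sketched below. It relies on the installed.packages() function, whose row names are the names of the installed packages. The helper name findMissingPackages and the wanted vector are made up for this illustration, and the actual install step is left commented out because it needs an internet connection.

```r
# Hypothetical helper: returns the requested package names that are not yet installed
findMissingPackages <- function(pkgNames) {
  installed <- rownames(installed.packages())
  pkgNames[!(pkgNames %in% installed)]
}

# The packages you would like to have (stats ships with R itself)
wanted <- c("pixmap", "stats")
toInstall <- findMissingPackages(wanted)

# Show what, if anything, still needs to be installed
print(toInstall)

# Only install what is actually missing (uncomment to run)
# if (length(toInstall) > 0) install.packages(toInstall)
```

This avoids downloading a package a second time when it is already on your machine.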
Select the package, or packages, that you want to install from the list. To select more
than one, click on more than one... To deselect a package, click on it a second time and it
will be deselected. Once you hit the OK button on this window, the install.packages() function
will look to see what dependencies the selected packages have (e.g., PackageA requires
PackageB but you didn't know that and didn't select it). Packages, along with any such
dependencies, will be downloaded and installed in the correct location. After they are
installed, you should be able to use them immediately (i.e., without restarting R).
B.2.2 Using install.packages() For Specic Libraries
If you know the name of the package that you are interested in installing you can use
the install.packages() function directly by passing it a name, or a vector of names, of the packages
you are interested in. This will skip the Package Selection Window step shown in Figure
B.2. The syntax for this would be:
> install.packages("theNameOfTheLibraryNeeded")
Libraries have also been partitioned into different Task Views. These are meta-packages
that contain several different packages under a particular theme. Below is a list of the
views that are available as of January 2009 (these categories and descriptions are lifted
directly from the website).
Bayesian Bayesian Inference
ChemPhys Chemometrics and Computational Physics
Cluster Cluster Analysis & Finite Mixture Models
Distributions Probability Distributions
Econometrics Computational Econometrics
Environmetrics Analysis of Ecological and Environmental Data
ExperimentalDesign Design of Experiments (DoE) & Analysis of Experimental Data
Figure B.1: Example of CRAN mirror window as viewed on Linux
Figure B.2: All packages that can be installed from the selected mirror server on my machine.
Finance Empirical Finance
Genetics Statistical Genetics
Graphics Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
gR gRaphical Models in R
MachineLearning Machine Learning & Statistical Learning
Multivariate Multivariate Statistics
NaturalLanguageProcessing Natural Language Processing
Optimization Optimization and Mathematical Programming
Pharmacokinetics Analysis of Pharmacokinetic Data
Psychometrics Psychometric Models and Methods
Robust Robust Statistical Methods
SocialSciences Statistics for the Social Sciences
Spatial Analysis of Spatial Data
Survival Survival Analysis
TimeSeries Time Series Analysis
You can install all the libraries in one of these views by using the install.views()
function from the ctv package:
> install.packages("ctv")
> library(ctv)
> install.views("ViewName")
You will still have to specify the mirror server to use and once you do, R will take it
from there. This could be a lengthy process as it may require numerous packages to be
downloaded and installed. Be patient.
B.2.3 From the Command Line
Finally, there is one other method that I typically use on my machines. This is because I
typically download the source packages rather than the pre-compiled binaries. However,
this method also works with binaries. You can download the package from the CRAN
site directly and then open a command-line Terminal and change to the directory where
the package is located. From there issue the command:
R CMD INSTALL ThePackageYouDownloaded.tar.gz
and R will install it for you. If you do this as the root or administrator user, it will
install it in a globally accessible location so any user on that machine will have access
to it.
Bibliography
Caswell, H. (2001). Matrix Population Models: Construction, Analysis, and Interpretation.
Sinauer Associates, Sunderland, Mass., 2nd edition.
Index
class, 115
clustal file, 153
coercion, 9
comment character (#), 172
data types, 8
character, 10
complex, 11
constant, 11
data frame, 18, 26
factors, 16
integer, 8
list, 17
logical, 13
matrix, 14
NULL, 31
numeric, 9
raw, 12
vector, 13
distributions
dchisq, 43, 68
df, 43, 68
dnorm, 43
pchisq, 43, 54
pf, 43
pnorm, 43
qchisq, 43, 44, 54
qf, 43, 46
qnorm, 43
qt, 46
rchisq, 43
rf, 43
rnorm, 43, 54, 58, 68
rpois, 63
runif, 65, 107
fasta file, 152
figure
axis labels, 56
title, 56
functions, 6
%%, 186
abline, 186
any, 150, 161
as.factor, 72, 86
as.index, 186
as.matrix, 86, 144
as.matrix(), 121
attributes, 33
barf, 123
barplot, 137
binom.test, 73
c, 86
cat, 118, 161, 172
cbind, 29, 40, 86, 107
class, 20, 33
colnames, 86, 157
components, 6
cov, 64
density, 57, 58
det, 128
diag(), 126
dim, 123
dist.dna, 154
eigen, 132
else, 186
else if, 186
expression, 161
for, 186
format, 161
function, 195
giveMeSomeMomLove, 192
ginv, 128
grep, 150
grey, 118
gsub, 150
image, 118
index, 127, 186
kurtosis, 60
length, 20, 86
levels, 17
lm, 128
load, 32, 40
log, 7
ls, 32
matrix, 121, 145
max, 61, 118
mean, 58, 129
merge, 39, 40
min, 61
names, 33
nchar, 148, 161
nj, 154
par, 49
paste, 11, 20, 95, 149
plot, 47, 155
print, 172
q, 31
qchisq, 45
range, 50, 61, 81, 86
rbind, 28, 40
read.dna, 153
read.table, 27, 28, 86, 145
read.table(), 121
rep, 14, 20
return, 191, 195
rexp, 54
rm, 32, 40, 172, 186
rnorm, 54, 56, 118
round, 107
row.names, 33
rownames, 86, 157
rpois, 52, 54
save, 32, 40
sd, 58
seq, 14, 20
skewness, 59
source, 172
strsplit, 148
sub, 150
subset, 35, 36, 40
substring, 149, 161
sum, 127
summary, 17, 20, 86, 93, 107, 172
t, 128
table, 17, 68, 72
unlist, 148, 162
var, 58
while, 186
genetic distance, 154
grahics
pdf, 51
graphics
abline, 94, 107
barplot, 137, 144
bg, 48
bmp, 51
bty, 48
bxp, 85
cairo_pdf, 51
cex, 48
col, 48
density plot, 57
dev.copy, 52, 53
dev.off, 52, 53
fg, 48
hist, 52, 55
jpeg, 51, 53
legend, 142, 145
line plot, 47
lty, 48
lwd, 48
main, 48
mfrow, 48, 61
optional parameters, 48
overlaid, 49
par, 48
pch, 48, 107
pictex, 51
plot, 46, 68, 85
png, 51
postscript, 51
quartz, 51
rug, 104
scatter plot, 46, 47
sub, 48
text, 107
tiff, 51
topo.colors, 52
type, 48
x11, 51
xlab, 48
xlim, 49
ylab, 48
ylim, 49
matrix
%*%, 144
addition, 123
det, 144
diag, 144
diagonal, 126
dim, 144
eigen, 145
element-wise multiplication, 124
ginv, 145
Hadamard product, 124
multiplication, 124
scalar addition, 123
scalar multiplication, 124
scalar subtraction, 123
Schur product, 124
subtraction, 123
t, 145
trace, 127
Neighbor Joining, 154
operator
assignment, 18
logical, 19
numerical, 18
operator order, 18
Pinaceae, 153
stats
anova, 93, 107
aov, 107
binom.test, 86
chisq.test, 72, 76, 86
cor.test, 67, 79, 86
interaction formula, 99
Kruskal-Wallis Test, 82, 83
kruskal.test, 86
lm, 92, 107
Mann-Whitney, 80
mean, 58, 68, 81, 86
median, 63
nj, 162
no intercept, 100
quantile, 63
sd, 68
step, 107
t.test, 107
TukeyHSD, 107
var, 58, 68
Wilcoxon, 80
Wilcoxon Test, 81
variable, 7