An Introduction To R: W. N. Venables, D. M. Smith and The R Development Core Team
W. N. Venables, D. M. Smith
and the R Development Core Team
Copyright © 1990 W. N. Venables
Copyright © 1992 W. N. Venables & D. M. Smith
Copyright © 1997 R. Gentleman & R. Ihaka
Copyright © 1997, 1998 M. Maechler
Copyright © 1999–2001 R Development Core Team
Permission is granted to make and distribute verbatim copies of this manual provided the
copyright notice and this permission notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of this manual under the
conditions for verbatim copying, provided that the entire resulting derived work is distributed
under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this manual into another
language, under the above conditions for modified versions, except that this permission notice
may be stated in a translation approved by the R Development Core Team.
Table of Contents
Preface ... 1
8 Probability distributions ... 34
8.1 R as a set of statistical tables ... 34
8.2 Examining the distribution of a set of data ... 35
8.3 One- and two-sample tests ... 37
11 Statistical models in R ... 52
11.1 Defining statistical models; formulae ... 52
11.1.1 Contrasts ... 54
11.2 Linear models ... 55
11.3 Generic functions for extracting model information ... 55
11.4 Analysis of variance and model comparison ... 56
11.4.1 ANOVA tables ... 57
11.5 Updating fitted models ... 57
11.6 Generalized linear models ... 58
11.6.1 Families ... 58
11.6.2 The glm() function ... 59
11.7 Nonlinear least squares and maximum likelihood models ... 62
11.7.1 Least squares ... 62
11.7.2 Maximum likelihood ... 63
11.8 Some non-standard models ... 64
12 Graphical procedures ... 65
12.1 High-level plotting commands ... 65
12.1.1 The plot() function ... 65
12.1.2 Displaying multivariate data ... 66
12.1.3 Display graphics ... 66
12.1.4 Arguments to high-level plotting functions ... 67
12.2 Low-level plotting commands ... 68
12.2.1 Mathematical annotation ... 70
12.2.2 Hershey vector fonts ... 70
12.3 Interacting with graphics ... 70
12.4 Using graphics parameters ... 71
12.4.1 Permanent changes: The par() function ... 71
12.4.2 Temporary changes: Arguments to graphics functions ... 72
12.5 Graphics parameters list ... 72
12.5.1 Graphical elements ... 72
12.5.2 Axes and tick marks ... 73
12.5.3 Figure margins ... 74
12.5.4 Multiple figure environment ... 75
12.6 Device drivers ... 77
12.6.1 PostScript diagrams for typeset documents ... 77
12.6.2 Multiple graphics devices ... 78
12.7 Dynamic graphics ... 79
A A sample session ... 80
B Invoking R ... 84
B.1 Invoking R under UNIX ... 84
B.2 Invoking R under Windows ... 87
B.3 Invoking R on a Macintosh ... 89
E Concept index ... 97
F References ... 99
Preface 1
Preface
This introduction to R is derived from an original set of notes describing the S and
S-Plus environments written by Bill Venables and David M. Smith (Insightful Corporation).
We have made a number of small changes to reflect differences between the R and S
programs, and expanded some of the material.
We would like to extend warm thanks to Bill Venables for granting permission to
distribute this modified version of the notes in this way, and for being a supporter of R from
way back.
Comments and corrections are always welcome. Please address email correspondence to
R-core@r-project.org.
classical and modern statistical techniques have been implemented. Some of these are built
into the base R environment, but many are supplied as packages. (Currently the distinction
is largely a matter of historical accident.) There are about 8 packages supplied with R
(called “standard” packages) and many more are available through the CRAN family of
Internet sites (via http://cran.r-project.org).
Most classical statistics and much of the latest methodology are available for use with R,
but users will need to be prepared to do a little work to find them.
There is an important difference in philosophy between S (and hence R) and the other
main statistical systems. In S a statistical analysis is normally done as a series of steps,
with intermediate results being stored in objects. Thus whereas SAS and SPSS will give
copious output from a regression or discriminant analysis, R will give minimal output and
store the results in a fit object for subsequent interrogation by further R functions.
> q()
At this point you will be asked whether you want to save the data from your R session.
You can respond yes, no or cancel (a single letter abbreviation will do) to save the
data before quitting, quit without saving, or return to the R session. Data which is
saved will be available in future R sessions.
Further R sessions are simple.
1. Make ‘work’ the working directory and start the program as before:
$ cd work
$ R
2. Use the R program, terminating with the q() command at the end of the session.
To use R under Windows the procedure to follow is basically the same. Create a folder
as the working directory, and set that in the ‘Start In’ field in your R shortcut. Then
launch R by double clicking on the icon.
> example(topic)
Windows versions of R have other optional help systems: use
> ?help
for further details.
3
The leading “dot” in this file name makes it invisible in normal file listings in UNIX.
Chapter 2: Simple manipulations; numbers and vectors 7
1
With other than vector types of argument, such as list mode arguments, the action of c() is rather
different. See Section 6.2.1 [Concatenating lists], page 28.
2
The underscore character, ‘_’ is an allowable synonym for the left pointing assignment operator ‘<-’,
however we discourage this option, as it can easily lead to much less readable code.
3
Actually, it is still available as .Last.value before any other statements are executed
The vector assigned must match the length of the index vector, and in the case of a
logical index vector it must again be the same length as the vector it is indexing.
For example
> x[is.na(x)] <- 0
replaces any missing values in x by zeros and
> y[y < 0] <- -y[y < 0]
has the same effect as
> y <- abs(y)
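The two index assignments above can be tried on small vectors. This is a minimal sketch; the values of x and y are made up for illustration and do not come from the text.

```r
# Illustrative data (assumed values, chosen only for the demonstration).
x <- c(1.5, NA, 3, NA)
x[is.na(x)] <- 0             # logical index vector has the same length as x

y <- c(-2, 4, -6, 8)
y2 <- y
y2[y2 < 0] <- -y2[y2 < 0]    # flip the sign of the negative entries only

print(x)
print(identical(y2, abs(y))) # same effect as abs(y)
```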
1
numeric mode is actually an amalgam of two distinct modes, namely integer and double precision, as
explained in the manual.
2
Note however that length(object) does not always contain intrinsic useful information, e.g., when object
is a function.
Chapter 3: Objects, their modes and attributes 14
3
In general, coercion from numeric to character and back again will not be exactly reversible, because of
roundoff errors in the character representation.
1
Foreign readers should note that there are eight states and territories in Australia, namely the Australian
Capital Territory, New South Wales, the Northern Territory, Queensland, South Australia, Tasmania,
Victoria and Western Australia.
Chapter 4: Ordered and unordered factors 17
but is otherwise identical to factor. For most purposes the only difference between ordered
and unordered factors is that the former are printed showing the ordering of the levels, but
the contrasts generated for them in fitting linear models are different.
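As a small sketch of the difference, the following builds the same data as an unordered and as an ordered factor. The labels and the object names bands, plain and ranked are hypothetical.

```r
# Hypothetical income bands; labels are assumptions for illustration.
bands  <- c("lo", "hi", "mid", "lo")
plain  <- factor(bands,  levels = c("lo", "mid", "hi"))
ranked <- ordered(bands, levels = c("lo", "mid", "hi"))

print(plain)                  # levels listed without an ordering
print(ranked)                 # levels printed with their ordering
print(is.ordered(plain))
print(is.ordered(ranked))
print(ranked[1] < ranked[2])  # comparisons are defined for ordered factors
```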
Chapter 5: Arrays and matrices 19
5.1 Arrays
An array can be considered as a multiply subscripted collection of data entries, for
example numeric. R allows simple facilities for creating and handling arrays, and in particular
the special case of matrices.
A dimension vector is a vector of positive integers. If its length is k then the array is
k-dimensional, e.g., a matrix is a 2-dimensional array. The values in the dimension vector
give the upper limits for each of the k subscripts. The lower limits are always 1.
A vector can be used by R as an array only if it has a dimension vector as its dim
attribute. Suppose, for example, z is a vector of 1500 elements. The assignment
> dim(z) <- c(3,5,100)
gives it the dim attribute that allows it to be treated as a 3 by 5 by 100 array.
Other functions such as matrix() and array() are available for simpler and more natural
looking assignments, as we shall see in Section 5.4 [The array() function], page 21.
The values in the data vector give the values in the array in the same order as they
would occur in FORTRAN, that is “column major order,” with the first subscript moving
fastest and the last subscript slowest.
For example, if the dimension vector for an array, say a, is c(3,4,2) then there are
3 * 4 * 2 = 24 entries in a and the data vector holds them in the order a[1,1,1], a[2,1,1],
..., a[2,4,2], a[3,4,2].
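Column major order can be checked directly by filling an array with the integers 1 to 24, so that each entry records its own position in the data vector. This is a sketch; the helper pos() is a hypothetical name introduced here, not an R function.

```r
# Fill a 3 x 4 x 2 array with 1..24: each entry equals its position
# in the underlying data vector.
a <- array(1:24, dim = c(3, 4, 2))

# Hypothetical helper: position of a[i, j, k] in the data vector,
# with the first subscript moving fastest.
pos <- function(i, j, k, d = dim(a)) {
  i + d[1] * (j - 1) + d[1] * d[2] * (k - 1)
}

print(a[2, 1, 1])     # second entry of the data vector
print(a[1, 2, 1])     # first subscript wraps after 3 entries
print(pos(2, 4, 2))   # position of a[2,4,2]
print(a[3, 4, 2])     # last entry

# The same structure obtained by assigning a dim attribute to a vector.
z <- 1:1500
dim(z) <- c(3, 5, 100)
print(dim(z))
```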
perm[j] becoming the new j-th dimension. The easiest way to think of this operation is
as a generalization of transposition for matrices. Indeed if A is a matrix, (that is, a doubly
subscripted array) then B given by
> B <- aperm(A, c(2,1))
is just the transpose of A. For this special case a simpler function t() is available, so we
could have used B <- t(A).
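A short sketch of both cases follows: the matrix transpose, and a three-way array whose dimensions are permuted. The arrays A and Z are illustrative.

```r
# Matrix case: aperm with c(2, 1) agrees with t().
A <- matrix(1:6, nrow = 2)
B <- aperm(A, c(2, 1))
print(all.equal(B, t(A)))

# Three-way case: old dimension perm[j] becomes new dimension j.
Z <- array(1:24, dim = c(2, 3, 4))
P <- aperm(Z, c(3, 1, 2))
print(dim(P))                        # dim(Z)[c(3, 1, 2)]
print(P[4, 2, 3] == Z[2, 3, 4])      # P[k, i, j] holds Z[i, j, k]
```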
1
Note that x %*% x is ambiguous, as it could mean either x'x or xx', where x is the column form. In such
cases the smaller matrix seems implicitly to be the interpretation adopted, so the scalar x'x is in this
case the result. The matrix xx' may be calculated either by cbind(x) %*% x or x %*% rbind(x) since the
result of rbind() or cbind() is always a matrix.
6.1 Lists
An R list is an object consisting of an ordered collection of objects known as its
components.
There is no particular need for the components to be of the same mode or type, and,
for example, a list could consist of a numeric vector, a logical value, a matrix, a complex
vector, a character array, a function, and so on. Here is a simple example of how to make
a list:
> Lst <- list(name="Fred", wife="Mary", no.children=3,
child.ages=c(4,7,9))
Components are always numbered and may always be referred to as such. Thus if Lst is
the name of a list with four components, these may be individually referred to as Lst[[1]],
Lst[[2]], Lst[[3]] and Lst[[4]]. If, further, Lst[[4]] is a vector subscripted array then
Lst[[4]][1] is its first entry.
If Lst is a list, then the function length(Lst) gives the number of (top level) components
it has.
Components of lists may also be named, and in this case the component may be referred
to either by giving the component name as a character string in place of the number in
double square brackets, or, more conveniently, by giving an expression of the form
> name$component_name
for the same thing.
This is a very useful convention as it makes it easier to get the right component if you
forget the number.
So in the simple example given above:
Lst$name is the same as Lst[[1]] and is the string "Fred",
Lst$wife is the same as Lst[[2]] and is the string "Mary",
Lst$child.ages[1] is the same as Lst[[4]][1] and is the number 4.
One can also use the names of the list components in double square brackets,
i.e., Lst[["name"]] is the same as Lst$name. This is especially useful when the name of
the component to be extracted is stored in another variable, as in
> x <- "name"; Lst[[x]]
It is very important to distinguish Lst[[1]] from Lst[1]. ‘[[...]]’ is the operator
used to select a single element, whereas ‘[...]’ is a general subscripting operator. Thus the
former is the first object in the list Lst, and if it is a named list the name is not included.
The latter is a sublist of the list Lst consisting of the first entry only. If it is a named list,
the name is transferred to the sublist.
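The distinction can be seen with the list Lst defined earlier. A minimal sketch; the names first and sub are introduced here only for illustration.

```r
Lst <- list(name = "Fred", wife = "Mary", no.children = 3,
            child.ages = c(4, 7, 9))

first <- Lst[[1]]   # the component itself: a character string
sub   <- Lst[1]     # a list of length one that keeps the name

print(class(first))
print(class(sub))
print(names(sub))

# Selecting by a name held in another variable.
x <- "name"
print(identical(Lst[[x]], Lst$name))
print(Lst[[4]][1])  # first entry of the fourth component
```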
The names of components may be abbreviated down to the minimum number of letters
needed to identify them uniquely. Thus Lst$coefficients may be minimally specified as
Lst$coe and Lst$covariance as Lst$cov.
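A sketch of abbreviation with the $ form follows, using an illustrative list with components coefficients and covariance (the values are made up). Note that double square brackets match names exactly unless exact = FALSE is given.

```r
# Illustrative list; the component values are assumptions.
Lst <- list(coefficients = c(1.2, 0.5), covariance = diag(2))

print(Lst$coe)                      # unique abbreviation of "coefficients"
print(Lst$cov)                      # unique abbreviation of "covariance"
print(is.null(Lst$co))              # "co" is ambiguous, so nothing is matched
print(is.null(Lst[["coe"]]))        # [[ ]] requires the exact name by default
print(Lst[["coe", exact = FALSE]])  # partial matching must be requested
```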
The vector of names is in fact simply an attribute of the list like any other and may be
handled as such. Other structures besides lists may, of course, similarly be given a names
attribute also.
Chapter 6: Lists and data frames 28
• gather together all variables for any well defined and separate problem in a data frame
under a suitably informative name;
• when working with a problem attach the appropriate data frame at position 2, and use
the working directory at level 1 for operational quantities and temporary variables;
• before leaving a problem, add any variables you wish to keep for future reference to
the data frame using the $ form of assignment, and then detach();
• finally remove all unwanted variables from the working directory and keep it as clean
of left-over temporary variables as possible.
In this way it is quite simple to work with many problems in the same directory, all of
which have variables named x, y and z, for example.
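The conventions above might be followed as in this sketch. The data frame lentils and its variables u, v and w are hypothetical names chosen for the example.

```r
# Hypothetical problem data gathered in one data frame.
lentils <- data.frame(u = c(1, 2, 3), v = c(10, 20, 30))

attach(lentils)          # attached at position 2 of the search path by default
pos2 <- search()[2]      # record what sits at position 2
w <- u + v               # operational quantity created in the workspace

lentils$w <- w           # keep it for later using the $ form of assignment
detach("lentils")        # leave the problem
rm(w)                    # tidy up the temporary variable

print(pos2)
print(names(lentils))
```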
1
See the on-line help for autoload for the meaning of the second term.
Chapter 7: Reading data from files 31
Large data objects will usually be read as values from external files rather than entered
during an R session at the keyboard. R input facilities are simple and their requirements
are fairly strict and even rather inflexible. There is a clear presumption by the designers of
R that you will be able to modify your input files using other tools, such as file editors or
Perl (see footnote 1), to fit in with the requirements of R. Generally this is very simple.
If variables are to be held mainly in data frames, as we strongly suggest they should be,
an entire data frame can be read directly with the read.table() function. There is also a
more primitive input function, scan(), that can be called directly.
For more details on importing data into R and also exporting data, see the R Data
Import/Export manual.
If the file has one fewer item in its first line than in its second, this arrangement is
presumed to be in force. So the first few lines of a file to be read as a data frame might
look as follows.
Input file form with names and row labels:
By default numeric items (except row labels) are read as numeric variables and
non-numeric variables, such as Cent.heat in the example, as factors. This can be changed if
necessary.
The function read.table() can then be used to read the data frame directly
> HousePrice <- read.table("houses.data")
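To try this without an existing file, the sketch below writes a small file in the form described and reads it back. The file contents, including the row labels h1 to h3 and all numbers, are invented for illustration. The argument stringsAsFactors = TRUE is passed explicitly because the default for non-numeric columns differs between R releases.

```r
# Write a small illustrative file: the header line has one fewer item
# than the data lines, so the first column is taken as row labels.
tmp <- tempfile(fileext = ".data")
writeLines(c(
  "Price Floor Area Rooms Age Cent.heat",
  "h1 52.00 111.0 830 5 6.2 no",
  "h2 54.75 128.0 710 5 7.5 no",
  "h3 57.50 101.0 1000 5 4.2 yes"
), tmp)

# Request factors explicitly rather than relying on the default.
HousePrice <- read.table(tmp, stringsAsFactors = TRUE)
unlink(tmp)

print(rownames(HousePrice))
print(sapply(HousePrice, class))
```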
Often you will want to omit the row labels and use the default labels.
In this case the file may omit the row label column as in the following.
1
Under UNIX, the utilities Sed or Awk can be used.
Input file form without row labels:
data()
and to load one of these use, for example,
data(infert)
In most cases this will load an R object of the same name, usually a data frame. However,
in a few cases it loads several objects, so see the on-line help for the object to see what to
expect.
8 Probability distributions
Prefix the name given here by ‘d’ for the density, ‘p’ for the CDF, ‘q’ for the quantile
function and ‘r’ for simulation (random deviates). The first argument is x for dxxx, q
for pxxx, p for qxxx and n for rxxx (except for rhyper and rwilcox, for which it is nn).
The non-centrality parameter ncp is currently only available for the CDFs and a few other
functions: see the on-line help for current details.
The pxxx and qxxx functions all have logical arguments lower.tail and log.p and
the dxxx ones have log. This allows, e.g., getting the cumulative (or “integrated”) hazard
function, H(t) = -log(1 - F(t)), by
- pxxx(t, ..., lower.tail = FALSE, log.p = TRUE)
or more accurate log-likelihoods (by dxxx(..., log = TRUE)), directly.
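As a sketch of why the direct forms help, take the exponential distribution, for which H(t) = rate * t. The choice of distribution and the numbers used are illustrative assumptions.

```r
# Cumulative hazard of an exponential distribution: H(t) = rate * t.
rate <- 2
t <- c(0.5, 1, 3)
H <- -pexp(t, rate = rate, lower.tail = FALSE, log.p = TRUE)
print(all.equal(H, rate * t))

# Far in the tail the naive form loses everything: F(t) rounds to 1,
# so 1 - F(t) becomes 0 and the log is infinite.
naive  <- -log(1 - pexp(50, rate = 1))
stable <- -pexp(50, rate = 1, lower.tail = FALSE, log.p = TRUE)
print(c(naive = naive, stable = stable))

# The same idea for densities: the density underflows to 0,
# but the log-density is still finite.
print(log(dnorm(40)))
print(dnorm(40, log = TRUE))
```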
In addition there are functions ptukey and qtukey for the distribution of the studentized
range of samples from a normal distribution.
Here are some examples
> ## 2-tailed p-value for t distribution
> 2*pt(-2.43, df = 13)
> ## upper 1% point for an F(2, 7) distribution
> qf(0.99, 2, 7)
Chapter 8: Probability distributions 35
Given a (univariate) set of data we can examine its distribution in a large number of
ways. The simplest is to examine the numbers. Two slightly different summaries are given
by summary and fivenum and a display of the numbers by stem (a “stem and leaf” plot).
> data(faithful)
> attach(faithful)
> summary(eruptions)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.600 2.163 4.000 3.488 4.454 5.100
> fivenum(eruptions)
[1] 1.6000 2.1585 4.0000 4.4585 5.1000
> stem(eruptions)
16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578
48 | 00000022335800333
50 | 0370
A stem-and-leaf plot is like a histogram, and R has a function hist to plot histograms.
> hist(eruptions)
> ## make the bins smaller, make a plot of density
> hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
> lines(density(eruptions, bw=0.1))
> rug(eruptions) # show the actual data points
More elegant density plots can be made by density, and we added a line produced by
density in this example. The bandwidth bw was chosen by trial-and-error as the default
gives too much smoothing (it usually does for “interesting” densities). (Automated methods
of bandwidth choice are implemented in packages MASS and KernSmooth.)
[Figure: Histogram of eruptions; x-axis: eruptions; y-axis: Relative Frequency]
We can plot the empirical cumulative distribution function by using function ecdf in
the standard package stepfun.
> library(stepfun)
> plot(ecdf(eruptions), do.points=FALSE, verticals=TRUE)
This distribution is obviously far from any standard distribution. How about the
right-hand mode, say eruptions of longer than 3 minutes? Let us fit a normal distribution and
overlay the fitted CDF.
> long <- eruptions[eruptions > 3]
> plot(ecdf(long), do.points=FALSE, verticals=TRUE)
> x <- seq(3, 5.4, 0.01)
> lines(x, pnorm(x, mean=mean(long), sd=sqrt(var(long))), lty=3)
[Figure: ecdf(long); y-axis: Fn(x)]
[Figure: Q-Q plot; x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]
x <- rt(250, df = 5)
qqnorm(x); qqline(x)
which will usually (it is a random sample) show longer tails than expected for a normal.
We can make a Q-Q plot against the generating distribution by
qqplot(qt(ppoints(250), df=5), x, xlab="Q-Q plot for t dsn")
qqline(x)
Finally, we might want a more formal test of agreement with normality (or not). Package
ctest provides the Shapiro-Wilk test
> library(ctest)
> shapiro.test(long)
data: long
W = 0.9793, p-value = 0.01052
and the Kolmogorov-Smirnov test
> ks.test(long, "pnorm", mean=mean(long), sd=sqrt(var(long)))
data: long
D = 0.0661, p-value = 0.4284
alternative hypothesis: two.sided
(Note that the distribution theory is not valid here as we have estimated the parameters of
the normal distribution from the same sample.)
Consider the following sets of data on the latent heat of the fusion of ice (cal/gm) from
Rice (1995, p.490)
Method A: 79.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
80.05 80.03 80.02 80.00 80.02
Method B: 80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
Boxplots provide a simple graphical comparison of the two samples.
A <- scan()
79.98 80.04 80.02 80.04 80.03 80.03 80.04 79.97
80.05 80.03 80.02 80.00 80.02
B <- scan()
80.02 79.94 79.98 79.97 79.97 80.03 79.95 79.97
boxplot(A, B)
which indicates that the first group tends to give higher results than the second.
[Figure: boxplots of the two samples, labelled 1 and 2]
To test for the equality of the means of the two samples, we can use an unpaired t-test
by
> library(ctest)
> t.test(A, B)
data: A and B
t = 3.2499, df = 12.027, p-value = 0.00694
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01385526 0.07018320
sample estimates:
mean of x mean of y
80.02077 79.97875
which does indicate a significant difference, assuming normality. By default the R function
does not assume equality of variances in the two samples (in contrast to the similar S-Plus
t.test function). We can use the F test to test for equality in the variances, provided that
the two samples are from normal populations.
> var.test(A, B)
data: A and B
F = 0.5837, num df = 12, denom df = 7, p-value = 0.3938
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.1251097 2.1052687
sample estimates:
ratio of variances
0.5837405
which shows no evidence of a significant difference, and so we can use the classical t-test
that assumes equality of the variances.
> t.test(A, B, var.equal=TRUE)
data: A and B
t = 3.4722, df = 19, p-value = 0.002551
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01669058 0.06734788
sample estimates:
mean of x mean of y
80.02077 79.97875
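The tests above can be rerun non-interactively by entering the data with c() instead of scan(). This is a sketch: the object names welch, ftest and pooled are introduced here, and no library() call is made because in current R releases these test functions are in the stats package, which is attached by default.

```r
# The two samples from the text, entered with c() rather than scan().
A <- c(79.98, 80.04, 80.02, 80.04, 80.03, 80.03, 80.04, 79.97,
       80.05, 80.03, 80.02, 80.00, 80.02)
B <- c(80.02, 79.94, 79.98, 79.97, 79.97, 80.03, 79.95, 79.97)

welch  <- t.test(A, B)                    # unequal variances (the default)
ftest  <- var.test(A, B)                  # F test for equality of variances
pooled <- t.test(A, B, var.equal = TRUE)  # classical pooled-variance t-test

# Results are stored in objects and can be interrogated further.
print(c(mean.A = mean(A), mean.B = mean(B)))
print(c(welch.t  = unname(welch$statistic),
        F.ratio  = unname(ftest$statistic),
        pooled.t = unname(pooled$statistic)))
print(pooled$conf.int)
```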
All these tests assume normality of the two samples. The two-sample Wilcoxon (or
Mann-Whitney) test only assumes a common continuous distribution under the null
hypothesis.
> wilcox.test(A, B)
data: A and B
W = 89, p-value = 0.007497
alternative hypothesis: true mu is not equal to 0
Warning message:
Cannot compute exact p-value with ties in: wilcox.test(A, B)
Note the warning: there are several ties in each sample, which suggests strongly that these
data are from a discrete distribution (probably due to rounding).
There are several ways to compare graphically the two samples. We have already seen
a pair of boxplots. The following
> library(stepfun)
> plot(ecdf(A), do.points=FALSE, verticals=TRUE, xlim=range(A, B))
> plot(ecdf(B), do.points=FALSE, verticals=TRUE, add=TRUE)
will show the two empirical CDFs, and qqplot will perform a Q-Q plot of the two samples.
The Kolmogorov-Smirnov test is of the maximal vertical distance between the two ecdf’s,
assuming a common continuous distribution:
> ks.test(A, B)
data: A and B
D = 0.5962, p-value = 0.05919
alternative hypothesis: two.sided
Warning message:
cannot compute correct p-values with ties in: ks.test(A, B)
Chapter 9: More language features. Loops and conditional execution 41
abline(lsfit(xc[[i]], yc[[i]]))
}
(Note the function split(), which produces a list of vectors obtained by splitting a larger
vector according to the classes specified by a category. This is a useful function, mostly
used in connection with boxplots. See the help facility for further details.)
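A small sketch of split() follows; the vector y and the grouping factor grp are made-up illustrations.

```r
# Illustrative measurements and a grouping factor (assumed values).
y   <- c(5.1, 4.9, 6.3, 5.8, 7.1, 6.5)
grp <- factor(c("a", "a", "b", "b", "c", "c"))

yc <- split(y, grp)     # a list with one vector per level of grp
print(names(yc))
print(yc$b)
print(sapply(yc, mean)) # per-group summaries without an explicit loop
```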
WARNING: for() loops are used in R code much less often than in compiled
languages. Code that takes a ‘whole object’ view is likely to be both clearer
and faster in R.
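To illustrate the ‘whole object’ view, here is the same sum of squares written with a for() loop and as a single vector expression. The names s_loop and s_vec are illustrative.

```r
x <- 1:10

# Element-by-element version with an explicit loop.
s_loop <- 0
for (i in seq_along(x)) {
  s_loop <- s_loop + x[i]^2
}

# 'Whole object' version: operate on the vector at once.
s_vec <- sum(x^2)

print(c(loop = s_loop, vectorized = s_vec))
```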
Other looping facilities include the
> repeat expr
statement and the
> while (condition) expr
statement.
The break statement can be used to terminate any loop, possibly abnormally. This is
the only way to terminate repeat loops.
The next statement can be used to discontinue one particular cycle and skip to the
“next”.
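A minimal sketch of repeat, while, break and next follows; the variable names and the numbers are illustrative.

```r
# repeat with break and next: add up the odd numbers from 1 to 10.
i <- 0
odd_sum <- 0
repeat {
  i <- i + 1
  if (i > 10) break        # the only way out of a repeat loop
  if (i %% 2 == 0) next    # skip the rest of this cycle for even i
  odd_sum <- odd_sum + i
}

# while: count how many halvings bring 100 down to 1 or below.
k <- 100
halvings <- 0
while (k > 1) {
  k <- k / 2
  halvings <- halvings + 1
}

print(c(odd_sum = odd_sum, halvings = halvings))
```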
Control statements are most often used in connection with functions which are discussed
in Chapter 10 [Writing your own functions], page 43, and where more examples will emerge.
Chapter 10: Writing your own functions 43
After this object is created it is permanent, like all objects, and may be used in statements
such as
> regcoeff <- bslash(Xmat, yvar)
and so on.
The classical R function lsfit() does this job quite well, and more (see footnote 1). It in turn uses the
functions qr() and qr.coef() in the slightly counterintuitive way above to do this part of
the calculation. Hence there is probably some value in having just this part isolated in a
simple to use function if it is going to be in frequent use. If so, we may wish to make it a
matrix binary operator for even more convenient use.
A block design is defined by two factors, say blocks (b levels) and varieties (v levels).
If R and K are the v by v and b by b replications and block size matrices, respectively, and
N is the b by v incidence matrix, then the efficiency factors are defined as the eigenvalues
of the matrix
E = I_v - R^{-1/2} N' K^{-1} N R^{-1/2} = I_v - A' A,
where A = K^{-1/2} N R^{-1/2}. One way to write the function is given below.
> bdeff <- function(blocks, varieties) {
blocks <- as.factor(blocks) # minor safety move
b <- length(levels(blocks))
varieties <- as.factor(varieties) # minor safety move
v <- length(levels(varieties))
K <- as.vector(table(blocks)) # remove dim attr
R <- as.vector(table(varieties)) # remove dim attr
N <- table(blocks, varieties)
A <- 1/sqrt(K) * N * rep(1/sqrt(R), rep(b, v))
sv <- svd(A)
list(eff=1 - sv$d^2, blockcv=sv$u, varietycv=sv$v)
}
It is numerically slightly better to work with the singular value decomposition on this
occasion rather than the eigenvalue routines.
The result of the function is a list giving not only the efficiency factors as the first
component, but also the block and variety canonical contrasts, since sometimes these give
additional useful qualitative information.
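As a hedged usage sketch, bdeff() can be applied to a small complete block design in which every variety appears once in every block. The design (3 blocks, 4 varieties) is an assumption made up for the example; the function body is repeated from the text so that the block runs on its own. For such a design every entry of A equals 1/sqrt(b * v), so one singular value is 1 and the rest are 0.

```r
# bdeff() as given in the text, repeated so the example is self-contained.
bdeff <- function(blocks, varieties) {
  blocks <- as.factor(blocks)
  b <- length(levels(blocks))
  varieties <- as.factor(varieties)
  v <- length(levels(varieties))
  K <- as.vector(table(blocks))
  R <- as.vector(table(varieties))
  N <- table(blocks, varieties)
  A <- 1/sqrt(K) * N * rep(1/sqrt(R), rep(b, v))
  sv <- svd(A)
  list(eff = 1 - sv$d^2, blockcv = sv$u, varietycv = sv$v)
}

# Hypothetical complete block design: 3 blocks, each with all 4 varieties.
blocks    <- rep(1:3, each = 4)
varieties <- rep(1:4, times = 3)

res <- bdeff(blocks, varieties)
print(round(res$eff, 10))   # one value near 0 (the mean), the rest near 1
print(dim(res$blockcv))
print(dim(res$varietycv))
```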
> no.dimnames(X)
This is particularly useful for large integer arrays, where patterns are the real interest
rather than the values.
10.7 Scope
The discussion in this section is somewhat more technical than in other parts of this
document. However, it details one of the major differences between S-Plus and R.
The symbols which occur in the body of a function can be divided into three classes;
formal parameters, local variables and free variables. The formal parameters of a function
are those occurring in the argument list of the function. Their values are determined by
the process of binding the actual function arguments to the formal parameters. Local
variables are those whose values are determined by the evaluation of expressions in the
body of the functions. Variables which are not formal parameters or local variables are
called free variables. Free variables become local variables if they are assigned to. Consider
the following function definition.
f <- function(x) {
y <- 2*x
print(x)
print(y)
print(z)
}
In this function, x is a formal parameter, y is a local variable and z is a free variable.
In R the free variable bindings are resolved by first looking in the environment in which
the function was created. This is called lexical scope. First we define a function called cube.
cube <- function(n) {
sq <- function() n*n
n*sq()
}
The variable n in the function sq is not an argument to that function. Therefore it
is a free variable and the scoping rules must be used to ascertain the value that is to be
associated with it. Under static scope (S-Plus) the value is that associated with a global
variable named n. Under lexical scope (R) it is the parameter to the function cube since
that is the active binding for the variable n at the time the function sq was defined. The
difference between evaluation in R and evaluation in S-Plus is that S-Plus looks for a
global variable called n while R first looks for a variable called n in the environment created
when cube was invoked.
## first evaluation in S
S> cube(2)
Error in sq(): Object "n" not found
Dumped
S> n <- 3
S> cube(2)
[1] 18
## then the same function evaluated in R
R> cube(2)
[1] 8
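The R side of this comparison can be reproduced as a sketch by defining a global n before calling cube(). Under lexical scope the global value is not used.

```r
cube <- function(n) {
  sq <- function() n * n   # n is a free variable inside sq
  n * sq()
}

n <- 3                     # a global n, as in the S transcript above
result <- cube(2)          # sq finds n = 2 in the environment of the call to cube
print(result)
```

The result is 8, as in the R transcript above, not the 18 that a global lookup of n would give.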
Lexical scope can also be used to give functions mutable state. In the following example
we show how R can be used to mimic a bank account. A functioning bank account needs to
have a balance or total, a function for making withdrawals, a function for making deposits
and a function for stating the current balance. We achieve this by creating the three
functions within open.account and then returning a list containing them. When open.account
is invoked it takes a numerical argument total and returns a list containing the three
functions. Because these functions are defined in an environment which contains total, they
will have access to its value.
The special assignment operator, <<-, is used to change the value associated with total.
This operator looks back in enclosing environments for an environment that contains the
symbol total and when it finds such an environment it replaces the value, in that
environment, with the value of the right hand side. If the global or top-level environment is
reached without finding the symbol total then that variable is created and assigned to there.
For most users <<- creates a global variable and assigns the value of the right hand side to
it (see footnote 2). Only when <<- has been used in a function that was returned as the value
of another function will the special behavior described here occur.
open.account <- function(total) {
list(
deposit = function(amount) {
if(amount <= 0)
stop("Deposits must be positive!\n")
total <<- total + amount
cat(amount, "deposited. Your balance is", total, "\n\n")
},
withdraw = function(amount) {
if(amount > total)
stop("You don’t have that much money!\n")
total <<- total - amount
cat(amount, "withdrawn. Your balance is", total, "\n\n")
},
balance = function() {
cat("Your balance is", total, "\n\n")
}
)
}
ross$withdraw(30)
ross$balance()
robert$balance()
ross$deposit(50)
ross$balance()
ross$withdraw(500)
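The same mechanism can be seen in a smaller sketch. The function make.counter and the objects a and b are hypothetical names introduced here; each call creates a separate environment holding its own count, which <<- updates in place.

```r
# Hypothetical example: a counter with private, mutable state.
make.counter <- function(count = 0) {
  list(
    increment = function(by = 1) {
      count <<- count + by   # updates count in the enclosing environment
      invisible(count)
    },
    value = function() count
  )
}

a <- make.counter()          # starts at 0
b <- make.counter(10)        # independent state, starts at 10

a$increment()
a$increment(5)
b$increment()

print(c(a = a$value(), b = b$value()))
```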
be placed in any directory. If R is invoked in that directory then that file will be sourced.
This file gives individual users control over their workspace and allows for different startup
procedures in different working directories. If no ‘.Rprofile’ file is found in the startup
directory, then R looks for a ‘.Rprofile’ file in the user’s home directory and uses that (if
it exists).
Any function named .First() in either of the two profile files or in the ‘.RData’ image
has a special status. It is automatically performed at the beginning of an R session and
may be used to initialize the environment. For example, the definition in the example below
alters the prompt to $ and sets up various other useful things that can then be taken for
granted in the rest of the session.
Thus, the sequence in which files are executed is, ‘Rprofile.site’, ‘.Rprofile’,
‘.RData’ and then .First(). A definition in later files will mask definitions in earlier files.
> .First <- function() {
    options(prompt="$ ", continue="+\t")  # $ is the prompt
    options(digits=5, length=999)         # custom numbers and printout
    x11()                                 # for graphics
    par(pch = "+")                        # plotting character
    source(file.path(Sys.getenv("HOME"), "R", "mystuff.R"))
                                          # my personal package
    library(stepfun)                      # attach the step function tools
  }
Similarly a function .Last(), if defined, is executed at the very end of the session. An
example is given below.
> .Last <- function() {
    graphics.off()                        # a small safety measure.
    cat(paste(date(),"\nAdios\n"))        # Is it time for lunch?
  }
> methods(class="data.frame")
Conversely the number of classes a generic function can handle can also be quite large.
For example the plot() function has a default method and variants for objects of classes
"data.frame", "density", "factor", and more. A complete list can be got again by using
the methods() function:
> methods(plot)
The reader is referred to the official references for a complete discussion of this mecha-
nism.
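As a rough illustration of how a generic function selects a method by class, consider the sketch below. The generic describe and the class "account" are hypothetical names invented for this example; they are not supplied with R.

```r
# Hypothetical generic and methods, for illustration only.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "an object with no special method"
describe.account <- function(x, ...) paste("an account holding", x$total)

acct <- structure(list(total = 100), class = "account")
res.acct  <- describe(acct)   # dispatches to describe.account
res.other <- describe(1:3)    # falls back to describe.default
```

Calling methods(describe) would then list both methods, in the same way that methods(plot) lists the variants of plot().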
Chapter 11: Statistical models in R 52
11 Statistical models in R
This section presumes the reader has some familiarity with statistical methodology, in
particular with regression analysis and the analysis of variance. Later we make some rather
more ambitious presumptions, namely that something is known about generalized linear
models and nonlinear regression.
The requirements for fitting statistical models are sufficiently well defined to make it
possible to construct general tools that apply in a broad spectrum of problems.
R provides an interlocking suite of facilities that make fitting statistical models very
simple. As we mention in the introduction, the basic output is minimal, and one needs to
ask for the details by calling extractor functions.
y = Xβ + e
where y is the response vector, X is the model matrix or design matrix and has columns
x_0, x_1, ..., x_p, the determining variables. Very often x_0 will be a column of ones defining
an intercept term.
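To make the role of X concrete, here is a small sketch with made-up numbers (the data values are assumptions chosen only for illustration). It builds the model matrix for a simple regression and solves the least squares problem directly, which should agree with lm().

```r
x <- c(1.2, 2.5, 3.1, 4.8, 5.0)          # made-up data
y <- c(2.3, 4.1, 5.2, 8.9, 9.4)

X <- model.matrix(y ~ x)                  # columns "(Intercept)" and "x"
beta <- solve(t(X) %*% X, t(X) %*% y)     # least squares via the normal equations
fit <- lm(y ~ x)                          # the usual route to the same estimates
```

The first column of X is the column of ones that defines the intercept term.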
Examples
Before giving a formal specification, a few examples may usefully set the picture.
Suppose y, x, x0, x1, x2, . . . are numeric variables, X is a matrix and A, B, C, . . . are
factors. The following formulae on the left side below specify statistical models as described
on the right.
y ~ x
y ~ 1 + x
Both imply the same simple linear regression model of y on x. The first has an
implicit intercept term, and the second an explicit one.
y ~ 0 + x
y ~ -1 + x
y ~ x - 1
Simple linear regression of y on x through the origin (that is, without an
intercept term).
log(y) ~ x1 + x2
Multiple regression of the transformed variable, log(y), on x1 and x2 (with an
implicit intercept term).
y ~ poly(x,2)
y ~ 1 + x + I(x^2)
Polynomial regression of y on x of degree 2. The first form uses orthogonal
polynomials, and the second uses explicit powers, as basis.
y ~ X + poly(x,2)
Multiple regression of y with model matrix consisting of the matrix X as well as
polynomial terms in x to degree 2.
y ~ A
Single classification analysis of variance model of y, with classes determined by
A.
y ~ A + x
Single classification analysis of covariance model of y, with classes determined
by A, and with covariate x.
y ~ A*B
y ~ A + B + A:B
y ~ B %in% A
y ~ A/B
Two factor non-additive model of y on A and B. The first two specify the same
crossed classification and the second two specify the same nested classification.
In abstract terms all four specify the same model subspace.
y ~ (A + B + C)^2
y ~ A*B*C - A:B:C
Three factor experiment but with a model containing main effects and two
factor interactions only. Both formulae specify the same model.
y ~ A*x
y ~ A/x
y ~ A/(1 + x) - 1
Separate simple linear regression models of y on x within the levels of A, with
different codings. The last form produces explicit estimates of as many different
intercepts and slopes as there are levels in A.
y ~ A*B + Error(C)
An experiment with two treatment factors, A and B, and error strata deter-
mined by factor C. For example a split plot experiment, with whole plots (and
hence also subplots), determined by factor C.
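Some of the equivalences claimed above can be checked directly. The sketch below uses a small simulated data frame d (an assumption made purely for illustration) to compare the crossed and nested forms, and uses terms() to compare the two three-factor formulae without needing any data.

```r
set.seed(1)
d <- data.frame(A = gl(2, 6), B = gl(3, 2, 12), y = rnorm(12))  # simulated data

f1 <- lm(y ~ A*B, data = d)            # crossed form
f2 <- lm(y ~ A + B + A:B, data = d)    # the same model written out
f3 <- lm(y ~ A/B, data = d)            # nested coding, same model subspace

# Term labels of the two three-factor formulae.
labels.a <- attr(terms(y ~ (A + B + C)^2), "term.labels")
labels.b <- attr(terms(y ~ A*B*C - A:B:C), "term.labels")
```

The first two fits have identical coefficients, while the nested form has different coefficients but the same fitted values.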
The operator ~ is used to define a model formula in R. The form, for an ordinary linear
model, is
response ~ op_1 term_1 op_2 term_2 op_3 term_3 ...
where
response is a vector or matrix (or expression evaluating to a vector or matrix) defining
the response variable(s).
op_i is an operator, either + or -, implying the inclusion or exclusion of a term in
the model (the first is optional).
term_i is either
• a vector or matrix expression, or 1,
• a factor, or
• a formula expression consisting of factors, vectors or matrices connected
by formula operators.
In all cases each term defines a collection of columns either to be added to or
removed from the model matrix. A 1 stands for an intercept column and is by
default included in the model matrix unless explicitly removed.
The formula operators are similar in effect to the Wilkinson and Rogers notation used
by such programs as Glim and Genstat. One inevitable change is that the operator ‘.’
becomes ‘:’ since the period is a valid name character in R.
The notation is summarized below (based on Chambers & Hastie, 1992, p.29):
Y ~ M
Y is modeled as M.
M_1 + M_2
Include M_1 and M_2.
M_1 - M_2
Include M_1 leaving out terms of M_2.
M_1 : M_2
The tensor product of M_1 and M_2. If both terms are factors, then the
“subclasses” factor.
M_1 %in% M_2
Similar to M_1:M_2, but with a different coding.
M_1 * M_2
M_1 + M_2 + M_1:M_2.
M_1 / M_2
M_1 + M_2 %in% M_1.
M^n
All terms in M together with “interactions” up to order n.
I(M)
Insulate M. Inside M all operators have their normal arithmetic meaning, and
that term appears in the model matrix.
Note that inside the parentheses that usually enclose function arguments all operators
have their normal arithmetic meaning. The function I() is an identity function used only
to allow terms in model formulae to be defined using arithmetic operators.
Note particularly that the model formulae specify the columns of the model matrix, the
specification of the parameters being implicit. This is not the case in other contexts, for
example in specifying nonlinear models.
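The effect of I() can be seen without fitting anything, by asking terms() which terms a formula defines. This is a small sketch; the symbols y and x are placeholders and need not exist.

```r
# Outside I(), ^ is a formula operator, so x^2 collapses to the single term x.
labels.plain <- attr(terms(y ~ x + x^2), "term.labels")

# Inside I(), ^ has its arithmetic meaning and the square is a term of its own.
labels.I <- attr(terms(y ~ x + I(x^2)), "term.labels")
```

Only the second formula yields a quadratic column in the model matrix.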
11.1.1 Contrasts
We need at least some idea how the model formulae specify the columns of the model
matrix. This is easy if we have continuous variables, as each provides one column of the
model matrix (and the intercept will provide a column of ones if included in the model).
What about a k-level factor A? The answer differs for unordered and ordered factors.
For unordered factors k − 1 columns are generated for the indicators of the second, . . . ,
kth levels of the factor. (Thus the implicit parameterization is to contrast the response at
each level with that at the first.) For ordered factors the k − 1 columns are the orthogonal
polynomials on 1, . . . , k, omitting the constant term.
Although the answer is already complicated, it is not the whole story. First, if the
intercept is omitted in a model that contains a factor term, the first such term is encoded
into k columns giving the indicators for all the levels. Second, the whole behavior can be
changed by the options setting for contrasts. The default setting in R is
options(contrasts = c("contr.treatment", "contr.poly"))
The main reason for mentioning this is that R and S have different defaults for unordered
factors, S using Helmert contrasts. So if you need to compare your results to those of a
textbook or paper which used S-Plus, you will need to set
options(contrasts = c("contr.helmert", "contr.poly"))
This is a deliberate difference, as treatment contrasts (R’s default) are thought easier for
newcomers to interpret.
We have still not finished, as the contrast scheme to be used can be set for each term in
the model using the functions contrasts and C.
We have not yet considered interaction terms: these generate the products of the columns
introduced for their component terms.
Although the details are complicated, model formulae in R will normally generate the
models that an expert statistician would expect, provided that marginality is preserved.
Fitting, for example, a model with an interaction but not the corresponding main effects will in
general lead to surprising results, and is for experts only.
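The columns generated for a factor can be inspected with model.matrix(). The sketch below uses a made-up three-level factor and passes the coding through the contrasts.arg argument, so the global options setting is left untouched.

```r
A <- factor(c("a", "b", "c", "a", "b", "c"))   # made-up unordered factor, k = 3

# Treatment contrasts: intercept plus indicators for the 2nd and 3rd levels.
X.treat <- model.matrix(~ A, contrasts.arg = list(A = "contr.treatment"))

# Helmert contrasts: same number of columns, different coding.
X.helmert <- model.matrix(~ A, contrasts.arg = list(A = "contr.helmert"))

# No intercept: the factor is encoded into k indicator columns.
X.noint <- model.matrix(~ A - 1)
```

All three matrices span the same column space; only the interpretation of the coefficients changes.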
anova(object_1, object_2)
Compare a submodel with an outer model and produce an analysis of variance
table.
coefficients(object)
Extract the regression coefficient (matrix).
Short form: coef(object).
deviance(object)
Residual sum of squares, weighted if appropriate.
formula(object)
Extract the model formula.
plot(object)
Produce four plots, showing residuals, fitted values and some diagnostics.
predict(object, newdata=data.frame)
The data frame supplied must have variables specified with the same labels as
the original. The value is a vector or matrix of predicted values corresponding
to the determining variable values in data.frame.
print(object)
Print a concise version of the object. Most often used implicitly.
residuals(object)
Extract the (matrix of) residuals, weighted as appropriate.
Short form: resid(object).
step(object)
Select a suitable model by adding or dropping terms and preserving hierarchies.
The model with the smallest value of AIC (Akaike’s An Information Criterion)
discovered in the stepwise search is returned.
summary(object)
Print a comprehensive summary of the results of the regression analysis.
Other functions for exploring incremental sequences of models are add1(), drop1() and
step(). The names of these give a good clue to their purpose, but for full details see the
on-line help.
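A short session using several of these extractor functions might look as follows. This is only a sketch: the data frame d is simulated for the purpose, and the model names fm0 and fm1 are arbitrary.

```r
set.seed(42)
d <- data.frame(x1 = runif(30), x2 = runif(30))          # simulated data
d$y <- 1 + 2*d$x1 - d$x2 + rnorm(30, sd = 0.1)

fm0 <- lm(y ~ x1, data = d)
fm1 <- lm(y ~ x1 + x2, data = d)

coef(fm1)                                    # regression coefficients
rss <- deviance(fm1)                         # residual sum of squares
anova(fm0, fm1)                              # compare submodel with outer model
predict(fm1, newdata = data.frame(x1 = 0.5, x2 = 0.5))

# Stepwise search starting from fm0; a single formula as scope gives the upper model.
best <- step(fm0, scope = y ~ x1 + x2, trace = 0)
```

Here step() should add x2 to the model, since doing so lowers the AIC considerably.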
η = β1 x1 + β2 x2 + · · · + βp xp ,
where ϕ is a scale parameter (possibly known), and is constant for all observations, A
represents a prior weight, assumed known but possibly varying with the observations,
and µ is the mean of y. So it is assumed that the distribution of y is determined by its
mean and possibly a scale parameter as well.
• The mean, µ, is a smooth invertible function of the linear predictor:
11.6.1 Families
The class of generalized linear models handled by facilities supplied in R includes gaus-
sian, binomial, poisson, inverse gaussian and gamma response distributions and also quasi-
likelihood models where the response distribution is not explicitly specified. In the latter
case the variance function must be specified as a function of the mean, but in other cases
this function is implied by the response distribution.
Each response distribution admits a variety of link functions to connect the mean with
the linear predictor. Those automatically available are as in the table below.
The combination of a response distribution, a link function and various other pieces of
information that are needed to carry out the modeling exercise is called the family of the
generalized linear model.
Since the distribution of the response depends on the stimulus variables through a single
linear function only, the same mechanism as was used for linear models can still be used to
specify the linear part of a generalized model. The family has to be specified in a different
way.
The R function to fit a generalized linear model is glm() which uses the form
> fitted.model <- glm(formula, family=family.generator, data=data.frame)
The only new feature is the family.generator, which is the instrument by which the family
is described. It is the name of a function that generates a list of functions and expressions
that together define and control the model and estimation process. Although this may seem
a little complicated at first sight, its use is quite simple.
The names of the standard, supplied family generators are given under “Family Name”
in the table in Section 11.6.1 [Families], page 58. Where there is a choice of links, the name
of the link may also be supplied with the family name, in parentheses as a parameter. In
the case of the quasi family, the variance function may also be specified in this way.
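A family generator can also be called on its own to see what it produces. The following sketch simply inspects a few of the components of the resulting objects.

```r
fam <- binomial(link = probit)     # link supplied as a parameter
fam$family                         # "binomial"
fam$link                           # "probit"

default.binomial <- binomial()$link   # the default link for this family
default.poisson  <- poisson()$link

# For quasi, the variance function is specified in the same way.
q <- quasi(link = "inverse", variance = "constant")
v <- q$variance(c(1, 5))           # constant variance function evaluated at two means
```

The object returned is a list, and it is this list that glm() uses to control the fit.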
Some examples make the process clear.
A call such as
> fm <- glm(y ~ x1 + x2, family = gaussian, data = sales)
achieves the same result as
> fm <- lm(y ~ x1+x2, data=sales)
but much less efficiently. Note how the gaussian family is not automatically provided with
a choice of links, so no parameter is allowed. If a problem requires a gaussian family with
a nonstandard link, this can usually be achieved through the quasi family, as we shall see
later.
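The agreement between the two calls is easy to confirm. In the sketch below the data frame sales is simulated as a stand-in, since the original data are not given here.

```r
set.seed(7)
sales <- data.frame(x1 = rnorm(40), x2 = rnorm(40))      # simulated stand-in
sales$y <- 3 + 1.5*sales$x1 - 0.5*sales$x2 + rnorm(40)

fm.glm <- glm(y ~ x1 + x2, family = gaussian, data = sales)
fm.lm  <- lm(y ~ x1 + x2, data = sales)

rbind(glm = coef(fm.glm), lm = coef(fm.lm))   # the estimates coincide
```

The two fits differ in how they are computed and in what the fitted objects contain, not in the estimates.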
where for the probit case, F(z) = Φ(z) is the standard normal distribution function, and in
the logit case (the default), F(z) = e^z/(1 + e^z). In both cases the LD50 is
LD50 = −β0/β1
(minus the intercept divided by the slope of the linear predictor),
that is, the point at which the argument of the distribution function is zero.
The first step is to set the data up as a data frame
> kalythos <- data.frame(x = c(20,35,45,55,70), n = rep(50,5),
                         y = c(6,17,26,37,44))
To fit a binomial model using glm() there are two possibilities for the response:
• If the response is a vector it is assumed to hold binary data, and so must be a 0/1
vector.
• If the response is a two column matrix it is assumed that the first column holds the
number of successes for the trial and the second holds the number of failures.
Here we need the second of these conventions, so we add a matrix to our data frame:
> kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
To fit the models we use
> fmp <- glm(Ymat ~ x, family = binomial(link=probit), data = kalythos)
> fml <- glm(Ymat ~ x, family = binomial, data = kalythos)
Since the logit link is the default the parameter may be omitted on the second call. To
see the results of each fit we could use
> summary(fmp)
> summary(fml)
Both models fit (all too) well. To find the LD50 estimate we can use a simple function:
> ld50 <- function(b) -b[1]/b[2]
> ldp <- ld50(coef(fmp)); ldl <- ld50(coef(fml)); c(ldp, ldl)
The actual estimates from this data are 43.663 years and 43.601 years respectively.
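As a check on the interpretation of the LD50, the fitted probability at the estimated age should be one half. The sketch below repeats the logit fit from above so that it is self-contained, then evaluates the fitted model at the estimate with predict().

```r
kalythos <- data.frame(x = c(20,35,45,55,70), n = rep(50,5),
                       y = c(6,17,26,37,44))
kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
fml <- glm(Ymat ~ x, family = binomial, data = kalythos)

ldl <- unname(-coef(fml)[1] / coef(fml)[2])   # LD50 for the logit model

# Fitted probability at the LD50: the linear predictor is zero there.
p.at.ld50 <- predict(fml, newdata = data.frame(x = ldl), type = "response")
```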
Poisson models
With the Poisson family the default link is the log, and in practice the major use of
this family is to fit surrogate Poisson log-linear models to frequency data, whose actual
distribution is often multinomial. This is a large and important subject we will not discuss
further here. It even forms a major part of the use of non-gaussian generalized models
overall.
Occasionally genuinely Poisson data arises in practice and in the past it was often an-
alyzed as gaussian data after either a log or a square-root transformation. As a graceful
alternative to the latter, a Poisson generalized linear model may be fitted as in the following
example:
> fmod <- glm(y ~ A + B + x, family = poisson(link=sqrt),
              data = worm.counts)
Quasi-likelihood models
For all families the variance of the response will depend on the mean and will have
the scale parameter as a multiplier. The form of dependence of the variance on the mean
is a characteristic of the response distribution; for example for the poisson distribution
Var[y] = µ.
For quasi-likelihood estimation and inference the precise response distribution is not
specified, but rather only a link function and the form of the variance function as it depends
on the mean. Since quasi-likelihood estimation uses formally identical techniques to those
for the gaussian distribution, this family provides a way of fitting gaussian models with
non-standard link functions or variance functions, incidentally.
For example, consider fitting the non-linear regression
y = θ1 z1 / (z2 − θ2) + e
which may be written alternatively as
y = 1 / (β1 x1 + β2 x2) + e
where x1 = z2/z1, x2 = −1/z1, β1 = 1/θ1 and β2 = θ2/θ1. Supposing a suitable data frame
to be set up we could fit this non-linear regression as
> nlfit <- glm(y ~ x1 + x2 - 1,
               family = quasi(link=inverse, variance=constant),
               data = biochem)
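The reason the inverse link applies can be seen by taking the reciprocal of the mean of y. The working below is a short derivation added for illustration, writing µ for the mean of y; it shows that 1/µ is linear in x1 = z2/z1 and x2 = −1/z1, with no intercept, which is why the formula above ends in - 1.

```latex
\begin{align*}
\mu &= \frac{\theta_1 z_1}{z_2 - \theta_2}, \\
\frac{1}{\mu} &= \frac{z_2 - \theta_2}{\theta_1 z_1}
  = \frac{1}{\theta_1}\cdot\frac{z_2}{z_1}
    + \frac{\theta_2}{\theta_1}\cdot\left(-\frac{1}{z_1}\right)
  = \beta_1 x_1 + \beta_2 x_2 .
\end{align*}
```

The original parameters are recovered from the fitted coefficients as θ1 = 1/β1 and θ2 = β2/β1.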
The reader is referred to the manual and the help document for further information, as
needed.
Parameters:
    Estimate Std. Error t value Pr(>|t|)
Vm 2.127e+02  6.947e+00  30.615 3.24e-11
K  6.412e-02  8.281e-03   7.743 1.57e-05
12 Graphical procedures
Graphical facilities are an important and extremely versatile component of the R envi-
ronment. It is possible to use the facilities to display a wide variety of statistical graphs
and also to build entirely new types of graph.
The graphics facilities can be used in both interactive and batch modes, but in most
cases, interactive use is more productive. Interactive use is also easy because at startup
time R initiates a graphics device driver which opens a special graphics window for the
display of interactive graphics. Although this is done automatically, it is useful to know
that the command used is X11() under UNIX, windows() under Windows and macintosh()
on MacOS 8/9.
Once the device driver is running, R plotting commands can be used to produce a variety
of graphical displays and to create entirely new kinds of display.
Plotting commands are divided into three basic groups:
• High-level plotting functions create a new plot on the graphics device, possibly with
axes, labels, titles and so on.
• Low-level plotting functions add more information to an existing plot, such as extra
points, lines and labels.
• Interactive graphics functions allow you to interactively add information to, or extract
information from, an existing plot, using a pointing device such as a mouse.
In addition, R maintains a list of graphical parameters which can be manipulated to
customize your plots.
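The first two groups can be illustrated with a few lines. This sketch draws to a temporary PDF file rather than a window, an assumption made so that it also runs in batch mode; in an interactive session the pdf() and dev.off() calls can simply be dropped.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)                                   # file device, no window needed

x <- seq(0, 2*pi, length.out = 50)
plot(x, sin(x), type = "l",
     main = "High- and low-level calls")   # high-level: starts a new plot
points(x, cos(x), pch = "+")               # low-level: adds points to it
abline(h = 0, lty = 2)                     # low-level: adds a reference line

dev.off()                                  # close the device and write the file
```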
plot(f)
plot(f, y)
f is a factor object, y is a numeric vector. The first form generates a bar plot
of f; the second form produces boxplots of y for each level of f.
plot(df)
plot(~ expr)
plot(y ~ expr)
df is a data frame, y is any object, expr is a list of object names separated
by ‘+’ (e.g., a + b + c). The first two forms produce distributional plots of the
variables in a data frame (first form) or of a number of named objects (second
form). The third form plots y against every object named in expr.
qqnorm(x)
qqline(x)
qqplot(x, y)
Distribution-comparison plots. The first form plots the numeric vector x against
the expected Normal order scores (a normal scores plot) and the second adds a
straight line to such a plot by drawing a line through the distribution and data
quartiles. The third form plots the quantiles of x against those of y to compare
their respective distributions.
hist(x)
hist(x, nclass=n)
hist(x, breaks=b, ...)
Produces a histogram of the numeric vector x. A sensible number of classes is
usually chosen, but a recommendation can be given with the nclass= argument.
Alternatively, the breakpoints can be specified exactly with the breaks= argu-
ment. If the probability=TRUE argument is given, the bars represent relative
frequencies instead of counts.
dotchart(x, ...)
Constructs a dotchart of the data in x. In a dotchart the y-axis gives a labelling
of the data in x and the x-axis gives its value. For example it allows easy visual
selection of all data entries with values lying in specified ranges.
image(x, y, z, ...)
contour(x, y, z, ...)
persp(x, y, z, ...)
Plots of three variables. The image plot draws a grid of rectangles using different
colours to represent the value of z, the contour plot draws contour lines to
represent the value of z, and the persp plot draws a 3D surface.
points(x, y)
lines(x, y)
Adds points or connected lines to the current plot. plot()’s type= argument
can also be passed to these functions (and defaults to "p" for points() and
"l" for lines().)
text(x, y, labels, ...)
Add text to a plot at points given by x, y. Normally labels is an integer or
character vector in which case labels[i] is plotted at point (x[i], y[i]).
The default is 1:length(x).
Note: This function is often used in the sequence
> plot(x, y, type="n"); text(x, y, names)
The graphics parameter type="n" suppresses the points but sets up the axes,
and the text() function supplies special characters, as specified by the charac-
ter vector names for the points.
abline(a, b)
abline(h=y)
abline(v=x)
abline(lm.obj)
Adds a line of slope b and intercept a to the current plot. h=y may be used to
specify y-coordinates for the heights of horizontal lines to go across a plot, and
v=x similarly for the x-coordinates for vertical lines. Also lm.obj may be a list
with a coefficients component of length 2 (such as the result of model-fitting
functions), which are taken as an intercept and slope, in that order.
polygon(x, y, ...)
Draws a polygon defined by the ordered vertices in (x, y) and (optionally) shades
it in with hatch lines, or fills it if the graphics device allows the filling of figures.
legend(x, y, legend, ...)
Adds a legend to the current plot at the specified position. Plotting characters,
line styles, colors etc., are identified with the labels in the character vector
legend. At least one other argument v (a vector the same length as legend)
with the corresponding values of the plotting unit must also be given, as follows:
legend( , fill=v)
Colors for filled boxes
legend( , col=v)
Colors in which points or lines will be drawn
legend( , lty=v)
Line styles
legend( , lwd=v)
Line widths
legend( , pch=v)
Plotting characters (character vector)
title(main, sub)
Adds a title main to the top of the current plot in a large font and (optionally)
a sub-title sub at the bottom in a smaller font.
axis(side, ...)
Adds an axis to the current plot on the side given by the first argument (1 to 4,
counting clockwise from the bottom.) Other arguments control the positioning
of the axis within or beside the plot, and tick positions and labels. Useful for
adding custom axes after calling plot() with the axes=FALSE argument.
Low-level plotting functions usually require some positioning information (e.g., x and y
coordinates) to determine where to place the new plot elements. Coordinates are given in
terms of user coordinates which are defined by the previous high-level graphics command
and are chosen based on the supplied data.
Where x and y arguments are required, it is also sufficient to supply a single argument
being a list with elements named x and y. Similarly a matrix with two columns is also
valid input. In this way functions such as locator() (see below) may be used to specify
positions on a plot interactively.
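Both alternative forms of input can be tried as follows. The sketch uses made-up coordinates and a temporary PDF file as the device (an assumption, so that no window is required); xy.coords() is the function that the plotting routines use to interpret such arguments.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)

pts <- list(x = c(1, 2, 3), y = c(2, 4, 3))    # a list with components x and y
plot(pts, type = "n")                           # set up the axes only
lines(pts)                                      # the list stands in for x and y
text(cbind(pts$x, pts$y),                       # a two-column matrix also works
     labels = c("a", "b", "c"))

dev.off()

xy <- xy.coords(pts)                            # how the list is interpreted
```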
par(c("col", "lty"))
With a character vector argument, returns only the named graphics parameters
(again, as a list.)
par(col=4, lty=2)
With named arguments (or a single list argument), sets the values of the named
graphics parameters, and returns the original values of the parameters as a list.
Setting graphics parameters with the par() function changes the value of the parameters
permanently, in the sense that all future calls to graphics functions (on the current device)
will be affected by the new value. You can think of setting graphics parameters in this way
as setting “default” values for the parameters, which will be used by all graphics functions
unless an alternative value is given.
Note that calls to par() always affect the global values of graphics parameters, even
when par() is called from within a function. This is often undesirable behavior—usually
we want to set some graphics parameters, do some plotting, and then restore the original
values so as not to affect the user’s R session. You can restore the initial values by saving
the result of par() when making changes, and restoring the initial values when plotting is
complete.
> oldpar <- par(col=4, lty=2)
. . . plotting commands . . .
> par(oldpar)
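Inside a function the same idea is often combined with on.exit(), so that the original values are restored even if a plotting command fails. The following is a sketch only: dashed.plot is a hypothetical helper, and a temporary PDF file is used as the device so that the example runs without a window.

```r
# Hypothetical plotting helper, for illustration only.
dashed.plot <- function(x, y) {
  oldpar <- par(col = 4, lty = 2)    # par() returns the previous values
  on.exit(par(oldpar))               # restore them when the function exits
  plot(x, y, type = "l")
}

pdf(tempfile(fileext = ".pdf"))
before <- par("col", "lty")
dashed.plot(1:10, (1:10)^2)
after <- par("col", "lty")           # unchanged by the call
dev.off()
```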
pch="+" Character to be used for plotting points. The default varies with graphics
drivers, but it is usually ‘◦’. Plotted points tend to appear slightly above or
below the appropriate position unless you use "." as the plotting character,
which produces centered points.
pch=4 When pch is given as an integer between 0 and 18 inclusive, a specialized
plotting symbol is produced. To see what the symbols are, use the command
> legend(locator(1), as.character(0:18), pch = 0:18)
lty=2 Line types. Alternative line styles are not supported on all graphics devices
(and vary on those that do) but line type 1 is always a solid line, and line types
2 and onwards are dotted or dashed lines, or some combination of both.
lwd=2 Line widths. Desired width of lines, in multiples of the “standard” line width.
Affects axis lines as well as lines drawn with lines(), etc.
col=2 Colors to be used for points, lines, text, filled regions and images. Each of these
graphic elements has a list of possible colors, and the value of this parameter is
an index to that list. Obviously, this parameter applies only to a limited range
of devices.
font=2 An integer which specifies which font to use for text. If possible, device drivers
arrange so that 1 corresponds to plain text, 2 to bold face, 3 to italic and 4 to
bold italic.
font.axis
font.lab
font.main
font.sub The font to be used for axis annotation, x and y labels, main and sub-titles,
respectively.
adj=-0.1 Justification of text relative to the plotting position. 0 means left justify, 1
means right justify and 0.5 means to center horizontally about the plotting
position. The actual value is the proportion of text that appears to the left of
the plotting position, so a value of -0.1 leaves a gap of 10% of the text width
between the text and the plotting position.
cex=1.5 Character expansion. The value is the desired size of text characters (including
plotting characters) relative to the default text size.
las=1 Orientation of axis labels. 0 means always parallel to axis, 1 means always
horizontal, and 2 means always perpendicular to the axis.
mgp=c(3, 1, 0)
Positions of axis components. The first component is the distance from the axis
label to the axis position, in text lines. The second component is the distance to
the tick labels, and the final component is the distance from the axis position to
the axis line (usually zero). Positive numbers measure outside the plot region,
negative numbers inside.
tck=0.01 Length of tick marks, as a fraction of the size of the plotting region. When
tck is small (less than 0.5) the tick marks on the x and y axes are forced to
be the same size. A value of 1 gives grid lines. Negative values give tick marks
outside the plotting region. Use tck=0.01 and mgp=c(1,-1.5,0) for internal
tick marks.
xaxs="s"
yaxs="d" Axis styles for the x and y axes, respectively. With styles "s" (standard) and
"e" (extended) the smallest and largest tick marks always lie outside the range
of the data. Extended axes may be widened slightly if any points are very near
the edge. This style of axis can sometimes leave large blank gaps near the edges.
With styles "i" (internal) and "r" (the default) tick marks always fall within
the range of the data, however style "r" leaves a small amount of space at the
edges.
Setting this parameter to "d" (direct axis) locks in the current axis and uses
it for all future plots (or until the parameter is set to one of the other values
above, at least.) Useful for generating series of fixed-scale plots.
A typical figure is
[Figure: the plot region, with x and y axes, surrounded by the figure margins; mai[1] and
mai[2] label the bottom and left margins and mar[3] labels the top margin.]
mar=c(4, 2, 2, 1)
Similar to mai, except the measurement unit is text lines.
mar and mai are equivalent in the sense that setting one changes the value of the other.
The default values chosen for this parameter are often too large; the right-hand margin is
rarely needed, and neither is the top margin if no title is being used. The bottom and left
margins must be large enough to accommodate the axis and tick labels. Furthermore, the
default is chosen without regard to the size of the device surface: for example, using the
postscript() driver with the height=4 argument will result in a plot which is about 50%
margin unless mar or mai are set explicitly. When multiple figures are in use (see below)
the margins are reduced by half, however this may not be enough when many figures share
the same page.
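The link between the two parameters can be observed directly. In this sketch a temporary PDF file serves as the device (an assumption, so that the example runs without a window), and the margin values are arbitrary.

```r
pdf(tempfile(fileext = ".pdf"))

par(mar = c(4, 2, 2, 1))            # set the margins in text lines
mai.now <- par("mai")               # the same margins, reported in inches

par(mai = c(1, 0.5, 0.5, 0.25))     # now set the margins in inches
mar.now <- par("mar")               # mar has changed to match

dev.off()
```

On any one device the two sets of values differ only by a constant factor, the height of a text line in inches.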
R allows you to create an n by m array of figures on a single page. Each figure has
its own margins, and the array of figures is optionally surrounded by an outer margin, as
shown in the following figure.
[Figure: a page holding a 3 by 2 array of figures (mfrow=c(3,2)), with the current figure
given by mfg=c(3,2,3,2), surrounded by the outer margins oma[3], omi[4] and omi[1].]
plot(x, y)
Standard point plot.
lines(x, lrf$y)
Add in the local regression.
abline(0, 1, lty=3)
The true regression line: (intercept 0, slope 1).
abline(coef(fm))
Unweighted regression line.
abline(coef(fm1), col = "red")
Weighted regression line.
detach()
Remove data frame from the search path.
plot(fitted(fm), resid(fm),
xlab="Fitted values",
ylab="Residuals",
main="Residuals vs Fitted")
A standard regression diagnostic plot to check for heteroscedasticity. Can you
see it?
qqnorm(resid(fm), main="Residuals Rankit Plot")
A normal scores plot to check for skewness, kurtosis and outliers. (Not very
useful here.)
rm(fm, fm1, lrf, x, dummy)
Clean up again.
The next section will look at data from the classical experiment of Michelson and Morley
to measure the speed of light.
file.show("morley.tab")
Optional. Look at the file.
mm <- read.table("morley.tab")
mm
Read in the Michelson and Morley data as a data frame, and look at it. There
are five experiments (column Expt) and each has 20 runs (column Run) and Speed
is the recorded speed of light, suitably coded.
mm$Expt <- factor(mm$Expt)
mm$Run <- factor(mm$Run)
Change Expt and Run into factors.
attach(mm)
Make the data frame visible at position 2 (the default).
plot(Expt, Speed, main="Speed of Light Data", xlab="Experiment No.")
Compare the five experiments with simple boxplots.
fm <- aov(Speed ~ Run + Expt, data=mm)
summary(fm)
Analyze as a randomized block, with ‘runs’ and ‘experiments’ as factors.
Appendix A: A sample session 82
Appendix B Invoking R
‘--save’
‘--no-save’
Control whether data sets should be saved or not at the end of the R session.
If neither is given in an interactive session, the user is asked for the desired
behavior when ending the session with q(); in batch mode, one of these must
be specified.
‘--no-environ’
Do not read any user file to set environment variables.
‘--no-site-file’
Do not read the site-wide profile at startup.
‘--no-init-file’
Do not read the user’s profile at startup.
‘--restore’
‘--no-restore’
‘--no-restore-data’
Control whether saved images (file ‘.RData’ in the directory where R was
started) should be restored at startup or not. The default is to restore.
(‘--no-restore’ implies all the specific ‘--no-restore-*’ options.)
‘--no-restore-history’
Control whether the history file (normally file ‘.Rhistory’ in the directory
where R was started, but can be set by the environment variable R_HISTFILE)
should be restored at startup or not. The default is to restore.
‘--vanilla’
Combine ‘--no-save’, ‘--no-environ’, ‘--no-site-file’, ‘--no-init-file’
and ‘--no-restore’.
‘--no-readline’
Turn off command-line editing via readline. This is useful when running R
from within Emacs using the ess (“Emacs Speaks Statistics”) package. See
Appendix C [The command line editor], page 92, for more information.
‘--min-vsize=N ’
‘--max-vsize=N ’
Specify the minimum or maximum amount of memory used for variable size
objects by setting the “vector heap” size to N bytes. Here, N must either be
an integer or an integer ending with ‘M’, ‘K’, or ‘k’, meaning ‘Mega’ (2^20),
(computer) ‘Kilo’ (2^10), or regular ‘kilo’ (1000).
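The suffix rule can be made concrete with a small R sketch. The function parse_size is hypothetical and is not part of R; it only mirrors the description above, and R's own option parsing may differ in detail.

```r
# Hypothetical helper: convert a size such as "10M", "4K", "4k" or "123"
# into a number of bytes, following the suffix meanings given above.
parse_size <- function(x) {
  m <- regmatches(x, regexec("^([0-9]+)([MKk]?)$", x))[[1]]
  if (length(m) == 0) stop("not a valid size: ", x)
  multipliers <- c(M = 2^20, K = 2^10, k = 1000)
  suffix <- m[3]                       # "" when no suffix was given
  mult <- if (nzchar(suffix)) multipliers[[suffix]] else 1
  as.numeric(m[2]) * mult
}

parse_size("10M")   # 10 * 2^20 bytes
parse_size("4K")    # 4 * 2^10 bytes
parse_size("4k")    # 4 * 1000 bytes
```

Note the difference between ‘K’ and ‘k’: the first is a binary multiple, the second a decimal one.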
‘--min-nsize=N’
‘--max-nsize=N’
Specify the amount of memory used for fixed size objects by setting the number
of “cons cells” to N. See the previous option for details on N. A cons cell takes
28 bytes on a 32-bit machine, and usually 56 bytes on a 64-bit machine.
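To get a feel for what a given number of cons cells costs, the arithmetic can be sketched in R. The function name cons_cell_bytes and the count of 250000 cells are illustrative assumptions; the per-cell sizes are the ones quoted above.

```r
# Hypothetical helper: memory taken by n_cells cons cells, in bytes.
# bytes_per_cell defaults to the 32-bit figure quoted above.
cons_cell_bytes <- function(n_cells, bytes_per_cell = 28) {
  n_cells * bytes_per_cell
}

cons_cell_bytes(250000)                        # value: 7,000,000 bytes
cons_cell_bytes(250000, bytes_per_cell = 56)   # value: 14,000,000 bytes
```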
‘--quiet’
‘--silent’
‘-q’ Do not print out the initial copyright and welcome messages.
‘--slave’ Make R run as quietly as possible. This option is intended to support programs
which use R to compute results for them.
‘--verbose’
Print more information about progress, and in particular set R’s option verbose
to TRUE. R code uses this option to control the printing of diagnostic messages.
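User code can consult the same option. A minimal sketch, in which the function report_mean is hypothetical and only getOption("verbose") is taken from R itself:

```r
# Hypothetical function that prints a diagnostic message only when the
# 'verbose' option is TRUE (for example after starting R with --verbose).
report_mean <- function(x) {
  if (isTRUE(getOption("verbose"))) {
    cat("computing the mean of", length(x), "values\n")
  }
  mean(x)
}

report_mean(c(1, 2, 3, 4))
```

The option can also be switched inside a session with options(verbose = TRUE).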
‘--debugger=name’
‘-d name’ Run R through debugger name. Note that in this case, further command line
options are disregarded, and should instead be given when starting the R
executable from inside the debugger.
‘--gui=type’
‘-g type’ Use type as graphical user interface (note that this also includes interactive
graphics). Currently, possible values for type are ‘X11’ (the default), ‘gnome’
(provided that GNOME support is available), and ‘none’.
Note that input and output can be redirected in the usual way (using ‘<’ and ‘>’).
The command R CMD allows the invocation of various tools which are useful in conjunction
with R, but not intended to be called “directly”. The general form is
R CMD command args
where command is the name of the tool and args the arguments passed on to it.
Currently, the following tools are available.
BATCH Run R in batch mode.
COMPILE Compile files for use with R.
SHLIB Build shared library for dynamic loading.
INSTALL Install add-on packages.
REMOVE Remove add-on packages.
build Build add-on packages.
check Check add-on packages.
LINK Front-end for creating executable programs.
Rprof Post-process R profiling files.
Rdconv Convert Rd format to various other formats, including HTML, nroff, LaTeX,
plain text, and S documentation format.
Rd2dvi Convert Rd format to DVI/PDF.
Rd2txt Convert Rd format to text.
Rdindex Extract index information from Rd files.
Sd2Rd Convert S documentation to Rd format.
The first five tools (i.e., BATCH, COMPILE, SHLIB, INSTALL, and REMOVE) can also be
invoked “directly” without the CMD option, i.e., in the form R command args.
Use
R CMD command --help
to obtain usage information for each of the tools accessible via the R CMD interface.
‘--save’
‘--no-save’
Control whether data sets should be saved or not at the end of the R session.
If neither is given in an interactive session, the user is asked for the desired
behavior when ending the session with q(); in batch mode, one of these must
be specified.
‘--no-site-file’
Suppress reading of the site-wide startup profile.
‘--no-init-file’
Suppress reading of the directory or user’s ‘.Rprofile’ file.
‘--no-environ’
Suppress reading of ‘.Renviron’.
‘--restore’
‘--no-restore’
‘--no-restore-data’
Control whether saved images (file ‘.RData’ in the directory where R was
started) should be restored at startup or not. The default is to restore.
(‘--no-restore’ implies all the specific ‘--no-restore-*’ options.)
‘--no-restore-history’
Control whether the history file (normally file ‘.Rhistory’ in the directory
where R was started, but can be set by the environment variable R_HISTFILE)
should be restored at startup or not. The default is to restore.
‘--vanilla’
Combine ‘--no-save’, ‘--no-restore’, ‘--no-site-file’, ‘--no-init-file’
and ‘--no-environ’.
‘--min-vsize=N’
‘--max-vsize=N’
Specify the minimum or maximum amount of memory used for variable size
objects by setting the “vector heap” size to N bytes. Here, N must either be
an integer or an integer ending with ‘M’, ‘K’, or ‘k’, meaning ‘Mega’ (2^20),
(computer) ‘Kilo’ (2^10), or regular ‘kilo’ (1000).
‘--min-nsize=N’
‘--max-nsize=N’
Specify the amount of memory used for fixed size objects by setting the number
of “cons cells” to N. See the previous option for details on N. A cons cell takes
28 bytes.
‘--max-mem-size=N’
Specify a limit for the amount of memory to be used both for R objects and
working areas (‘malloc-ed’ memory). This is set by default to the smaller of
256Mb and the amount of physical RAM in the machine.
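The default limit is simply a minimum of two quantities. A small R sketch, assuming that “256Mb” means 256 * 2^20 bytes; the function name default_max_mem is hypothetical:

```r
# Hypothetical helper: the default memory cap described above, in bytes.
# Assumption: "256Mb" is read as 256 * 2^20 bytes.
default_max_mem <- function(physical_ram_bytes) {
  min(256 * 2^20, physical_ram_bytes)
}

default_max_mem(128 * 2^20) / 2^20   # machine with 128Mb of RAM: 128
default_max_mem(512 * 2^20) / 2^20   # machine with 512Mb of RAM: 256
```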
‘--quiet’
‘--silent’
‘-q’ Don’t print startup message.
R [options]
as listed later in this section. Input and output redirection equivalent to the UNIX forms
[<infile] and [>outfile] can be selected at startup through the standard Macintosh interface,
but it is not yet fully implemented, so you can ignore it.
Most options control what happens at the beginning and at the end of an R session.
The startup mechanism is as follows (see also the on-line help for topic ‘Startup’ for more
information).
• Unless ‘--no-environ’ was given, R looks for the file ‘.Renviron’ in the ‘etc’
subdirectory of the R application folder. The file, if found, is processed to set
environment variables. It should contain lines of the form ‘name=value’. (See
help(Startup) for a precise description.)
• Then, unless ‘--no-init-file’ was given, R looks for a file called ‘.Rprofile’ in the
‘etc’ subdirectory of the R application folder and processes it (as R code).
• It then loads a saved image from file ‘.RData’ if there is one (unless ‘--no-restore’ or
‘--no-restore-data’ was specified). This file is looked for in the ‘etc’ subdirectory of
the R application folder.
• Finally, if a function .First exists, it is executed. This function (as well as .Last
which is executed at the end of the R session) can be defined in the appropriate startup
profiles, or reside in ‘.RData’.
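The last step of the startup sequence can be illustrated with a pair of definitions that might be placed in ‘.Rprofile’. This is a minimal sketch: the particular option values and messages are arbitrary choices for illustration, not recommended settings.

```r
# Run at the start of every session, after any saved image has been loaded.
.First <- function() {
  options(prompt = "R> ", continue = "+  ")   # illustrative prompt settings
  cat("Session started:", date(), "\n")
}

# Run at the end of the session.
.Last <- function() {
  cat("Session ended:", date(), "\n")
}
```

Because both are ordinary R functions, they are also saved in ‘.RData’ along with the other objects in the workspace.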
In addition, there are options for controlling the memory available to the R process (see
the on-line help for topic ‘Memory’ for more information). Users will not normally need to
use these unless they are trying to limit the amount of memory used by R. Please note
that under Mac OS, you have to reserve in advance the amount of memory assigned to the
R application from the ‘Finder/Information’ menu, as is usually done for any Macintosh
software. An amount of 32000k is sufficient to run all the demos and examples in the base
distribution.
R accepts the following command-line options.
‘--version’
Print version information to standard output and exit successfully.
‘--save’
‘--no-save’
Control whether data sets should be saved or not at the end of the R session.
If neither is given in an interactive session, the user is asked for the desired
behavior when ending the session with q(); in batch mode, one of these must
be specified.
‘--no-init-file’
Do not read the user’s profile at startup.
‘--restore’
‘--no-restore’
‘--no-restore-data’
Control whether saved images (file ‘:etc:.RData’ in the directory where R
was started) should be restored at startup or not. The default is to restore.
(‘--no-restore’ implies all the specific ‘--no-restore-*’ options.)
‘--no-restore-history’
Control whether the history file (normally file ‘.Rhistory’ in the directory
where R was started, but can be set by the environment variable R_HISTFILE)
should be restored at startup or not. The default is to restore.
‘--vanilla’
Combine ‘--no-save’, ‘--no-environ’, ‘--no-site-file’, ‘--no-init-file’
and ‘--no-restore’.
‘--min-vsize=N’
‘--max-vsize=N’
Specify the minimum or maximum amount of memory used for variable size
objects by setting the “vector heap” size to N bytes. Here, N must either be
an integer or an integer ending with ‘M’, ‘K’, or ‘k’, meaning ‘Mega’ (2^20),
(computer) ‘Kilo’ (2^10), or regular ‘kilo’ (1000).
‘--min-nsize=N’
‘--max-nsize=N’
Specify the amount of memory used for fixed size objects by setting the number
of “cons cells” to N. See the previous option for details on N. A cons cell takes
28 bytes on a 32-bit machine, and usually 56 bytes on a 64-bit machine.
‘--quiet’
‘--silent’
‘-q’ Do not print out the initial copyright and welcome messages.
‘--slave’ Make R run as quietly as possible.
‘--verbose’
Print more information about progress, and in particular set R’s option verbose
to TRUE. R code uses this option to control the printing of diagnostic messages.
Appendix C The command line editor
C.1 Preliminaries
When the GNU readline library is available at the time R is configured for compilation
under UNIX, an inbuilt command line editor allowing recall, editing and re-submission of
prior commands is used. Note: this appendix does not apply to the GNOME interface under
UNIX, only to the standard command-line interface.
The editor can be disabled (useful for usage with ESS) using the startup option ‘--no-readline’.
Windows versions of R have somewhat simpler command-line editing: see ‘Console’
under the ‘Help’ menu of the GUI, and the file ‘README.Rterm’ for command-line editing
under Rterm.exe.
When using R with readline capabilities, the functions described below are available.
Many of these use either Control or Meta characters. Control characters, such as
Control-m, are obtained by holding the <CTRL> key down while you press the <m> key, and
are written as C-m below. Meta characters, such as Meta-b, are typed by holding down
<META> and pressing <b>, and written as M-b in the following. If your terminal does not have
a <META> key, you can still type Meta characters using two-character sequences starting
with ESC. Thus, to enter M-b, you could type <ESC><b>. The ESC character sequences are also
allowed on terminals with real Meta keys. Note that case is significant for Meta characters.
C-r text Find the last command with the text string in it.
On most terminals, you can also use the up and down arrow keys instead of C-p and
C-n, respectively.
Appendix D Function and variable index
Symbols: !  !=  %*%  %o%  &  &&  *  +  -  ...  .First  .Last  /  :  <  <=  <<-  ==  >  >=  ?  ^  |  ||  ~
A: abline, ace, add1, anova, aov, aperm, array, as.data.frame, as.vector, attach, attr, attributes, avas, axis
B: boxplot, break, bruto
C: c, C, cbind, coef, coefficients, contour, contrasts, coplot, cos, crossprod, cut
D: data, data.frame, density, detach, dev.list, dev.next, dev.off, dev.prev, dev.set, deviance, diag, dim, dotchart, drop1
E: ecdf, edit, eigen, else, Error, example, exp
F: F, factor, FALSE, fivenum, for, formula, function
G: glm
H: help, help.search, help.start, hist
I: identify, if, ifelse, image, is.na, is.nan
K: ks.test
L: legend, length, levels, lines, list, lm, lme, locator, loess, log, lqs, lsfit
M: mars, max, mean, min, mode
N: NA, NaN, ncol, next, nlm, nlme, nrow
O: order, ordered, outer
P: pairs, par, paste, persp, pictex, plot, pmax, pmin, points, polygon, postscript, predict, print, prod
Q: qqline, qqnorm, qqplot, qr
R: range, rbind, read.table, rep, repeat, resid, residuals, rlm, rm
S: scan, search, seq, shapiro.test, sin, sink, solve, sort, source, split, sqrt, stem, step, sum, summary, svd
T: t, T, t.test, table, tan, tapply, text, title, tree, TRUE
U: unclass, update
V: var, var.test, vector
W: while, wilcox.test
X: X11
Appendix E Concept index
A: Accessing builtin datasets; Additive models; Analysis of variance; Arithmetic functions and operators; Arrays; Assignment; Attributes
B: Binary operators; Box plots
C: Character vectors; Classes; Concatenating lists; Contrasts; Control statements; Customizing the environment
D: Data frames; Default values; Density estimation; Determinants; Diverting input and output; Dynamic graphics
E: Eigenvalues and eigenvectors; Empirical CDFs
F: Factors; Families; Formulae
G: Generalized linear models; Generalized transpose of an array; Generic functions; Graphics device drivers; Graphics parameters; Grouped expressions
I: Indexing of and by arrays; Indexing vectors
K: Kolmogorov-Smirnov test
L: Least squares fitting; Linear equations; Linear models; Lists; Local approximating regressions; Loops and conditional execution
M: Matrices; Matrix multiplication; Maximum likelihood; Missing values; Mixed models
N: Named arguments; Nonlinear least squares
O: Object orientation; Objects; One- and two-sample tests; Ordered factors; Outer products of arrays
P: Packages; Probability distributions
Q: QR decomposition; Quantile-quantile plots
R: Reading data from files; Recycling rule; Regular sequences; Removing objects; Robust regression
S: Scope; Search path; Shapiro-Wilk test; Singular value decomposition; Statistical models; Student’s t test
T: Tabulation; Tree based models
U: Updating fitted models
V: Vectors
W: Wilcoxon test; Workspace; Writing functions
Appendix F References
D. M. Bates and D. G. Watts (1988), Nonlinear Regression Analysis and Its Applications.
John Wiley & Sons, New York.
Richard A. Becker, John M. Chambers and Allan R. Wilks (1988), The New S Language.
Chapman & Hall, New York. This book is often called the “Blue Book”.
John M. Chambers and Trevor J. Hastie, eds. (1992), Statistical Models in S. Chapman
& Hall, New York. This is also called the “White Book”.
Annette J. Dobson (1990), An Introduction to Generalized Linear Models. Chapman and
Hall, London.
Peter McCullagh and John A. Nelder (1989), Generalized Linear Models. Second edition,
Chapman and Hall, London.
John A. Rice (1995), Mathematical Statistics and Data Analysis. Second edition.
Duxbury Press, Belmont, CA.
S. D. Silvey (1970), Statistical Inference. Penguin, London.