An R Companion to Applied Regression, Second Edition
To the memory of my parents, Joseph and Diana
—J. F.
For my teachers, and especially Fred Mosteller, who I think would have liked
this book
—S. W.
An R
COMPANION
to APPLIED
REGRESSION
John Fox
McMaster University
Sanford Weisberg
University of Minnesota
Copyright © 2011 by SAGE Publications, Inc.
All rights reserved. No part of this book may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any information storage and retrieval
system, without permission in writing from the publisher.
Preface
7 Drawing Graphs
7.1 A General Approach to R Graphics
7.1.1 Defining a Coordinate System: plot
7.1.2 Graphics Parameters: par
7.1.3 Adding Graphical Elements: axis, points, lines, text, etc.
7.1.4 Specifying Colors
7.2 Putting It Together: Explaining Nearest-Neighbor Kernel Regression
7.2.1 Finer Control Over Plot Layout
7.3 Lattice and Other Graphics Packages in R
7.3.1 The Lattice Package
7.3.2 Maps
7.3.3 Other Notable Graphics Packages
7.4 Graphics Devices
7.5 Complementary Reading and References
8 Writing Programs
8.1 Defining Functions
8.2 Working With Matrices*
8.2.1 Basic Matrix Arithmetic
8.2.2 Matrix Inversion and the Solution of Linear Simultaneous
Equations
8.2.3 Example: Linear Least-Squares Regression
8.2.4 Eigenvalues and Eigenvectors
8.2.5 Miscellaneous Matrix Computations
8.3 Program Control: Conditionals, Loops, and Recursion
8.3.1 Conditionals
8.3.2 Iteration (Loops)
8.3.3 Recursion
8.4 apply and Its Relatives
8.4.1 To Loop or Not To Loop?
8.5 Illustrative R Programs*
8.5.1 Binary Logistic Regression
8.5.2 Numbers Into Words
8.6 Improving R Programs*
8.6.1 Debugging R Code
8.6.2 Profiling R Functions
8.7 Object-Oriented Programming in R*
8.7.1 The S3 Object System
8.7.2 The S4 Object System
8.8 Writing Statistical-Modeling Functions in R*
8.9 Environments and Scope in R*
8.9.1 Basic Definitions: Frames, Environments, and Scope
8.9.2 Lexical Scoping in R
8.10 R Programming Advice
8.11 Complementary Reading and References
References
Author Index
Subject Index
Command Index
Data Set Index
Package Index
About the Authors
Preface
One of the great strengths of R is that it allows users and experts in particular
areas of statistics to add new capabilities to the software. Not only is it possible
to write new programs in R, but it is also convenient to combine related sets of
programs, data, and documentation in R packages. The previous edition of this
book, published in 2002, touted the then “more than 100 contributed packages
available on the R website, many of them prepared by experts in various areas of
applied statistics, such as resampling methods, mixed models, and survival
analysis” (p. xii). The Comprehensive R Archive Network (abbreviated CRAN
and variously pronounced see-ran or kran) now holds more than 2,500 packages
(see Figure 1, drawn, of course, with R); other R package archives—most
notably the archive of the Bioconductor project, which develops software for
bioinformatics—add several hundred more packages to the total. In the statistical
literature, new methods are often accompanied by implementations in R; indeed,
R has become a kind of lingua franca of statistical computing—at least among
statisticians—although interest in R is also robust in other areas, including the
social and behavioral sciences.
Readers familiar with the first edition of this book will immediately notice two
key changes. First, and most significant, there are now two authors, the first
edition having been written by John Fox alone. Second, “S-PLUS” is missing
from the title of the book (originally An R and S-PLUS Companion to Applied
Regression), which now describes only R. In the decade since the first edition of
the book was written, the open-source, free R has completely eclipsed its
commercial cousin, S-PLUS. Moreover, where R and S-PLUS differ, we believe
that the advantage generally goes to R. Although most of the contents of this
second edition are applicable to S-PLUS as well as to R, we see little reason to
discuss S-PLUS explicitly.
We have added a variety of new material—for example, with respect to
transformations and effects plots—and in addition, virtually all the text has been
rewritten. We have taken pains to make the book as self-contained as possible,
providing the information that a new user needs to get started. Many topics, such
as R graphics (in Chapter 7) and R programming (in Chapter 8), have been
considerably expanded in the second edition.
Figure 1 The number of packages on CRAN grew roughly exponentially from 2001, when reliable data first became available, through 2009.
Source: Fox (2009).
The book has a companion R package called car, and we have substantially
added to, extended, and revised the functions in the car package to make them
more consistent, easier to use, and, we hope, more useful. The new car package
includes several functions inherited from the alr3 package designed to
accompany Weisberg (2005). The alr3 package still exists, but it now contains
mostly data.
Permanently select a CRAN mirror site, so that you don’t have to specify
the mirror in each session that you install or update packages; just
uncomment the following lines (with the exception of the first) in the
Rprofile.site file by removing the pound signs (# ):
# set a CRAN mirror
# local({r <- getOption("repos")
# r["CRAN"] <- "http://my.local.cran"
# options(repos=r)})
You must then replace the dummy site http://my.local.cran with a link
to a real mirror site, such as http://probability.ca/cran for the CRAN
mirror at the University of Toronto. This is, of course, just an example: You
should pick a mirror site near you.
Whenever you start R, automatically update any installed packages for
which new versions are available on CRAN; just insert the following line
into Rprofile.site:
utils::update.packages(ask=FALSE)
A disadvantage of the last change is that starting up R will take a bit longer. If
you find the wait annoying, you can always remove this line from your
Rprofile.site file.
Edit the Rprofile.site file with a plain-text (ASCII) editor, such as Windows
Notepad; if you use a word-processing program, such as Word, make sure to save
the file as plain text.
You can also customize certain aspects of the R graphical user interface via
the Edit → GUI preferences menu.
Permanently select a CRAN mirror site, so that you don’t have to specify
the mirror in each session that you install or update a package. From the
menus in the R Console, select R → Preferences, and then select the Startup
tab. Pick the URL of a mirror site near you.
Whenever you start R, automatically update any installed packages for
which new versions are available. Using a text editor capable of saving
plain-text (ASCII) files (we recommend the free Text Wrangler, which can
also be configured as a programming editor for R), create a file named
.Rprofile in your home directory, being careful not to omit the initial period
(.), and insert the following line in the file:
utils::update.packages(ask=FALSE)
This command also loads all the packages on which the car package depends. If
you want to use still other packages, you need to enter a separate library
command for each. The process of loading packages as you need them will come
naturally as you grow more familiar with R. You can also arrange to load
packages automatically at the beginning of every R session by adding a pair of
commands such as the following to your R profile:
pkgs <- getOption("defaultPackages")
options(defaultPackages = c(pkgs, "car", "alr3"))
All these can be accessed using the carWeb function; after loading the car
package in R, type help(carWeb) for details.
Input and output are printed in slanted and upright monospaced (typewriter)
fonts, respectively—for example,
> mean(1:10) # an input line
[1] 5.5
The > prompt at the beginning of the input and the + prompt (not illustrated
in this example), which begins continuation lines, are provided by R, not
typed by the user.
R input and output are printed as they appear on the computer screen,
although we sometimes edit output for brevity or clarity; elided material in
computer output is indicated by three widely spaced periods (… ).
Data set names, variable names, the names of R functions and operators,
and R expressions that appear in the body of the text are in a monospaced
(typewriter) font: Duncan, income, mean, +, lm(prestige ~ income + education, data=Prestige).
The names of R packages are in boldface: car.
Occasionally, generic specifications (to be replaced by particular
information, such as a variable name) are given in typewriter italics: mean(variable-name).
Menus, menu items, and the names of windows are set in an italic sans-serif
font: File, Exit, R Console.
We use a sans-serif font for other names, such as names of operating
systems, programming languages, software packages, and directories:
Windows, R, SAS, c:\Program Files\R\R-2.11.0\etc.
Graphical output from R is shown in many figures scattered through the text;
in normal use, graphs appear on the computer screen in graphics device windows
that can be moved, resized, copied into other programs, saved, or printed (as
described in Section 7.4).
There is, of course, much to R beyond the material in this book. The S
language is documented in several books by John Chambers and his colleagues:
The New S Language: A Programming Environment for Data Analysis and
Graphics (Becker et al., 1988) and an edited volume, Statistical Models in S
(Chambers and Hastie, 1992), describe what came to be known as S3, including
the S3 object-oriented programming system, and facilities for specifying and
fitting statistical models. Similarly, Programming With Data (Chambers, 1998)
describes the S4 language and object system. The R dialect of S incorporates
both S3 and S4, and so these books remain valuable sources.
Beyond these basic sources, there are now so many books that describe the
application of R to various areas of statistics that it is impractical to compile a
list here, a list that would inevitably be out-of-date by the time this book goes to
press. We include complementary readings at the end of many chapters,
however. There is nevertheless one book that is especially worthy of mention
here: The fourth edition of Modern Applied Statistics With S (Venables and
Ripley, 2002), though somewhat dated, demonstrates the use of R for a wide
range of statistical applications. The book is associated with several R packages,
including the MASS package, to which we make occasional reference. Venables
and Ripley’s text is generally more advanced and has a broader focus than our
book. There are also some differences in emphasis: For example, the R
Companion has more material on diagnostic methods.
Acknowledgments
We are grateful to a number of individuals who provided valuable assistance in
writing this book and its predecessor:
1.1 R Basics
Figure 1.1 shows the RGui (R Graphical User Interface) for the Windows
version of R. The most important element of the RGui is the R Console
window, which initially contains an opening message followed by a line with
just a command prompt—the greater than ( > ) symbol. Interaction with R
takes place at the command prompt. In Figure 1.1, we typed a simple
command, 2 + 3 , followed by the Enter key. R interprets and executes the
command, returning the value 5 , followed by another command prompt.
Figure 1.2 shows the similar R.app GUI for the Mac OS X version of R.
The menus in RGui and R.app provide access to many routine tasks, such
as setting preferences, various editing functions, and accessing
documentation. We draw your attention in particular to the Packages menu in
the Windows RGui and to the Packages & Data menu in the Mac OS X
R.app, both of which provide dialogs for installing and updating R packages.
Unlike many statistical analysis programs, the standard R menus do not
provide direct access to the statistical functions in R, for which you will
generally have to enter commands at the command prompt.
Figure 1.1 The RGui interface to the Windows version of R, shortly after the beginning of a session. This
screen shot shows the default multiple-document interface (MDI); the single-document interface (SDI)
looks similar but consists only of the R Console with the menu bar.
> 2 + 3 # addition
[1] 5
> 2 - 3 # subtraction
[1] -1
> 2*3 # multiplication
[1] 6
> 2/3 # division
[1] 0.6667
> 2^3 # exponentiation
[1] 8
Output lines are preceded by [1] . When the printed output consists of many
values spread over several lines, each line begins with the index number of the
first element in that line; an example will appear shortly. After the interpreter
executes a command and returns a value, it waits for the next command, as
signified by the > prompt. The pound sign ( # ) signifies a comment: Text to the
right of # is ignored by the interpreter. We often take advantage of this feature to
insert explanatory text to the right of commands, as in the examples above.
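For instance (an illustration of our own), printing a vector long enough to wrap across several lines shows the index of the first element of each line; exactly where the output wraps depends on the width of your console:

> 1:30
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30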
Several arithmetic operations may be combined to build up complex
expressions:
> 4^2 - 3*2
[1] 10
[1] -1
[1] 10
and
> (4 + 3)^2
[1] 49
is different from
> 4 + 3^2
[1] 13
[1] 1
> -2 - -3
[1] 1
1.1.2 R FUNCTIONS
In addition to the common arithmetic operators, R includes many—literally
hundreds—of functions for mathematical operations, for statistical data analysis,
for making graphs, and for other purposes. Function arguments are values passed
to functions, and these are specified within parentheses after the function name.
For example, to calculate the natural log of 100, that is, log_e 100 or ln 100, we
type
> log(100)
[1] 4.605
> log(100, base=10) # log of 100 to the base 10
[1] 2
> log(100, b=10) # argument names can be abbreviated
[1] 2
> log10(100) # equivalent
[1] 2
To obtain information about a function, use the help function. For example,
> help(log)
A novel feature of the R help system is the facility it provides to execute most
examples in the help pages via the example command:
> example("log")
log> log(exp(3))
[1] 3
log> log10(1e7) # = 7
[1] 7
The number 1e7 in the last example is given in scientific notation and represents 1 × 10^7 = 10 million.
A quick way to determine the arguments of many functions is to use the args
function:
> args(log)
function (x, base = exp(1))
NULL
Because base is the second argument of the log function, we can also type
> log(100, 10)
[1] 2
> `+`(2, 3) # another way to add 2 and 3
[1] 5
We need to place back-ticks around `+` (single or double quotes also work) so
that the interpreter does not get confused, but our ability to use + and the other
arithmetic functions as in-fix operators, as in 2 + 3 , is really just syntactic
“sugar,” simplifying the construction of R expressions but not fundamentally
altering the functional character of the language.
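As a further illustration of this point (our own example, not from the original text), because + is an ordinary function, it can be passed as an argument to other functions, here to Reduce to sum the integers from 1 to 4:

> Reduce(`+`, 1:4) # ((1 + 2) + 3) + 4
[1] 10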
> c(1, 2, 3, 4) # combine four numbers into a vector
[1] 1 2 3 4
Many other functions also return vectors as results. For example, the sequence
operator ( : ) generates consecutive numbers, while the sequence function ( seq
) does much the same thing, but more flexibly:
> 1:4 # integer sequence
[1] 1 2 3 4
> 4:1
[1] 4 3 2 1
> -1:2
[1] -1 0 1 2
> seq(1, 4)
[1] 1 2 3 4
> seq(2, 8, by=2) # specify the interval between elements
[1] 2 4 6 8
> seq(0, 1, by=0.1) # non-integer intervals work too
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> seq(0, 1, length=11) # specify the number of elements
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
The standard arithmetic operators and functions apply to vectors on an
element-wise basis:
> c(1, 2, 3, 4)/2
[1] 0.5 1.0 1.5 2.0
> log(c(0.1, 1, 10, 100), 10)
[1] -1 0 1 2
If the operands are of different lengths, then the shorter of the two is extended by
repetition, as in c(1, 2, 3, 4)/2 above; if the length of the longer operand is
not a multiple of the length of the shorter one, then a warning message is printed,
but the interpreter proceeds with the operation, recycling the elements of the
shorter operand:
> c(1, 2, 3, 4) + c(4, 3) # no warning
[1] 5 5 7 7
> c(1, 2, 3, 4) + c(4, 3, 2) # produces a warning
[1] 5 5 5 8
Warning message:
In c(1, 2, 3, 4) + c(4, 3, 2) :
longer object length is not a multiple of shorter object length
R would also be of little use if we were unable to save the results returned by
functions; we do so by assigning values to variables, as in the following
example:
> x <- c(1, 2, 3, 4) # assignment
> x # print
[1] 1 2 3 4
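For example, we can compute the square roots of the elements of x, assign the result to a new variable y, and print it, all in one command (the number of digits printed depends on your options settings):

> (y <- sqrt(x))
[1] 1.000 1.414 1.732 2.000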
In the last example, sqrt is the square-root function, and thus sqrt(x) is
equivalent to x^0.5 . To obtain printed output without having to type the name
of the variable y as a separate command, we enclose the command in
parentheses so that the assignment is no longer the leftmost operation. We will
use this trick regularly.
Unlike in many programming languages, variables in R are dynamically defined.
We need not tell the interpreter in advance how many values x is to hold or
whether it contains integers (whole numbers), real numbers, character values, or
something else. Moreover, if we wish, we may redefine the variable x :
> (x <- rnorm(100)) # 100 standard normal random numbers
summary prints the minimum and maximum values of its argument, along with
the mean, median, and first and third quartiles. Applied to another kind of object
—a matrix, for example— summary gives different information, as we will see
later.
is a character vector whose elements are character strings. There are R functions
to work with character data. For example, to turn this vector into a single
character string:
> paste(words, collapse=" ")
The very useful paste function pastes strings together (and is discussed, along
with other functions for manipulating character data, in Section 2.4). The
collapse argument, as its name implies, collapses the character vector into a
single string, separating the elements with whatever is between the quotation
marks, in this case one blank space.
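A brief, self-contained illustration, with a words vector of our own choosing:

> words <- c("To", "be", "or", "not", "to", "be")
> paste(words, collapse=" ")
[1] "To be or not to be"
> paste(words, collapse="/")
[1] "To/be/or/not/to/be"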
A logical vector has all its elements either TRUE or FALSE :
> (vals <- c(TRUE, TRUE, FALSE, TRUE))
[1]  TRUE  TRUE FALSE  TRUE
If we use logical values in arithmetic, R treats FALSE as if it were a zero and TRUE
as if it were a one:
> sum(vals)
[1] 3
> sum(!vals)
[1] 1
A vector that mixes character strings with numbers or logical values is treated as a character vector, while a vector of mixed numbers and logical values is treated as numeric, with FALSE becoming zero and TRUE becoming one. (Try it!) In the first case, we say that the logical and numeric values are coerced to character; in the second case, the logical values are coerced to numeric. In general, coercion in R takes place naturally and is designed to lose as little information as possible (see Section 2.6).
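A quick illustration of both kinds of coercion (our own example):

> c("A", 1, TRUE) # coerced to character
[1] "A"    "1"    "TRUE"
> c(1, 2, TRUE, FALSE) # coerced to numeric
[1] 1 2 1 0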
[1] 1.817
> words[2] # second element
[1] "be"
[1] FALSE
The parentheses around 11:100 serve to avoid generating numbers from -11 to
100, which would result in an error. (Try it!)
A vector can also be indexed by a logical vector of the same length. Logical
values frequently arise through the use of comparison operators:
== equals
!= not equals
<= less than or equals
< less than
> greater than
>= greater than or equals
The double-equals sign ( == ) is used for testing equality, because = is reserved
for specifying function arguments and for assignment.
Logical values may also be used in conjunction with the logical operators:
& and
| or
Here are some simple examples:
> 1 == 2
[1] FALSE
> 1 != 2
[1] TRUE
> 1 <= 2
[1] TRUE
> 1 < 1:3
[1] FALSE  TRUE  TRUE
A somewhat more extended example illustrates the use of the comparison and
logical operators:
> (z <- x[1:10]) # first 10 elements of x
> z < -0.5 # is each element less than -0.5?
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
> z > 0.5 # is each element greater than 0.5?
[1] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
> z < -0.5 | z > 0.5 # < and > of higher precedence than |
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
> abs(z) > 0.5 # absolute value, equivalent to the previous expression
[1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
> z[!(abs(z) > 0.5)] # elements of z for which the condition is FALSE
The last of these commands uses the ! operator, introduced in the last section, to
negate the logical values returned by abs(z) > 0.5 and thus returns the
observations for which the condition is FALSE .
A few pointers about using these operators:
We need to be careful in typing z < -0.5; although most spaces in R commands are optional, the space after < is crucial: z <-0.5 would assign the value 0.5 to z. Even when spaces are not required around operators,
they usually help to clarify R commands.
Logical operators have lower precedence than comparison operators, and so z < -0.5 | z > 0.5 is equivalent to (z < -0.5) | (z > 0.5). When in
doubt, parenthesize!
The abs function returns the absolute value of its argument.
As the last two commands illustrate, we can index a vector by a logical
vector of the same length, selecting the elements with TRUE indices.
> mean(x)
[1] 0.2452
> sum(x)/length(x)
[1] 0.2452
> myMean <- function(x) sum(x)/length(x)
Having defined the function myMean , we may use it in the same manner as the
standard R functions. Indeed, most of the standard functions in R are themselves
written in the R language.6
> myMean(x)
[1] 0.2452
> myMean(y)
[1] 1.537
> myMean(1:100)
[1] 50.5
> myMean(sqrt(1:100))
[1] 6.715
You can move the cursor with the left and right arrow, Home, and End keys.
The Delete key deletes the character under the cursor.
The Backspace key deletes the character to the left of the cursor.
The standard Windows Edit menu and keyboard shortcuts may be
employed, along with the mouse, to block, copy, and paste text.
In addition, R implements a command-history mechanism that allows you
to recall and edit previously entered commands without having to retype
them. Use the up and down arrow keys to move backward and forward in
the command history. Press Enter in the normal manner to submit a
recalled, and possibly edited, command to the interpreter.
> myVar <- function(x) sum((x - myMean(x))^2)/(length(x) - 1)
> myVar(1:100)
[1] 841.7
> var(1:100) # check against the standard var function
[1] 841.7
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> myVar(letters)
Error in sum(x) : invalid 'type' (character) of argument
The built-in variable letters contains the lowercase letters, and of course,
calculating the variance of character data makes no sense. Although the
source of the problem is obvious, the error occurs in the sum function, not
directly in myVar ; traceback shows the sequence of function calls
culminating in the error:
> traceback()
2: myMean(x)
1: myVar(letters)
Not all errors generate error messages. Indeed, the ones that do not are
more pernicious, because you may fail to notice them. Always check your
output for reasonableness, and follow up suspicious results.
If you need to interrupt the execution of a command, you may do so by
pressing the Esc (escape) key, by using the mouse to press the Stop button
in the toolbar, or (under Windows) by selecting the Misc → Stop current
computation menu item.
There is much more information on debugging R code in Section 8.6.1.
> apropos("log")
…
[7] "dlogis" "is.logical"
[9] "log" "log10"
[11] "log1p" "log2"
[13] "logb" "Logic"
[15] "logical" "logLik"
…
Casting a broader net, the help.search command searches the titles and
certain other fields in the help files of all R packages installed on your
system, showing the results in a pop-up window. For example, try the
command help.search("loglinear") to find functions related to loglinear
models (discussed in Section 5.6). The ?? operator is a synonym for
help.search —for example, ??loglinear .
If you have an active Internet connection, you can search even more
broadly with the RSiteSearch function. For example, to look in all standard
and CRAN packages—even those not installed on your system—for
functions related to loglinear models, you can issue the command
RSiteSearch("loglinear", restrict= "functions") . The results
appear in a web browser. See ?RSiteSearch for details.
The CRAN task views are documents that describe facilities in R for
applications in specific areas such as Bayesian statistics, econometrics,
psychometrics, social statistics, and spatial statistics. The approximately
two-dozen task views are available via the command carWeb("taskviews"), which uses the carWeb function from the car package, or directly by pointing your browser at http://cran.r-project.org/web/views/.
The command help(package="package-name")—for example, help(package="car")—shows information about an installed package, such as an index of help topics documented in the package.
Some packages contain vignettes, discursive documents describing the use
of the package. To find out what vignettes are available in the packages
installed on your system, enter the command vignette(). The command vignette(package="package-name") displays the vignettes available in a particular installed package, and the command vignette("vignette-name") or vignette("vignette-name", package="package-name") opens a specific vignette.
The Help menu in the Mac OS X and Windows versions of R provides self-
explanatory menu items to access help pages, online manuals, the apropos
function, and links to the R websites.
As you might expect, help on R is available on the Internet from a wide
variety of sources. The website www.rseek.org provides a custom Google
search engine specifically designed to look for R-related documents (try
searching for car using this search site). The page www.r-project.org/search.html lists other possibilities for web searching.
Finally, R-help is a very active email list devoted to answering users’ questions about R, and there are also several more specialized R mailing lists (see www.r-project.org/mail.html). Before posting a question to R-help or to one of the other email lists, however, please carefully read the posting guide at www.r-project.org/posting-guide.html.
1.1.10 CLEANING UP
User-defined variables and functions exist in R in a region of memory called
the workspace. The R workspace can be saved at the end of a session or even
during the session, in which case it is automatically loaded at the start of the next
session. Different workspaces can be saved in different directories, as a means of
keeping projects separate. Starting R in a directory loads the corresponding
workspace.11
The objects function lists the names of variables and functions residing in the
R workspace:
> objects()
We keep the functions myMean and myVar , pretending that we still intend to use
them.
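The clean-up step described here uses the remove function; a sketch, assuming the workspace contains the objects created earlier in the chapter (the exact contents of your own workspace may differ):

> remove(vals, words, x, y, z)
> objects()
[1] "myMean" "myVar"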
Answering y will save the workspace in the current directory, an operation that
we generally do not recommend;12 use n to avoid saving the workspace or c to
cancel quitting. Entering quit(save="n") suppresses the question. You can also
exit from R via the File menu or by clicking on the standard close-window
button—the red button at the upper right in Windows and the upper left in Mac
OS X.
In the input data file, the variable type contains character data, which
read.table by default converts into a factor—an R representation of categorical
data. The summary function simply counts the number of observations in each
level (category) of the factor. The variables income , education , and prestige
are numeric, and the summary function reports the minimum, maximum, median,
mean, and first and third quartiles for each numeric variable.
To access a specific variable in the data frame, we need to provide its fully
qualified name—for example, Duncan$prestige for the variable prestige in the
Duncan data frame. Typing the full name can get tedious, and we can avoid this
repetition by using the attach function. Attaching the Duncan data frame allows
us to access its columns by name, much as if we had directly defined the
variables in the R workspace:
> attach(Duncan)
> prestige
[1] 82 83 90 76 90 87 93 90 52 88 57 89 97 59 73 38 76 81 45 92
[21] 39 34 41 16 33 53 67 57 26 29 10 15 19 10 13 24 20 7 3 16
[41] 6 11 8 41 10
Reading and manipulating data is the subject of Chapter 2, where the topic is
developed in much greater detail. In particular, in Section 2.2 we will show you
generally better ways to work with data frames than to attach them.
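One such alternative is the with function, which evaluates an expression in the context of a data frame without attaching it; a brief illustration of our own:

> Duncan$prestige[1:5] # fully qualified name
[1] 82 83 90 76 90
> with(Duncan, prestige[1:5]) # equivalent, without attaching the data frame
[1] 82 83 90 76 90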
The function hist doesn’t return a value in the R console but rather is used for
the side effect of drawing a graph, in this case a histogram.15 The
histogram may be copied to the clipboard, saved to a file, or printed (see
Section 1.1.7).
The distribution of prestige appears to be bimodal, with observations
stacking up near the lower and upper boundaries. Because prestige is a
percentage, this behavior is not altogether unexpected—many occupations will
either be low prestige, near the lower boundary, or high prestige, near the upper
boundary, with fewer occupations in the middle. Variables such as this often
need to be transformed, perhaps with a logit (log-odds) or similar
transformation. As it turns out, however, it will prove unnecessary to transform
prestige .
We should also examine the distributions of the predictor variables, along with
the relationship between prestige and each predictor, and the relationship
between the two predictors. The pairs function in R draws a scatterplot matrix.
The pairs function is quite flexible, and we take advantage of this flexibility by
placing histograms for the variables along the diagonal of the graph. To better
discern the pairwise relationships among the variables, we augment each
scatterplot with a least-squares line and with a nonparametric-regression
smooth:16
> pairs(cbind(prestige, income, education),
+     panel=function(x, y){
+         points(x, y)
+         abline(lm(y ~ x), lty="dashed")
+         lines(lowess(x, y))
+     },
+     diag.panel=function(x){
+         par(new=TRUE)
+         hist(x, main="", axes=FALSE)
+     }
+ )

Figure 1.5 Scatterplot matrix for prestige, income, and education from Duncan’s data.
Don’t let the apparent complexity of this command worry you. Most graphs that
you will want to draw are much simpler to produce than this one. Later in this
Companion, we will describe functions in the car package that simplify the
drawing of interesting graphs, including scatterplot matrices. Nevertheless, this
call to pairs allows us to illustrate the structure of commands in R:
3. lines(lowess(x, y)) draws a solid line, the default line type, showing the
nonparametric regression of y on x . The lowess function computes and
returns coordinates for points on a smooth curve relating y to x ; these
coordinates are passed as an argument to lines , which connects the points
with line-segments on the graph.
Because there is more than one R command in the function body, these
commands are enclosed as a block in curly braces, { and } . We indented the
lines in the command to reveal the structure of the R code; this convention is
optional but advisable. If no panel function is specified, then panel defaults to
points . Try the simple command:
> pairs(cbind(prestige, income, education))
or, equivalently,
> pairs(Duncan[, -1])
This latter form uses all the columns in the Duncan data set except the first.
1. par(new=TRUE) prevents the hist function from trying to clear the graph.
High-level R plotting functions, such as plot , hist , and pairs , by default
clear the current graphics device prior to drawing a new plot. Lower-level
plotting functions, such as points , abline , and lines , do not clear the
current graphics device by default but rather add elements to an existing
graph (see Section 7.1 for details).
The special formal argument … (the ellipses) will match any number of actual
arguments when the function is called—for example,
> scatmat(prestige, income, education)
produces a graph identical to the one in Figure 1.5. The scatterplotMatrix function in the car package (described in Section 3.3.2) is considerably more
flexible than the scatmat function just defined.
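A function along the lines just described could be written by wrapping the earlier call to pairs and passing its arguments through via the ellipses; a sketch:

> scatmat <- function(...) {
+     pairs(cbind(...),
+         panel=function(x, y){
+             points(x, y)
+             abline(lm(y ~ x), lty="dashed")
+             lines(lowess(x, y))
+         },
+         diag.panel=function(x){
+             par(new=TRUE)
+             hist(x, main="", axes=FALSE)
+         }
+     )
+ }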
In many graphs, we would like to identify unusual points by marking them
with a descriptive label. Point identification in R is easier in a scatterplot than in
a scatterplot matrix, and so we draw a separate scatterplot for education and
income :
> plot(income, education)
> identify(income, education, row.names(Duncan))
[1]  6 16 27
> (duncan.model <- lm(prestige ~ income + education))

Call:
lm(formula = prestige ~ income + education)

Coefficients:
(Intercept)       income    education
     -6.065        0.599        0.546
Because we previously attached the Duncan data frame, we can access the
variables in it by name. The argument to lm is a linear-model formula, with the
response variable, prestige , on the left of the tilde ( ~ ). The right-hand side of
the model formula specifies the predictor variables in the regression, income and
education . We read the formula as “ prestige is regressed on income and
education .”
The lm function returns a linear-model object, which we assign to the variable
duncan.model . As we explained in Section 1.1.3, enclosing the assignment in
parentheses causes the assigned object to be printed, here producing a brief
report of the results of the regression. The summary function produces a more
complete report:
> summary(duncan.model)
Call:
lm(formula = prestige ~ income + education)
Both income and education have highly statistically significant, and rather
large, regression coefficients: For example, holding education constant, a 1% increase in high-income earners is associated on average with an increase of about 0.6% in high prestige ratings.
R writes very small and very large numbers in scientific notation. For
example, 1.1e-05 is to be read as 1.1 × 10^-5, or 0.000011, and 2e-16 = 2 × 10^-16, which is effectively zero.
If you find the statistical significance asterisks that R prints annoying, as we
do, you can suppress them, as we will in the remainder of this Companion, by
entering the command:
> options(show.signif.stars=FALSE)
As usual, placing this command in one of R’s start-up files will permanently
banish the offending asterisks (see the discussion of configuring R in the
Preface).20 Linear models are described in much more detail in Chapter 4.
Loading the car package also loads some other packages on which car depends.
The lm object duncan.model contains a variety of information about the
regression. The rstudent function uses some of this information to calculate
Studentized residuals for the model. A histogram of the Studentized residuals
(Figure 1.7) is unremarkable:
> hist(rstudent(duncan.model))
Observe the sequence of operations here: rstudent takes the linear-model object
duncan.model , previously returned by lm , as an argument and returns the
Studentized residuals, which are passed to hist , which draws the histogram.
If the errors in the regression are normally distributed with zero means and
constant variance, then the Studentized residuals are each t-distributed with n − k
− 2 degrees of freedom, where k is the number of coefficients in the model
excluding the regression constant and n is the number of observations. The
generic qqPlot function from the car package, which makes quantile-
comparison plots, has a method for linear models:
> qqPlot(duncan.model, labels=row.names(Duncan), id.n=3)
Figure 1.8 Quantile-comparison plot for the Studentized residuals from the regression of prestige on
income and education . The broken lines show a bootstrapped pointwise 95% confidence envelope for the
points.
The resulting plot is shown in Figure 1.8. The qqPlot function extracts the
Studentized residuals and plots them against the quantiles of the appropriate t-
distribution. If the Studentized residuals are t-distributed and n − k − 2 is large
enough so that we can ignore the correlation between the Studentized residuals,
then the points should lie close to a straight line. The comparison line on the plot
is drawn by default by robust regression. In this case, the residuals pull away
slightly from the comparison line at both ends, suggesting that the residual
distribution is a bit heavy-tailed. By default, qqPlot produces a bootstrapped
pointwise 95% confidence envelope for the Studentized residuals. The residuals
nearly stay within the boundaries of the envelope at both ends of the distribution.
Figure 1.9 Index plots of Cook’s distance and hat-values, from the regression of prestige on income and
education .
Most of the graphical methods in the car package have arguments that modify
the basic plot. For example, the grid lines on the graph are added by default to
most car graphics; you can suppress the grid lines with the argument
grid=FALSE .21
The car graphics functions also have arguments that are used to identify
points by their labels. In Figure 1.8 we set the argument labels=row.names(Duncan) to tell the function the labels to use and the argument id.n=3 to label the three points farthest from the mean on the horizontal axis. The default in car graphical functions is id.n=0, to suppress
point identification. Section 3.5 provides a more complete discussion of point
labeling.
We proceed to check for high-leverage and influential observations by plotting
hat-values (Section 6.3.2) and Cook’s distances (Section 6.3.3) against the
observation indices:
> influenceIndexPlot(duncan.model, vars=c("Cook", "hat"), id.n=3)
The plots are shown in Figure 1.9. We used the id.n=3 argument to label the
three most extreme points in each figure. Points are labeled by row number by
default. Our attention is drawn to cases 6 and 16, which are flagged in both
graphs, and which correspond to the following occupations:
> rownames(Duncan)[c(6, 16)]
[1] "minister" "contractor"
Figure 1.10 Added-variable plots for income and education in Duncan’s occupational-prestige regression.
Each added-variable plot displays the conditional, rather than the marginal,
relationship between the response and one of the predictors. Points at the
extreme left or right of the plot correspond to points that are potentially
influential, and possibly jointly influential. Figure 1.10 confirms and strengthens
our previous observations: We should be concerned about the occupations
minister (6) and conductor (16), which work together to decrease the income
coefficient and increase the education coefficient. Occupation RR.engineer
(27) has relatively high leverage on these coefficients but is more in line with the
rest of the data. The argument id.cex=0.75 makes the labels smaller to fit well
into the plot. By specifying id.n=3 , the avPlots function automatically labels
the three most extreme points on the horizontal axis and the three points with the
largest residuals.
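Given the arguments described in this paragraph, the added-variable plots in Figure 1.10 would be drawn by a call to the avPlots function in the car package along these lines (a sketch):

> avPlots(duncan.model, id.n=3, id.cex=0.75)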
We next use the crPlots function, also in the car package, to generate
component-plus-residual plots for income and education (as discussed in
Section 6.4.2):
> crPlots(duncan.model, span=0.7)
Figure 1.11 Component-plus-residual plots for income and education in Duncan’s occupational-prestige
regression. The span of the nonparametric-regression smoother was set to 0.7.
Figure 1.12 Spread-level plot of Studentized residuals from Duncan’s regression of prestige on income
and education .
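The spread-level plot in Figure 1.12 and the score tests for nonconstant error variance discussed next would be produced by calls along these lines to the spreadLevelPlot and ncvTest functions in the car package (a sketch; output omitted):

> spreadLevelPlot(duncan.model)
> ncvTest(duncan.model) # error variance varying with the fitted values
> ncvTest(duncan.model, ~ income + education) # varying with the predictors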
Both tests are far from statistically significant, indicating that the assumption of
constant variance is tenable.
Finally, on the basis of the influential-data diagnostics, we try removing the
observations minister and conductor from the regression:
> summary(update(duncan.model, subset=-c(6, 16)))
Call:
lm(formula = prestige ~ income + education, subset = -c(6, 16))
Rather than respecifying the regression model from scratch, we refit it using the
update function, removing the two potentially problematic observations via the
subset argument to update . The coefficients of income and education have
changed substantially with the deletion of these two observations. Further work
(not shown) suggests that removing occupations 27 ( RR.engineer ) and 9(
reporter ) does not make much of a difference.
Chapter 6 has much more extensive information on regression diagnostics in
R, including the use of the various functions in the car package.
1.3 R Functions for Basic Statistics
Many of the most commonly used functions in R, such as summary , print , and
plot , can have very different actions depending on the arguments passed to the
function.22 For example, the summary function applied to different columns of
the Duncan data frame produces different output. The summary for the variable
Duncan$type is the count in each level of this factor,
> summary(Duncan$type)
bc prof wc
21 18 6
while for a numeric variable, the summary includes the mean, minimum,
maximum, and several quantiles:
> summary(Duncan$prestige)
[1] "factor"
> class(Duncan$prestige)
[1] "integer"
> class(Duncan)
[1] "data.frame"
[1] "lm"
> summary
function (object, ...)
UseMethod("summary")
<environment: namespace:base>
The generic function summary has one required argument, object , and the
special argument … (the ellipses) for additional arguments that could be different
for each summary method. When UseMethod("summary") is applied to an object
of class "lm" , for example, R searches for a method function named summary.lm
and, if it is found, executes the command summary.lm(object, …) . It is,
incidentally, perfectly possible to call summary.lm directly; thus, the following
two commands are equivalent:
> summary(duncan.model)
> summary.lm(duncan.model)
Although the generic summary function has only one explicit argument, the
method function summary.lm has additional arguments:
> args(summary.lm)
function (object, correlation = FALSE, symbolic.cor = FALSE,
    ...)
NULL
Because the arguments correlation and symbolic.cor have default values (
FALSE , in both cases), they need not be specified. Any additional arguments that
are supplied, which are covered by … , could be passed to functions that might be
called by summary.lm .
Although in this instance we can call summary.lm directly, many method
functions are hidden in the namespaces of packages and cannot normally be used
directly.25 In any event, it is good R form to use method functions indirectly
through their generics.
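To make the dispatch mechanism concrete, here is a toy example of our own: we attach a made-up class to an object, define a summary method for that class, and let the generic find the method:

> obj <- structure(list(value=10), class="myclass") # a made-up class
> summary.myclass <- function(object, ...) {
+     cat("an object of class myclass with value", object$value, "\n")
+ }
> summary(obj) # dispatches to summary.myclass
an object of class myclass with value 10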
Suppose that we invoke the hypothetical generic function fun with argument
arg of class "cls" . If there is no method function named fun.cls , then R looks
for a method named fun.default. For example, objects belonging to classes without summary methods are printed by summary.default. If, under these circumstances, there is no method named fun.default, then R reports an error.
We can get a listing of all currently accessible method functions for the
generic summary using the methods function, with hidden methods flagged by
asterisks:
> methods(summary)
These methods may have different arguments beyond the first, and some method
functions, for example, summary.lm , have their own help pages: ?summary.lm .
Method selection is slightly more complicated for objects whose class is a
vector of more than one element. Consider, for example, an object returned by
the glm function (anticipating a logistic-regression example developed in Section
5.3):
> mod.mroz <- glm(lfp ~ ., family=binomial, data=Mroz)
> class(mod.mroz)
[1] "glm" "lm"
There are currently several statistical GUIs to R, the most extensive of which is
the R Commander, implemented in the Rcmdr package.26 The R Commander
began as an interface to be used in an elementary statistics course but has
expanded beyond that original purpose. Most of the statistical analysis described
in this book can be carried out using menu items in the R Commander.
The R Commander, or any other well-designed statistical GUI, can help
students learn new ideas, by separating the need for memorizing computer
commands from the corresponding statistical concepts. A GUI can also assist a
user who is familiar with the statistical ideas but not with R to make substantial
and rapid progress. Finally, the infrequent user of R may find that a GUI
provides access to the program without the burden of learning and relearning R
commands.
With the good comes the bad:
For example, to read the data from the file Duncan.txt into an R data frame,
and to make that data frame the active data set in the R Commander, select Data
→ Import data → from text file, clipboard, or URL, and then complete the
resulting dialog box, which allows you to navigate to the location of the data file.
The R Commander generates and executes an appropriate command, which it
also writes into its Script Window. Commands and associated printed output
appear in the Output Window, and error and other messages in the Messages
window. Graphs appear in a standard R graphics device window.
To continue the example, to perform a least-squares regression of prestige on
income and education , select Statistics → Fit models → Linear regression (or
Linear model), and complete the dialog. The R Commander generates an lm
command. The linear-model object produced by the command becomes the
active model in the R Commander, and various tests, graphs, and diagnostics can
subsequently be accessed under the Model menu.
For more information, see the introductory manual provided by Help →
Introduction to the R Commander and the help page produced by Help →
Commander help. To install the Rcmdr package, use the command
install.packages("Rcmdr", dependencies=TRUE) , and be prepared to wait
awhile as the many direct and indirect dependencies of the Rcmdr are
downloaded from CRAN and installed on your system.
_________________________________
1Section 1.1.6 briefly discusses user-defined functions; the topic is treated in greater depth in Chapter 8.
Experienced programmers can also access programs written in Fortran and C from within R.
2We refer to vectors as “lists” using that term loosely, because lists in R are a distinct data structure
(described in Section 2.3).
3R also permits a right-pointing arrow for assignment, as in 2 + 3 -> x.
4Nonstandard names may be used in a variety of contexts, including assignments, by enclosing the names
in back-ticks, or in single or double quotes (e.g., 'first name' <- "John"). In most circumstances,
however, nonstandard names are best avoided.
5We could not resist writing that sentence! Actually, however, function is a special form, not a true
function, because its arguments (here, the formal argument x ) are not evaluated. The distinction is
technical, and it will do no harm to think of function as a function that returns a function as its result.
6Some of the standard R functions are primitives, in the sense that they are defined in code written in the
lower-level languages C and Fortran.
7The menu selection Help → Console will display these hints.
8See Section 7.4 for more information on handling graphics devices in R.
9Sometimes, however, this testing may convince you that the published results are wrong, but that is
another story.
10In addition, we have already introduced the help.start command, and in Section 4.9, we describe the
use of the hints function in the hints package to obtain information about functions that can be used with a
particular R object.
11See the R documentation for additional information on organizing separate projects.
12A saved workspace will be loaded automatically in a subsequent session, a situation that often results in
confusion, in our experience, especially among new users of R. We therefore recommend that you start each
R session with a pristine workspace and instead save the script of the commands you use during a session
that you may wish to recover (see the discussion of programming editors in Section 1.1.7). Objects can then
conveniently be re-created as needed by executing the commands in the saved script. Admittedly, whether
to save workspaces or scripts of commands is partly a matter of preference and habit.
13The Duncan.txt file, along with the other files used in this text, is available on the website for this
Companion, at the web address given in the Preface. To reproduce the example, download the data file to a
convenient location on your hard disk. Alternatively, you can open a copy of the file in your web browser
with the command carWeb(data="Duncan.txt") and then save it to your disk.
14As we will explain in Chapter 2, we can read data into R from a very wide variety of sources and formats.
The format of the data in Duncan.txt is particularly simple, however, and furnishes a convenient initial
example.
15Like all functions, hist does return a result; in this case, however, the result is invisible and is a list
containing the information necessary to draw the histogram. To render the result visible, put parentheses
around the command: (hist(prestige)) . Lists are discussed in Section 2.3.
16Nonparametric regression is discussed in the online appendix to the book. Here, the method is used
simply to pass a smooth curve through the data.
17The function is termed anonymous because it literally is never given a name: The function object
returned by function is left unassigned.
18Chapter 7 discusses the construction of R graphics, including the selection of line types.
19Control doesn’t return to the R command prompt until you exit from point-identification mode. New
users of R occasionally think that R has frozen when they simply have failed to exit from identify .
20If you like the significance stars, you may need to set options(useFancyQuotes=FALSE) to get the legend
about the stars to print correctly in some cases, for example, in a LATEX document.
21Grid lines can be added to most plots by first drawing the plot with the plot function and then calling the grid function to add the grid lines. In car graphics functions, we use grid(lty=1) to get solid-line grids rather
than the dotted lines that are the default.
22The generic print function is invoked implicitly and automatically when an object is printed, for
example, by typing the name of the object at the R command prompt, or in the event that the object returned
by a function isn’t assigned to a variable. The print function can also be called explicitly, however.
23Indeed, everything in R that is returned by a function is an object, but some functions have side effects
that create nonobjects, such as files and graphs.
24More information on the S3 and S4 object systems is provided in Section 8.7.
25For example, the summary method summary.loess is hidden in the namespace of the stats package; to
call this function directly to summarize an object of class "loess" , we could reference the function with
the nonintuitive name stats:::summary.loess.
26A disclaimer: We are not impartial, because one of us (Fox, 2005b) wrote the Rcmdr package and the
other one insisted that it at least be mentioned in this chapter.
2 Reading and Manipulating Data
There are often many ways to accomplish a task in R. For common tasks
such as reading data into an R data frame from a text file, we will
generally explain a few good ways to proceed rather than aiming at an
exhaustive treatment.
We limit the presentation to those aspects of the R language that are most
useful to practicing data analysts. For example, we avoid a fully general
exposition of R modes and classes.
Many functions in R, including those in the car package, have a dizzying
array of options. We will generally only describe the most important
options, or, in some cases, none of the options. As you gain experience
with R you can learn about the options by reading the help pages that are
included in R packages.
We suggest that users of R adopt conventions that will facilitate their
work and minimize confusion, even when the R language does not
enforce these conventions. For example, in the car package we begin the
names of data frames with uppercase letters and the names of variables
in the data frames with lowercase letters.
Section 2.1 describes how to read data into R variables and data frames.
Section 2.2 explains how to work with data stored in data frames. Section 2.3
introduces matrices, higher-dimensional arrays, and lists. Section 2.4 explains
how to manipulate character data in R. Section 2.5 discusses how to handle large
data sets in R. Finally, Section 2.6 deals more abstractly with the organization of
data in R, explaining the notions of classes, modes, and attributes, and describing
the problems that can arise with floating-point calculations.
[1] 1 2 3 4
Entering data in this manner works well for very short vectors. You may have
noticed in some of the previous examples that when an R statement is continued
on additional lines, the > (greater than) prompt is replaced by the interpreter with
the + (plus) prompt on the continuation lines. R recognizes that a line is to be
continued when it is syntactically incomplete— for example, when a left
parenthesis needs to be balanced by a right parenthesis or when the right
argument to a binary operator, such as * (multiplication), has not yet been
entered. Consequently, entries using c may be continued over several lines
simply by omitting the terminal right parenthesis until the data are complete. It
may be more convenient, however, to use the scan function, to be illustrated
shortly, which prompts with the index of the next entry.
Consider the data in Table 2.1. The data are from an experiment (Fox and
Guyer, 1978) in which each of 20 four-person groups of subjects played 30 trials
of a prisoners’ dilemma game in which subjects could make either cooperative or
competitive choices. Half the groups were composed of women and half of men.
Half the groups of each sex were randomly assigned to a public-choice
condition, in which the choices of all the individuals were made known to the
group after each trial, and the other groups were assigned to an anonymous-
choice condition, in which only the aggregated choices were revealed. The data
in the table give the number of cooperative choices made in each group, out of
30 × 4 = 120 choices in all.
To enter the number of cooperative choices as a vector, we could use the c
function, typing the data values separated by commas, but instead we will
illustrate the use of the scan function:
> (cooperation <- scan())
1: 49 64 37 52 68 54
7: 61 79 64 29
11: 27 58 52 41 30 40 39
18: 44 34 44
21:
Read 20 items
[1] 49 64 37 52 68 54 61 79 64 29 27 58 52 41 30 40 39 44 34 44
The number before the colon on each input line is the index of the next
observation to be entered and is supplied by scan ; entering a blank line
terminates scan . We entered the data for the Male, Public-Choice treatment
first, followed by the data for the Female, Public-Choice treatment, and so on.
We could enter the condition and sex of each group in a similar manner, but
because the data are patterned, it is more economical to use the rep (replicate)
function. The first argument to rep specifies the data to be repeated; the second
argument specifies the number of repetitions:
> rep(5, 3)
[1] 5 5 5
> rep(1:3, 2) # repeat 1:3 twice
[1] 1 2 3 1 2 3
When the first argument to rep is a vector, the second argument can be a vector
of the same length, specifying the number of times each entry of the first
argument is to be repeated:
> rep(1:3, 3:1)
[1] 1 1 1 2 2 3
The vector sex requires using rep twice, first to generate five "male" character
strings followed by five "female" character strings, and then to repeat this
pattern twice to get all 20 values.
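The commands described in the last two paragraphs would be along these lines, with the level names taken from the description of the experiment above (a sketch):

> condition <- rep(c("public", "anonymous"), c(10, 10))
> sex <- rep(rep(c("male", "female"), c(5, 5)), 2)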
Finally, it is convenient to put the three variables together in a data frame:
> (Guyer <- data.frame(cooperation, condition, sex))
The original variables condition and sex are character vectors. When vectors
of character strings are put into a data frame, they are converted by default into
factors, which is almost always appropriate for subsequent statistical analysis of
the data. (The important distinction between character vectors and factors is
discussed in Section 2.2.4.)
R has a bare-bones, spreadsheet-like data editor that may be used to enter,
examine, and modify data frames. We find this editor useful primarily for
modifying individual values—for example, to fix an error in the data. If you
prefer to enter data in a spreadsheet-like environment, we suggest using one of
the popular and more general spreadsheet programs such as Excel or OpenOffice
Calc, and then importing your data into R.
To enter data into a new data frame using the editor, we may type the
following:
> Guyer <- edit(as.data.frame(NULL))
This command opens the data editor, into which we may type variable names
and data values. An existing data frame can be edited with the fix function, as in
fix(Guyer) .
The fix function can also be used to examine an existing data frame, but the
View function is safer and more convenient: safer because View cannot modify
the data, and more convenient because the View spreadsheet window can remain
open while we continue to work at the R command prompt. In contrast, the R
interpreter waits for fix or edit to terminate before returning control to the
command prompt.
The first line of the file gives the names of the variables separated by white
space consisting of one or more blanks or tabs; these names are valid R
variable names, and in particular must not contain blanks. If the first entry
in each line of the data file is to provide row names for the data frame, then
there is one fewer variable name than columns of data; otherwise, there is
one variable name for each column.
Each subsequent line contains data for one observation or case, with the
data values separated by white space. The data values need not appear in
the same place in each line as long as the number of values and their order
are the same in all lines. Character data either contain no embedded blanks
(our preference) or are enclosed in single or double quotes. Thus, for
example, white.collar , ’white collar’ , and "white collar" are valid
character data values, but white collar without quotes is not acceptable
and will be interpreted as two separate values, white and collar .
Character and logical data are converted to factors on input. You may avoid
this conversion by specifying the argument as.is=TRUE to read.table , but
representing categorical data as factors is generally desirable.
Many spreadsheet programs and other programs create plain-text files with
data values separated by commas—so-called comma-delimited or comma-
separated files. Supplying the argument sep="," to read.-table
accommodates this form of data. Alternatively, the function read.csv , a
convenience interface to read.table that sets header=TRUE and sep="," by
default, may be used to read comma-separated data files. In comma-
delimited data, blanks may be included in unquoted character strings, but
commas may not be included.
Missing data appear explicitly, preferably encoded by the characters NA (Not
Available); in particular, missing data are not left blank. There is, therefore,
the same number of data values in each line of the input file, even if some
of the values represent missing data. In a comma-separated file, however,
missing values may be left blank. If characters other than NA are used to
encode missing data, and if it is inconvenient to replace them in an editor,
then you may specify the missing-data code in the na.strings argument to
read.table . For example, both SAS and SPSS recognize the period (. ) as
an input missing-data indicator; to read a file with periods encoding missing
data, use na.strings="." . Different missing-data codes can be supplied
for different variables by specifying na.strings as a vector. For more
details, see the online documentation for read.table .
This specification is more rigid than it needs to be, but it is clear and usually is
easy to satisfy. Most spreadsheet, database, and statistical programs are capable
of producing plain-text files of this format, or produce files that can be put in this
form with minimal editing. Use a plain-text editor (such as Windows Notepad or
a programming editor) to edit data files. If you use a word-processing program
(such as Word or OpenOffice Writer), be careful to save the file as a plain-text
file; read.table cannot read data saved in the default formats used by word-
processing programs.
We use the data in the file Prestige.txt to illustrate.1 This data set, with
occupations as observations, is similar to the Duncan occupational-prestige data
employed as an example in the previous chapter. Here are a few lines of the data
file (recall that the ellipses represent omitted lines—there are 102 occupations in
all):
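The data are read with a read.table command of roughly the following form (the path D:/data/Prestige.txt is a hypothetical location in the local file system):
> Prestige <- read.table("D:/data/Prestige.txt", header=TRUE)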
The first argument to read.table specifies the location of the input file, in this
case the location in our local file system where we placed Prestige.txt. As we
will see, with an active Internet connection, it is also possible to read a data file
from a URL (web address). We suggest naming the data frame for the file from
which the data were read and to begin the name of the data frame with an
uppercase letter: Prestige .
Even though we are using R on a Windows system, the directories in the file
system are separated by a / (forward slash) rather than by the standard Windows
\ (back slash), because the back slash serves as a so-called escape character in an
R character string, indicating that the next character has a special meaning: For
example, \n represents a new-line character (i.e., go to the beginning of the next
line), while \t is the tab character. Such special characters can be useful in
creating printed output. A (single) back slash may be entered in a character string
as \\.
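For example, cat prints a doubled back slash in a character string as a single back slash and \n as a line break:
> cat("D:\\data\\mydata.txt\n")
D:\data\mydata.txt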
You can avoid having to specify the full path to a data file if you first tell R
where to look for data by specifying the working directory. For example,
setwd("D:/data") sets the working directory to D:\data, and the command
read.table("mydata.txt") would then look for the file mydata.txt in the
D:\data directory. On Windows or Mac OS X, the command
setwd(choose.dir()) allows the user to select the working directory
interactively.2 The command getwd() displays the working directory.
Under Windows or Mac OS X, you can also browse the file system to find the
file you want using the file.choose function:
> Prestige <- read.table(file.choose(), header=TRUE)
The file.choose function returns the path to the selected file as a character
string, which is then passed to read.table .
The second argument, header=TRUE , tells read.table that the first line in the
file contains variable names. It is not necessary to specify header= TRUE when,
as here, the initial variable-names line has one fewer entry than the data lines
that follow. The first entry on each data line in the file is an observation label.
Nevertheless, it does not hurt to specify header=TRUE , and getting into the habit
of doing so will save you trouble when you read a file with variable names but
without row names.
Error in scan(file = file, what = what, sep = sep, quote = quote, skip = 0, : line 34 did
not have 7 elements
Because of the error, the data frame Prestige has not been created, and the error
message tells us that at least one line in the file has the wrong number of
elements. We can use the count.fields function to discover whether there are
other errors as well, and, if there are, to determine their location:
> (counts <- count.fields(file))
> which(counts != 7)
[1] 35 54 64 68
Once we know the location of the errors, it is simple to fix the input file in a text
editor that keeps track of line numbers. Because the data file has column names
in its first line, the 34th data line is the 35th line in the file.
FIXED-FORMAT DATA
Although they are no longer common, you may encounter fixed-format data
files, in which the data values are not separated by white space or commas and in
which variables appear in fixed positions within each line of the file. To illustrate
the process of reading this kind of data, we have created a fixed-format version
of the Canadian occupational-prestige data set, which we placed in the file
Prestige-fixed.txt. The file looks like this:
gov.administrators 13.111235111.1668.81113prof
general.managers 12.2625879 4.0269.11130prof
accountants 12.77 927115.7063.41171prof
…
typesetters 10.00 646213.5842.29511bc
bookbinders 8.55 361770.8735.29517bc
The first 25 characters in each line are reserved for the occupation name, the
next five spaces for the education value, the next five for income , and so on.
Most of the data values run together, making the file difficult to decipher. If you
have a choice, fixed-format input is best avoided. The read.fwf (read fixed-
width-format) function can be used to read this file into a data frame:
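A sketch of such a call, with the column widths taken from the layout described above and column names matching the Canadian occupational-prestige variables (the name occupation for the row-name column is an assumption):
> Prestige <- read.fwf("Prestige-fixed.txt",
+     col.names=c("occupation", "education", "income", "women",
+         "prestige", "census", "type"),
+     row.names="occupation",
+     widths=c(25, 5, 5, 5, 4, 4, 4))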
Recall that everything to the right of the # (pound sign) is a comment and is
ignored by the R interpreter. For Excel 2007 files, use odbcConnectExcel2007
in place of odbcConnectExcel .
The variable name "F1" was supplied automatically for the first column of the
spreadsheet. We prefer to use this column to provide row names for the
Prestige data frame:
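One way to do this, assuming the imported data frame has the automatically named column F1 as its first column:
> rownames(Prestige) <- Prestige$F1
> Prestige$F1 <- NULL # remove the F1 column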
1. A data set can be read into the global environment (i.e., working memory)
via the data command.5 For example, to read the Prestige data frame from
the car package into memory:
> data(Prestige, package="car")
> head(Prestige) # first 6 rows
Had the car package already been loaded, the package argument to data
could have been omitted.
2. Many packages allow data sets to be accessed directly, via the so-called
lazy-data mechanism, without explicitly reading the data into memory. For
example,
> library(car)
> head(Prestige)
CLEANING UP
We have defined several variables in the course of this section, some of which
are no longer needed, so it is time to clean up:
> objects()
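Writing a data frame to a plain-text file is the job of write.table; a minimal sketch, using the Duncan data frame as an illustration and a hypothetical output path:
> write.table(Duncan, "D:/data/Duncan.txt")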
By default, row labels and variable names are included in the file, data values are
separated by blanks, and all character strings are in quotes, whether or not they
contain blanks. This default behavior can be changed—see ?write.table .
The foreign package also includes some functions for exporting R data to a
variety of file formats: Consult the documentation for the foreign package,
help(package="foreign") .
1. Attach the data frame to the search path via the attach command, making
the variables in the data frame directly visible to the R interpreter.
2. Access the variables in the data frame as they are required without attaching
the data frame.
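For example, the first strategy might look like this sketch, attaching the Duncan data frame from the car package and then examining the search path:
> library(car)
> attach(Duncan)
> search()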
Now if we type the name of the variable prestige at the command prompt, R
will look first in the global environment (.GlobalEnv ), the region of memory in
which R stores working data. If no variable named prestige is found in the
global environment, then the data frame Duncan will be searched, because it was
placed by the attach command in the second position on the search list. There is
in fact no variable named prestige in the working data, but there is a variable
by this name in the Duncan data frame, and so when we type prestige , we
retrieve the prestige variable from Duncan , as we may readily verify:
> prestige
> Duncan$prestige
Typing Duncan$prestige directly extracts the column named prestige from the
Duncan data frame.6
Had prestige not been found in Duncan , then the sequence of attached
packages would have been searched in the order shown, followed by a special
list of objects (Autoloads ) that are loaded automatically as needed (and which
we will subsequently ignore), and finally the R base package. The packages in
the search path shown above, beginning with the stats package, are part of the
basic R system and are loaded by default when R starts up.
Suppose, now, that we attach the Prestige data frame to the search path. The
default behavior of the attach function is to attach a data frame in the second
position on the search path, after the global environment:
> attach(Prestige)
women
> search()
Consequently, the data frame Prestige is attached before the data frame Duncan
; and if we now simply type prestige , then the prestige variable in Prestige
will be located before the prestige variable in Duncan is encountered:
> prestige
Calling detach with no arguments detaches the second entry in the search path
and, thus, produces the same effect as detach(Prestige) .
Now that Prestige has been detached, prestige again refers to the variable
by that name in the Duncan data frame:
> prestige
The working data are the first item in the search path, and so globally defined
variables shadow variables with the same names anywhere else along the path.
This is why we use an uppercase letter at the beginning of the name of a data
frame. Had we, for example, named the data frame prestige rather than
Prestige , then the variable prestige within the data frame would have been
shadowed by the data frame itself. To access the variable would then require a
potentially confusing expression, such as prestige$prestige .
Our focus here is on manipulating data, but it is worth mentioning that R
locates functions in the same way that it locates data. Consequently, functions
earlier on the path can shadow functions of the same name later on the path.
In Section 1.1.3, we defined a function called myMean , avoiding the name
mean so that the mean function in the base package would not be shadowed. The
base function mean can calculate trimmed means as well as the ordinary
arithmetic mean; for example,
> mean(prestige)
[1] 47.68889
> mean(prestige, trim=0.1)
[1] 47.2973
Specifying mean(prestige, trim=0.1) removes the top and bottom 10% of the
data, calculating the mean of the middle 80% of observations. Trimmed means
provide more efficient estimates of the center of a heavy-tailed distribution—for
example, when outliers are present; in this example, however, trimming makes
little difference.
Suppose that we define our own mean function, making no provision for
trimming:
> mean <- function(x){
+     warning("the mean function in the base package is shadowed")
+     sum(x)/length(x)
+ }
The first line in our mean function prints a warning message. The purpose of the
warning is simply to verify that our function executes in place of the mean
function in the base package. Had we carelessly shadowed the standard mean
function, we would not have politely provided a warning:
> mean(prestige)
[1] 47.68889
Warning message:
In mean(prestige):the mean function in the base package is shadowed
The essential point here is that because our mean function resides in the global
environment, it is encountered on the search path before the mean function in the
base package. Shadowing the standard mean function is inconsequential as long
as our function is equivalent; but if, for example, we try to calculate a trimmed
mean, our function does not work:
> mean(prestige, trim=0.1)
[1] 47.68889
Specifying mean <- mean(prestige) causes our mean function to calculate the
mean of prestige and then store the result in a variable called mean, which has
the effect of destroying our mean function (and good riddance to it). The variable
mean in the working data does not, however, shadow the function mean in the
base package:
> mean(prestige, trim=0.1)
[1] 47.2973
Variables in attached data frames may mask other objects, and variables in
attached data frames themselves may be masked by objects of the same
name—for example, in the global environment.
Attaching a data frame makes a copy of the data frame; the attached version
is a snapshot of the data frame at the moment when it is attached. If changes
are made to the data frame, these are not reflected in the attached version of
the data. Consequently, after making such a change, it is necessary to detach
and reattach the data frame. We find this procedure awkward, and
inexperienced users of R may not remember to detach and reattach the data,
leading to confusion about the current state of the attached data.
We have observed that new users of R tend not to detach data frames after
they are done with them. Often they will attach multiple versions of a data
frame in the same session, which potentially results in confusion.
There are several strategies that we can use to avoid attaching a data frame: for
example, we can index the data frame directly with the $ operator, or evaluate
expressions—including calls to statistical-modeling functions such as lm—within
the data frame using the with function:
> mean(Duncan$prestige)
[1] 47.68889
> with(Duncan, mean(prestige))
[1] 47.68889
> with(Duncan, lm(prestige ~ income + education))
Call:
lm(formula = prestige ~ income + education)
Coefficients:
There are relatively profound statistical issues concerning how best to use
available information when missing data are encountered (see, e.g., Little
and Rubin, 2002; and Schafer, 1997). We will ignore these issues here,
except to remark that R is well designed to make use of sophisticated
approaches to missing data.7
There are intellectually trivial but often practically vexing mechanical
issues concerning computing with missing data in R. These issues, which
are the subject of the present section, arise partly because of the diverse
data structures and kinds of functions available simultaneously to the R
user. Similar issues arise in all statistical software, however, although they
may sometimes be disguised.
> dim(Freedman)
[1] 110 4
These data, on 110 U.S. metropolitan areas, were originally from the 1970
Statistical Abstract of the United States and were used by Freedman (1975) as
part of a wide-ranging study of the social and psychological effects of crowding.
Freedman argues, by the way, that high density tends to intensify social
interaction, and thus the effects of crowding are not simply negative. The
variables in the data set are population (total population, in thousands), nonwhite
(percent nonwhite), density (population per square mile), and crime (crime rate per 100,000).
Suppose, now, that we try to calculate the median density ; as we will see
shortly, the density values are highly positively skewed, so using the mean as a
measure of the center of the distribution would be a bad idea:
> median(Freedman$density)
[1] NA
R tells us that the median density is missing. This is the pedantically correct
answer: Several of the density values are missing, and consequently we cannot
in the absence of those values know the median, but this is probably not what we
had in mind when we asked for the median density .By setting the na.rm (NA -
remove) argument of median to TRUE , we instruct R to calculate the median of
the remaining, nonmissing values:
> median(Freedman$density, na.rm=TRUE)
[1] 412
Several other R functions that calculate statistical summaries, such as mean , var
(variance), sd (standard deviation), and quantile (quantiles), also have na.rm
arguments, but not all R functions handle missing data in this manner.
Most plotting functions simply ignore missing data. For example, to construct
a scatterplot of crime against density , including only the observations with
valid data for both variables, we enter
> with(Freedman, {
+ plot(density, crime)
+ identify(density, crime, row.names(Freedman))
+ })
Figure 2.1 Scatterplot of crime by population density for Freedman’s data. (a) Original density scale, with
a few high-density cities identified interactively with the mouse, and (b) log-density scale, showing linear
least-squares (broken) and lowess nonparametric-regression (solid) lines. Cases with one or both values
missing are silently omitted from both graphs.
The resulting graph, including several observations identified with the mouse,
appears in Figure 2.1a. Recall that we identify observations by pointing at them
with the mouse and clicking the left mouse button; exit from identify by
pressing the esc key in Mac OS X or, in Windows, by clicking the right mouse
button and selecting Stop. It is apparent that density is highly positively
skewed, making the plot very difficult to read. We would like to try plotting
crime against the log of density but wonder whether the missing data will spoil
the computation.8 The log function in R behaves sensibly, however: The result
has a missing entry wherever—and only where—there was a missing entry in the
argument:
> log(c(1, 10, NA, 100), base=10)
[1] 0 1 NA 2
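The revised scatterplot in Figure 2.1b can be drawn along these lines (a sketch):
> with(Freedman, plot(log(density, base=10), crime))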
This graph is much easier to read, and it now appears that there is a weak,
positive relationship between crime and density . We will address momentarily
how to produce the lines in the plot.
Statistical-modeling functions in R have a special argument, na.action ,
which specifies how missing data are to be handled; na.action is set to a
function that takes a data frame as an argument and returns a similar data frame
composed entirely of valid data (see Section 4.8.5). The default na.action is
na.omit , which removes all observations with missing data on any variable in
the computation. All the examples in this Companion use na.omit . An
alternative, for example, would be to supply an na.action that imputes the
missing values.
The prototypical statistical-modeling function in R is lm , which is described
extensively in Chapter 4. For example, to fit a linear regression of crime on the
log of density , removing observations with missing data on either crime or
density , enter the command
> lm(crime ~ log(density, base=10), data=Freedman)
Call:
lm(formula = crime ~ log(density, base = 10), data = Freedman)
Coefficients:
(Intercept) log(density, base = 10)
1297.3 542.6
The lm function returns a linear-model object; because the returned object was
not saved in a variable, the interpreter simply printed a brief report of the
regression. To plot the least-squares line on the scatterplot in Figure 2.1:
> abline(lm(crime ~ log(density, base=10), data=Freedman), lty="dashed")
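The logical vector good used below marks the observations with valid values of both crime and density; it can be constructed with the complete.cases function, roughly as follows:
> good <- with(Freedman, complete.cases(crime, density))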
We then use good to select the valid observations by indexing (a topic described
in Section 2.3.4). For example, it is convenient to use the lowess function to add
a nonparametric-regression smooth to our scatterplot (Figure 2.1b), but lowess
makes no provision for missing data:10
> with(Freedman,
+ lines(lowess(log(density[good], base=10), crime[good], f=1.0)))
By indexing the predictor density and response crime with the logical vector
good , we extract only the observations that have valid data for both variables.
The argument f to the lowess function specifies the span of the lowess smoother
—that is, the fraction of the data included in each local-regression fit; large
spans (such as the value 1.0 employed here) produce smooth regression curves.
Suppose, as is frequently the case, that we analyze a data set with a complex
pattern of missing data, fitting several statistical models to the data. If the
models do not all use exactly the same variables, then it is likely that they will be
fit to different subsets of nonmissing observations. Then if we compare the
models with a likelihood ratio test, for example, the comparison will be
invalid.11
To avoid this problem, we can first use na.omit to filter the data frame for
missing data, including all the variables that we intend to use in our data
analysis. For example, for Freedman’s data, we may proceed as follows,
assuming that we want subsequently to use all four variables in the data frame:
> Freedman.good <- na.omit(Freedman)
> head(Freedman.good) # first 6 rows
A note of caution: Filtering for missing data on variables that we do not intend
to use can result in discarding data unnecessarily. We have seen cases where
students and researchers inadvertently and needlessly threw away most of their
data by filtering an entire data set for missing values, even when they intended to
use only a few variables in the data set.
Finally, a few words about testing for missing data in R: A common error is to
assume that one can check for missing data using the == (equals) operator, as in
> NA == c(1, 2, NA, 4)
[1] NA NA NA NA
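The proper test uses the is.na function, which returns TRUE where a value is missing and FALSE otherwise:
> is.na(c(1, 2, NA, 4))
[1] FALSE FALSE  TRUE FALSE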
For example, to count the number of missing values in the Freedman data frame:
> sum(is.na(Freedman))
[1] 20
This command relies on the automatic coercion of the logical values TRUE and
FALSE to one and zero, respectively.
Near the beginning of this chapter, we entered data from Fox and Guyer’s
(1978) experiment on anonymity and cooperation into the global variables
cooperation , condition , and sex .12 The latter two variables are character
vectors, as we verify for condition :
> condition
We can confirm that this is a vector of character values using the predicate
function is.character , which tests whether its argument is of mode
"character" :
> is.character(condition)
[1] TRUE
After entering the data, we defined the data frame Guyer , which also contains
variables named cooperation , condition , and sex . We will remove the global
variables and will work instead with the data frame:
> remove(cooperation, condition, sex)
> is.character(Guyer$condition)
[1] FALSE
> is.factor(Guyer$condition)
[1] TRUE
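The command creating perc.coop, the percentage of cooperative choices out of the 120 choices made by each group, would take roughly this form:
> perc.coop <- 100*Guyer$cooperation/120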
The new variable perc.coop resides in the working data, not in the Guyer data
frame. It is generally advantageous to add new variables such as this to the data
frame from which they originate: Keeping related variables together in a data
frame avoids confusion, for example.
> Guyer$perc.coop <- 100*Guyer$cooperation/120
> head(Guyer) # first 6 rows
A similar procedure may be used to modify an existing variable in a data
frame. The following command, for example, replaces the original cooperation
variable in Guyer with the logit (log-odds) of cooperation:
> Guyer$cooperation <- with(Guyer, log(perc.coop/(100 - perc.coop)))
> head(Guyer)
The transform function can be used to create and modify several variables in
a data frame at once. For example, if we have a data frame called Data with
variables named a , b , and c , then the command
> Data <- transform(Data, c=-c, asq=a^2, a.over.b=a/b)
replaces Data by a new data frame in which the variables a and b are included
unchanged, c is replaced by -c , and two new variables are added—asq , with
the squares of a , and a.over.b , with the ratios of a to b .
Transforming numeric data is usually a straightforward operation—simply
using mathematical operators and functions. Categorizing numeric data and
recoding categorical variables are often more complicated matters. Several
functions in R are employed to create factors from numeric data and to
manipulate categorical data, but we will limit our discussion to three that we find
particularly useful: (1) the standard R function cut , (2) the recode function in
the car package, and (3) the standard ifelse function.
The cut function dissects the range of a numeric variable into class intervals,
or bins. The first argument to the function is the variable to be binned; the
second argument gives either the number of equal-width bins or a vector of cut
points at which the division is to take place. For example, to divide the range of
perc.coop into four equal-width bins, we specify
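A sketch of the commands (the name coop.4 for the resulting factor is taken from the discussion below):
> Guyer$coop.4 <- cut(Guyer$perc.coop, 4)
> summary(Guyer$coop.4)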
R responds by creating a factor, the levels of which are named for the end points
of the bins. In the example above, the first level includes all values with
Guyer$perc.coop greater than 22.5 (which is slightly smaller than the minimum
value of Guyer$perc.coop ) and less than or equal to 33.3, the cut point between
the first two levels. Because perc.coop is not uniformly distributed across its
range, the several levels of coop.4 contain different numbers of observations.
The output from the summary function applied to a factor gives a one-
dimensional table of the number of observations in each level of the factor.
Suppose, alternatively, that we want to bin perc.coop into three levels
containing roughly equal numbers of observations14 and to name these levels
"low" , "med" , and "high" ; we may proceed as follows:
The quantile function is used to locate the cut points. Had we wished to divide
perc.coop into four groups, for example, we would simply have specified
different quantiles, c(0, .25, .5, .75, 1) , and of course supplied four values
for the labels argument.
The recode function in the car package, which is more flexible than cut , can
also be used to dissect a quantitative variable into class intervals: for example,
> (Guyer$coop.2 <- recode(Guyer$perc.coop, "lo:50=1; 50:hi=2"))
[1] 1 2 1 1 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1
The sample function is used to pick a random sample of 20 rows in the data
frame, selecting 20 random numbers without replacement from one to the
number of rows in Womenlf ; the numbers are placed in ascending order by the
sort function.15
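A sketch of the idiom (the name rows for the vector of sampled row numbers is an assumption):
> rows <- sort(sample(nrow(Womenlf), 20)) # 20 random row numbers, in order
> Womenlf[rows, ]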
We use the set.seed function to specify the seed for R’s pseudo-random
number generator, ensuring that if we repeat the sample command, we will
obtain the same sequence of pseudo-random numbers. Otherwise, the seed of the
random-number generator will be selected unpredictably based on the system
clock when the first random number is generated in an R session. Setting the
random seed to a known value before a random simulation makes the result of
the simulation reproducible. In serious work, we generally prefer to start with a
known but randomly selected seed, as follows:
> (seed <- sample(2^31 - 1, 1))
[1] 974373618
> set.seed(seed)
The number 2³¹ − 1 is the largest integer representable as a 32-bit binary number
on most of the computer systems on which R runs (see Section 2.6.2).
The data in Womenlf originate from a social survey of the Canadian population
conducted in 1977 and pertain to married women between the ages of 21 and 30,
with the variables defined as follows:
The first two examples yield identical results, with the second example
illustrating the use of else . To verify that all the values in working and
working.alt are the same, we use the all function along with the element-
wise comparison operator == (equals).
In the third example, the factor fulltime is created, indicating whether a
woman who works outside the home works full-time or part-time; fulltime
is NA (missing) for women who do not work outside the home.
The fourth and final example illustrates how values that are not recoded
(here Atlantic , Quebec , and Ontario in the factor region )are simply
carried over to the result.
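Based on these descriptions, the four recode commands would look roughly like the following sketch (the level names of partic and region follow the Womenlf data in the car package; the name region.4 and the target level West in the last command are assumptions):
> Womenlf$working <- recode(Womenlf$partic,
+     " c('parttime', 'fulltime')='yes'; 'not.work'='no' ")
> Womenlf$working.alt <- recode(Womenlf$partic,
+     " c('parttime', 'fulltime')='yes'; else='no' ")
> Womenlf$fulltime <- recode(Womenlf$partic,
+     " 'fulltime'='yes'; 'parttime'='no'; 'not.work'=NA ")
> Womenlf$region.4 <- recode(Womenlf$region,
+     " c('Prairie', 'BC')='West' ")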
The standard R ifelse command (discussed further in Section 8.3.1) can also
be used to recode data. For example,
> Womenlf$working.alt.2 <− factor(with(Womenlf,
+ ifelse(partic %in% c("parttime", "fulltime"), "yes", "no")))
> with(Womenlf, all.equal(working, working.alt.2))
[1] TRUE
> with(Womenlf, all.equal(fulltime, fulltime.alt))
[1] TRUE
The first argument to ifelse is a logical vector, containing the values TRUE and
FALSE ; the second argument gives the value assigned where the first argument is
TRUE ; and the third argument gives the value assigned when the first argument is
FALSE . We used cascading ifelse commands to create the variable
fulltime.alt , assigning the value "yes" to those working "fulltime" , "no"
to those working "parttime" , and NA otherwise (i.e., where partic takes on the
value "not.work" ).
We employed the matching operator, %in% , which returns a logical vector
containing TRUE if a value in the vector before %in% is a member of the vector
after the symbol and FALSE otherwise. See help("%in%") for more information;
the quotes are required because of the % character. We also used the function
all.equal to test the equality of the alternative recodings. When applied to
numeric variables, all.equal tests for approximate equality, within the
precision of floating-point computations (discussed in Section 2.6.2); more
generally, all.equal not only reports whether two objects are approximately
equal but, if they are not equal, provides information on how they differ.
An alternative to the first test is all(working == working.alt.2) , but this
approach won’t work properly in the second test because of the missing data:
> with(Womenlf, all(fulltime == fulltime.alt))
[1] NA
We once more clean up before proceeding by removing the copy of Womenlf that
was made in the working data:
> remove(Womenlf)
2.3.1 MATRICES
A matrix in R is a two-dimensional array of elements all of which are of the
same mode—for example, real numbers, integers, character strings, or logical
values. Matrices can be constructed using the matrix function, which reshapes
its first argument into a matrix with the specified number of rows (the second
argument) and columns (the third argument): for example,
> (A <- matrix(1:12, nrow=3, ncol=4))
A matrix is filled by columns, unless the optional argument byrow is set to TRUE .
The second example illustrates that if there are fewer elements in the first
argument than are required, then the elements are simply recycled, extended by
repetition to the required length.
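The second example, which produced the 4 × 3 character matrix B used below, might look like this sketch:
> (B <- matrix(c("a", "b", "c"), nrow=4, ncol=3, byrow=TRUE))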
A defining characteristic of a matrix is that it has a dim (dimension) attribute
with two elements: the number of rows and the number of columns:16
> dim(A)
[1] 3 4
> dim(B)
[1] 4 3
As we have seen before, a vector is a one-dimensional array of numbers. For
example, here is a vector containing a random permutation of the first 10
integers:
> set.seed(54321) # for reproducibility
> (v <- sample(10, 10)) # permutation of 1 to 10
[1] 5 10 2 8 7 9 1 4 3 6
A vector has a length attribute but not a dim attribute:
> length(v)
[1] 10
> dim(v)
NULL
R often treats vectors differently than matrices. You can turn a vector into a one-
column matrix using the as.matrix coercion function:
> as.matrix(v)
[,1]
[1,] 5
[2,] 10
[3,] 2
[4,] 8
[5,] 7
[6,] 9
[7,] 1
[8,] 4
[9,] 3
[10,] 6
2.3.2 ARRAYS
Higher-dimensional arrays of homogeneous elements are encountered much
less frequently than matrices. If needed, higher-dimensional arrays may be
created with the array function; here is an example generating a three-
dimensional array:
> (array.3 <- array(1:24, dim=c(4, 3, 2))) # 4 rows, 3 columns, 2 layers
,,1
,,2
The order of the dimensions is row, column, and layer. The array is filled with
the index of the first dimension changing most quickly—that is, row, then
column, then layer.
2.3.3 LISTS
Lists are data structures composed of potentially heterogeneous elements. The
elements of a list may be complex data structures, including other lists. Because
a list can contain other lists as elements, each of which can also contain lists,
lists are recursive structures. In contrast, the elements of an ordinary vector—
such as an individual number, character string, or logical value—are atomic
objects.
Here is an example of a list, constructed with the list function:
> (list.1 <- list(mat.1=A, mat.2=B, vec=v)) # a 3-item list
This list contains a numeric matrix, a character matrix, and a numeric vector. We
named the arguments in the call to the list function; these are arbitrary names
that we chose, not standard arguments to list . The argument names supplied
became the names of the list elements.
Because lists permit us to collect related information regardless of its form,
they provide the foundation for the class-based S3 object system in R.17 Data
frames, for example, are lists with some special properties that permit them to
behave somewhat like matrices.
2.3.4 INDEXING
A common operation in R is to extract some of the elements of a vector,
matrix, array, list, or data frame by supplying the indices of the elements to be
extracted. Indices are specified between square brackets, “[ ” and “] ”. We have
already used this syntax on several occasions, and it is now time to consider
indexing more systematically.
INDEXING VECTORS
A vector can be indexed by a single number or by a vector of numbers;
indeed, indices may be specified out of order, and an index may be repeated to
extract the corresponding element more than once:
> v
[1] 5 10 2 8 7 9 1 4 3 6
> v[2]
[1] 10
> v[c(4, 2, 6)]
[1] 8 10 9
> v[c(4, 2, 4)]
[1] 8 10 8
> v[-c(2, 4, 6, 8, 10)] # negative indices suppress elements
[1] 5 2 7 1 3
If a vector has a names attribute, then we can also index the elements by
name:18
> names(v) <− letters[1:10]
> v
a b c d e f g h i j
5 10 2 8 7 9 1 4 3 6
> v[c("f", "i", "g")]
f i g
9 3 1
> v[c("a", "c", "g", "h", "i")]
a c g h i
5 2 1 4 3
Any of these forms of indexing may be used on the left-hand side of the
assignment operator to replace the elements of a vector—an unusual and
convenient feature of the R language: for example,
> (vv <- v) # make copy of v
a b c d e f g h i j
5 10 2 8 7 9 1 4 3 6
> vv[c(1, 3, 5)] <- c(1, 2, 3)
> vv
a b c d e f g h i j
1 10 2 8 3 9 1 4 3 6
> vv[c("b", "d", "f", "h", "j")] <- NA
> vv
a b c d e f g h i j
1 NA 2 NA 3 NA 1 NA 3 NA
> remove(vv)
INDEXING MATRICES AND ARRAYS
Matrices are indexed with two subscripts, one for the rows and one for the
columns, each of which may be a vector: for example,
> A[c(1, 2), c(2, 3)] # rows 1 and 2, columns 2 and 3
     [,1] [,2]
[1,]    4    7
[2,]    5    8
> A[2, 3] # the element in row 2, column 3
[1] 8
> A[c(1, 2), 2] # rows 1 and 2, column 2
[1] 4 5
The second example above, A[2, 3] , returns a single-element vector rather than
a 1 × 1 matrix; likewise, the third example, A[c(1, 2), 2] , returns a vector
with two elements rather than a 2 × 1 matrix. More generally, in indexing a
matrix or array, dimensions of extent one are automatically dropped. In
particular, if we select elements in a single row or single column of a matrix,
then the result is a vector, not a matrix with a single row or column, a convention
that will occasionally give an R programmer headaches. We can override this
default behavior with the argument drop=FALSE :
> A[,1]
[1] 1 2 3
> A[ , 1, drop=FALSE]
[,1]
[1,] 1
[2,] 2
[3,] 3
In both of these examples, the row index is missing and is therefore taken to be
all rows of the matrix.
Negative indices, row or column names (if they are defined), and logical
vectors of the appropriate length may also be used to index a matrix or a higher-
dimensional array:
> A[ , -c(1, 3)] # omit columns 1 and 3
Used on the left of the assignment arrow, we may replace indexed elements in
a matrix or array:
> (AA <- A) # make a copy of A
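For instance, we could set the first row of the copy to zeros (a minimal sketch):
> AA[1, ] <- 0 # set all elements of row 1 to 0
> AA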
> remove(AA)
INDEXING LISTS
Lists may be indexed in much the same way as vectors, but some special
considerations apply. Recall the list that we constructed earlier:
> list.1[c(2, 3)] # elements 2 and 3
$mat.2
$vec
[1] 5 10 2 8 7 9 1 4 3 6
> list.1[2] # a one-element list
$mat.2
Even when we select a single element of the list, as in the last example, we get a
single-element list rather than (in this case) a matrix. To extract the matrix in
position 2 of the list, we can use double-bracket notation:
> list.1[[2]] # returns a matrix
The distinction between a one-element list and the element itself is subtle but
important, and it can trip us up if we are not careful.
If the list elements are named, then we can use the names in indexing the list:
> list.1["mat.1"] # produces a one-element list
$mat.1
Used on the left-hand side of the assignment arrow, dollar-sign indexing allows
us to replace list elements, define new elements, or delete an element (by
assigning NULL to the element):
$title
NULL
[1] 2
> list.1[["vec"]][3:5]
[1] 2 8 7
NULL
Because no row names were specified when we entered the data, the row names
are simply the character representation of the row numbers. Indexing Guyer as a
matrix:
We require with in the last example to access the variables sex and condition ,
because the data frame Guyer is not attached to the search path. More
conveniently, we can use the subset function to perform this operation:
> subset(Guyer, sex == "female" & condition == "public")
> Guyer["cooperation"] # a one-column data frame; first rows shown
cooperation
1 -0.3708596
2 0.1335314
3 -0.8079227
4 -0.2682640
5 0.2682640
6 -0.2006707
Specifying Guyer["cooperation"] returns a one-column data frame rather than
a vector.
As has become our habit, we clean up before continuing:
> remove(A, B, v, array.3, list.1)
2.4 Manipulating Character Data
One of the most underappreciated capabilities of R is its facility in handling text.
Indeed, for many applications, R is a viable alternative to specialized text-
processing tools, such as the PERL scripting language and the Unix utilities sed,
grep, and awk. Most of the text-processing functions in R make use of so-called
regular expressions for matching text in character strings. In this section, we
provide a brief introduction to manipulating character data in R, primarily by
example. More complete information may be found in the online help for the
various text-processing functions; in ?regexp , which describes how regular
expressions are implemented in R; and in the sources cited at the end of the
chapter.
We’ll turn to the familiar “To Be or Not To Be” soliloquy from Shakespeare’s
Hamlet, in the plain-text file Hamlet.txt, as a source of examples. We begin by
using the readLines function to read the lines of the file into a character vector,
one line per element:
> file <- "http://socserv.socsci.mcmaster.ca/jfox/books/Companion/data/Hamlet.txt"
> Hamlet <- readLines(file)
> length(Hamlet) # number of lines
[1] 35
> nchar(Hamlet) # number of characters per line
[1] 42 41 44 42 43 37 46 42 40 50 47 43 39 36 48 49 44 38 41 38
[21] 43 38 44 41 37 43 39 44 37 47 40 42 44 39 26
> sum(nchar(Hamlet)) # total number of characters
[1] 1454
The length function counts the number of character strings in the character
vector Hamlet —that is, the number of lines in the soliloquy—while the nchar
function counts the number of characters in each string—that is, in each line.
The paste function is useful for joining character strings into a single string.
For example, to join the first six lines:
> (lines.1_6 <- paste(Hamlet[1:6], collapse=" "))
[1] "To be, or not to be: that is the question: … to say we end"
Here, and elsewhere in this section, we’ve edited the R output where necessary
so that it fits properly on the page. Alternatively, we can use the strwrap
function to wrap the text (though this once again divides it into lines),
> strwrap(lines.1_6)
[1] "To be, or not to be: that is the question: Whether ’tis"
[2] "nobler in the mind to suffer The slings and arrows of"
[3] "outrageous fortune, Or to take arms against a sea of"
[4] "troubles, And by opposing end them? To die: to sleep; No"
[5] "more; and by a sleep to say we end"
and the substring function, as its name implies, to select parts of a character
string—for example, to select the characters 1 through 42:
> substring(lines.1_6, 1, 42)
[1] "To be, or not to be: that is the question: … end them"
[2] "To die: to sleep; No more; and by a sleep to say we end"
And we can divide the text into individual characters by splitting at the empty
string, "" :
> characters <- strsplit(lines.1_6, "")[[1]]
> length(characters) # number of characters
[1] 254
[1] "T" "o" " " "b" "e" "," " " "o" "r" " " "n" "o" "t" " " "t"
[16] "o" " " "b" "e" ":"
Let us turn now to the whole soliloquy, dividing the text into words at spaces
(a strategy that, as we have seen, is flawed):
> all.lines <- paste(Hamlet, collapse=" ")
> words <- strsplit(all.lines, " ")[[1]]
> length(words) # number of words
[1] 277
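A command along the following lines strips punctuation from the words, using sub with a character class of punctuation marks and the empty string as the replacement text (a sketch):
> words <- sub("[,;:.!?]", "", words)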
The sub function takes three required arguments: (1) a regular expression
matching the text to be replaced, (2) the replacement text (here, the empty
string), and (3) a character vector in which the replacement is to be performed. If
the pattern in the regular expression matches more than one substring in an
element of the third argument, then only the first occurrence is replaced. The
gsub function behaves similarly, except that all occurrences of the pattern are
replaced:
> sub("me", "you", "It’s all, ’me, me, me’ with you!")
> gsub("me", "you", "It’s all, ’me, me, me’ with you!")
Returning to the soliloquy, suppose that we want to determine and count the
different words that Shakespeare used. A first step is to use the tolower function
to change all the characters to lowercase, so that, for example, "the" and "The"
aren’t treated as distinct words:
> head(words <- tolower(words), 20) # first 20 words
> length(unique(words)) # number of distinct words
[1] 167
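The word counts and the alphabetized list of unique words might be obtained along these lines (a sketch; the name word.counts reappears in the cleanup at the end of the section):
> word.counts <- sort(table(words), decreasing=TRUE) # counts, most to least frequent
> head(word.counts, 10)
> sort(unique(words)) # unique words in alphabetical order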
We used the table command to obtain the word counts, unique to remove
duplicate words, and sort to order the words from the most to the least used and
to arrange the unique words in alphabetical order. The alphabetized words reveal
a problem, however: We’re treating the hyphen (“- ”) as if it were a word.
The function grep may be used to search character strings for a regular
expression, returning the indices of the strings in a character vector for which
there is a match. For our example,
> grep("-", words)
[1] 55 262
We found matches in two character strings: the valid, hyphenated word "heart-
ache" and the spurious word "-" . We would like to be able to differentiate
between the two, because we want to discard the latter from our vector of words
but retain the former. We can do so as follows:
> grep("^-", words)
[1] 262
[1] 2
[1] 1 2 3 4
Here, the hyphen before the closing bracket represents itself and will match a
minus sign.
Used after an opening square bracket, the meta-character “^ ” represents
negation, and so, for example, to select elements of the vector data that do not
contain any numerals, hyphens, or periods:
> data[grep("^[^0-9.-]*$", data)]
Parentheses are used for grouping in regular expressions, and the bar character
(| ) means or. To find all the articles in the soliloquy, for example:
> words[grep("^(the|a|an)$", words)]
To see why the parentheses are needed here, try omitting them.
What happens if we want to treat a meta-character as an ordinary character?
We have to escape the meta-character by using a back slash (\), and because the
back slash is the escape character for R as well as for regular expressions, we
must, somewhat awkwardly, double it: for example,
> grep("\\$", c("$100.00", "100 dollars"))
[1] 1
Cleaning up,
> remove(Hamlet, lines.1_6, characters, all.lines, word.counts, data)
2.5 Handling Large Data Sets in R*
R has a reputation in some quarters for choking on large data sets. This
reputation is only partly deserved. We will explain in this section why very large
data sets may pose a problem for R and suggest some strategies for dealing with
such data sets.
The most straightforward way to write functions in R is to access data that
reside in the computer’s main memory. This is true, for example, of the
statistical-modeling functions in the standard R distribution, such as the lm
function for fitting linear models (discussed in Chapter 4) and the glm function
for fitting generalized linear models (Chapter 5). The size of computer memory
then becomes a limitation on the size of statistical analyses that can be handled
successfully.
A computer with a 32-bit operating system can’t address more than 4 GB
(gigabytes) of memory, and depending on the system, not all this memory may
be available to R.20 Computers with 64-bit operating systems can address vastly
larger amounts of memory. The current version of R is freely available in 64-bit
implementations for Linux, Mac OS X and Windows systems. As memory gets
cheaper and 64-bit systems become more common, analyzing very large data
sets directly in memory will become more practical.
Handling large data sets in R is partly a matter of programming technique.
Although it is convenient in R to store data in memory, it is not necessary to do
so. There are many R programs designed to handle very large data sets, such as
those used in genomic research, most notably the R packages distributed by
Bioconductor (at www.bioconductor.org ). Similarly, the biglm package on
CRAN has functions for fitting linear and generalized linear models that work
serially on chunks of the data rather than all the data at once. Moreover, some
packages, such as biglm and the survey package for the analysis of data from
complex sample surveys, are capable of accessing data in database management
systems (see Section 2.5.3).
Inefficient programming, such as unnecessarily looping over the observations
of a data set or allocating very large sparse matrices consisting mostly of zeros,
can waste computer time and memory.21
[1] 10.59
> memory.limit()
[1] 1535
Thus, we’ve used less than 10% of the memory available to R. Next, we
generate values of the response variable y , according to a linear-regression
model with normally distributed errors that have a standard deviation of 10:
> y <- 10 + as.vector(X %*% rep(1, 100) + rnorm(100000, sd=10))
In this expression, %*% is the matrix multiplication operator (see Section 8.2),
and the coercion function as.vector is used to coerce the result to a vector,
because matrix multiplication of a matrix by a vector in R returns a one-column
matrix. The vector of population regression coefficients consists of ones
—rep(1, 100) —and the regression intercept is 10.
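The predictor matrix X itself is a 100,000 × 100 matrix of random values, which might be generated along these lines (a sketch; the seed is arbitrary):
> set.seed(123456)
> X <- matrix(rnorm(100000*100), nrow=100000, ncol=100)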
To fit the regression model to the data,
> system.time(m <- lm(y ~ X))
> table(yy)
yy
0 1
49978 50022
> object.size(D)
81611176 bytes
> write.table(D, "C:/temp/largeData.txt")
The data frame D consists of 100,000 rows and 102 columns, and uses about 80
MB of memory. To read the data back into memory from the ASCII file takes
about 3 minutes on our Windows Vista computer:
> system.time(DD <- read.table("C:/temp/largeData.txt", header=TRUE))
The read.table function is slow because it has to figure out whether data
should be read as numeric variables or as factors. To determine the class of each
variable, read.table reads all the data in character form and then examines each
column, converting the column to a numeric variable or a factor, as required. We
can make this process considerably more efficient by explicitly telling
read.table the class of each variable, via the colClasses argument:
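A sketch of such a call, assuming that the first field in each line holds the row name, the next 101 columns (the Xs and y) are numeric, and the final column (yy) is a factor:
> system.time(DD <- read.table("C:/temp/largeData.txt", header=TRUE,
+     colClasses=c("character", rep("numeric", 101), "factor")))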
Reading the data now takes about 30 seconds. The "character" class specified
for the first column is for the row name in each line of the file created by
write.table ; for example, the name of the first row is "1" (with the quotation
marks included). For more details about specifying the colClasses argument,
see ?read.table .
The save function allows us to store data frames and other R objects in a non-
human-readable format, which can be reread—using the load function— much
more quickly than an ASCII data file. For our example, the time to read the data
is reduced to only about 3 seconds:
> save(DD, file="C:/temp/DD.Rdata")
> remove(DD)
> system.time(load("C:/temp/DD.Rdata"))
> dim(DD)
A vector of integers:
> (x <- 1:5)
[1] 1 2 3 4 5
> length(x)
[1] 5
> class(x)
[1] "integer"
> mode(x)
[1] "numeric"
A floating-point vector:
> (y <- log(x))
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
> length(y)
[1] 5
> class(y)
[1] "numeric"
> mode(y)
[1] "numeric"
A character vector:
> (cv <- c("Abel", "Baker", "Charlie"))
> length(cv)
[1] 3
> class(cv)
[1] "character"
> mode(cv)
[1] "character"
A list:
> (lst <- list(x=x, y=y, cv=cv))
$x
[1] 1 2 3 4 5
$y
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
$cv
[1] "Abel" "Baker" "Charlie" >
length(lst)
[1] 3
> class(lst)
[1] "list"
> mode(lst)
[1] "list"
A matrix:
> (X <- cbind(x, y))
> length(X)
[1] 10
> class(X)
[1] "matrix"
> mode(X)
[1] "numeric"
A data frame:
> head(Duncan)
> length(Duncan)
[1] 4
> class(Duncan)
[1] "data.frame"
> mode(Duncan)
[1] "list"
A factor:
> Duncan$type
> length(Duncan$type)
[1] 45
> class(Duncan$type)
[1] "factor"
> mode(Duncan$type)
[1] "numeric"
A function:
> length(lm)
[1] 1
> class(lm)
[1] "function"
> mode(lm)
[1] "function"
A linear-model object:
> (mod <- lm(prestige ~ income + education, data=Duncan))
Call:
lm(formula = prestige ~ income + education, data = Duncan)
> length(mod)
[1] 12
> class(mod)
[1] "lm"
> mode(mod)
[1] "list"
> attributes(X)
$dim
[1] 5 2
$dimnames
$dimnames[[1]]
NULL
$dimnames[[2]]
[1] "x" "y"
> attributes(Duncan)
$names
[1] "type" "income" "education" "prestige"
$class
[1] "data.frame"
The output from str gives us a variety of information about the object: that it is
a data frame composed of four variables, including the factor type , with levels
"bc" , "wc" , and "prof" and initial values 2 2 2 2 2 2 2 2 3 2 ; and so on.23
Standard R functions exist to create data of different modes and for many
classes (constructor functions), to test for modes and classes (predicate
functions), and to convert data to a specific mode or class (coercion functions).
> numeric(5)
[1] 0 0 0 0 0
> factor(c("a", "b", "c", "c", "b", "a"))
[1] a b c c b a
Levels: a b c
There is also the general constructor function vector , whose first argument
specifies the mode of the object to be created and whose second argument
specifies the object’s length :
> vector(mode="numeric", length=5)
[1] 0 0 0 0 0
> vector(mode="list", length=2)
[[1]]
NULL
[[2]]
NULL
[1] TRUE
> is.numeric(fac)
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] 1 2 3 3 2 1
[1] 1 2 3 3 2 1
Levels: 1 2 3
> as.numeric(char)
[1] NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
The last example illustrates that coercion may cause information to be lost.
There is also a general coercion function, as :
> as(fac, "character")
[1] 1 2 3 3 2 1
2.6.2 PITFALLS OF FLOATING-POINT ARITHMETIC
The designers of R have paid a great deal of attention to the numerical
accuracy of computations in the language, but they have not been able to repeal
the laws of computer arithmetic. We usually need not concern ourselves with the
details of how numbers are stored in R; occasionally, however, these details can
do us in if we’re not careful.
Integers—that is, the positive and negative whole numbers and zero— are
represented exactly in R. There are, however, qualifications. Computers
represent numbers in binary form, and on a typical computer, R uses 32 binary
digits (bits) to store an integer, which usually means that integers from −2³¹ =
−2,147,483,648 to 2³¹ − 1 = +2,147,483,647 can be represented. Smaller and
larger integers can’t be represented. Moreover, just because a number looks like
an integer doesn’t mean that it is represented as an integer in R:
> is.integer(2)
[1] FALSE
> is.integer(2L)
[1] TRUE
[1] TRUE
[1] TRUE
> is.integer(2L*3L)
[1] TRUE
> is.integer(4L/2L)
[1] FALSE
> is.integer(sqrt(4L))
[1] FALSE
Furthermore, R silently coerces integers to floating-point numbers (described
below), when both kinds of numbers appear in an arithmetic expression:
> is.integer(2L + 3)
[1] FALSE
> sqrt(2)^2 == 2
[1] FALSE
> (sqrt(2))^2 - 2
[1] 4.440892e-16
> all.equal(sqrt(2)^2, 2)
[1] TRUE
Moreover, when objects are not essentially equal, all.equal tries to provide
useful information about how they differ:
> all.equal(2, 4)
[1] "Mean relative difference: 1"
R can also represent and compute with complex numbers, which occasionally
have statistical applications:
> (z <- complex(real=0, imaginary=1))
[1] 0+1i
> (w <- complex(real=2, imaginary=2))
[1] 2+2i
> w*z
[1] −2+2i
> w/z
[1] 2−2i
> 1/0
[1] Inf
> -1/0
[1] -Inf
> 0/0
[1] NaN
where Inf stands for ∞ (infinity) and NaN stands for “N ot aN umber.”
We do not bother to clean up at the end of the current chapter because we will
not save the R workspace. More generally, and as we mentioned in the Preface,
in this book we assume that each chapter represents an independent R session.
_____________________________
1You can download this file and other data files referenced in this chapter from the website for the book,
most conveniently with the carWeb function—see ?carWeb . The data sets are also available, with the same
names, in the car package.
2Alternatively, you can use the menus to select File → Change dir under Windows or Misc → Change
Working Directory under Mac OS X.
3The character string specifying the URL for the file Prestige.txt is broken across two lines to fit on the
page but in fact must be given as one long string. Similar line breaks appear later in the chapter.
4You can download and install this package from CRAN in the usual manner with the command
install.packages("RODBC") or via the Packages menu. The RODBC package can also be made to work
on other operating systems but not as conveniently as under Windows.
5Environments in R are discussed in Section 8.9.1.
6Information on indexing data frames is presented in Section 2.3.4.
7Notable packages for handling missing data include Amelia, mi, mice, and norm, which perform various
versions of multiple imputation of missing data.
8Transformations, including the log transformation, are the subject of Section 3.4.
9An alternative would have been to plot crime against density using a log axis for density :
plot(density, crime, log="x") . See Chapters 3 and 7 for general discussions of plotting data in R.
10The lowess function is described in Section 3.2.1.
11How statistical-modeling functions in R handle missing data is described in Section 4.8.5.
12Variables created by assignment at the command prompt are global variables defined in the working data.
13We did not have to create a new variable, say log.density <- log(density, 10), as one may be
required to do in a typical statistical package such as SAS or SPSS.
14Roughly equal numbers of observations in the three bins are the best we can do because n = 20 is not
evenly divisible by 3.
15If the objective here were simply to sample 20 rows from Womenlf , then we could more simply use the
some function in the car package, some(Womenlf, 20) , but we will reuse this sample to check on the
results of our recodes.
16More correctly, a matrix is a vector with a two-element dim attribute.
17Classes are described in Sections 2.6 and 8.7.
18The vector letters contains the 26 lowercase letters from "a" to "z" ; LETTERS similarly contains the
uppercase letters.
19Using the hyphen to represent ranges of characters can be risky, because character ranges can vary from
one locale to another—say between English and French. Thus, for example, we cannot rely on the range a-
zA-Z to contain all the alphabetic characters that may appear in a word—it will miss the accented letters, for
example. As a consequence, there are special character classes defined for regular expressions, including
[:alpha:] , which matches all the alphabetic characters in the current locale.
20One gigabyte is a little more than 1 billion bytes, and 1 byte is 8 bits (i.e., binary digits); double-precision
floating-point numbers in R are stored in 8 bytes (64 bits) each. In a data set composed of floating-point
numbers, then, 4 GB corresponds to 4 × 1024³/8, or somewhat more than 500 million data values.
21We describe how to avoid these problems in Sections 8.4.1 and 8.6.2.
22Objects also have a storage mode, but we will not make use of that.
23The str function is used extensively in Chapter 8 on programming in R.
24Integers in R are so-called long integers, occupying 32 bits, hence the L .
3 Exploring and Transforming Data
Statistical graphs play three important roles in data analysis. Graphs provide
an initial look at the data, a step that is skipped at the peril of the data analyst.
At this stage, we learn about the data, its oddities, outliers, and other
interesting features. John Tukey (Tukey, 1977) coined the term exploratory
data analysis for this phase of an analysis. Graphs are also employed during
model building and model criticism, particularly in diagnostic methods used
to understand the fit of a model. Finally, presentation graphics can summarize
a fitted model for the benefit of others.
In the first two applications, we need to be able to draw many graphs
quickly and easily, while in the presentation phase we should be willing to
spend more time on a graph to get it just right for publication. In this chapter,
we present some basic tools for exploratory graphs, such as histograms,
boxplots, and scatterplots. Some of these tools are standard to R, while others
are in the car package associated with this book. We will return to regression
graphics in Chapter 6, with equally easy to use functions for various
diagnostic methods, which differ from the basic graphs of this chapter mostly
in the quantities that are graphed, not in the graphing paradigm. Finally, in
Chapter 7 we show how to produce customized, potentially elaborate, graphs
suitable for almost any purpose.
3.1.1 HISTOGRAMS
The most common graph of the distribution of a quantitative variable is the
histogram. A histogram dissects the range of the variable into class intervals,
called bins, and counts the number of observations falling in each bin. The
counts—or percentages, proportions, or densities calculated from the counts
—are plotted in a bar graph. An example, constructed by the following R
commands, appears in Figure 3.1:1
Figure 3.1 Default histogram of income in the Canadian occupational-prestige data.
> library(car)
> head(Prestige) # first 6 rows
> with(Prestige, hist(income))
The first of these commands loads the car package, giving us access to the
Prestige data. The second command displays the initial six lines of the data set.
The histogram is drawn by the hist function, in this case with no arguments
other than the variable to be plotted, income . The with command allows hist to
access income from the Prestige data frame (as explained in Section 2.2.2).
The default histogram, produced by hist with no extra arguments, has bins of
equal width, and the height of each bar is equal to the frequency—the number of
observations—in the corresponding bin. In an alternative definition of the
histogram, the height of each bar is selected so that its area is equal to the
fraction of the data in the corresponding bin. To distinguish it from the more
common frequency histogram, we call this latter graph a density histogram. The
hist function draws density histograms if the argument freq is set to FALSE or if
the breaks argument is used to define bins of unequal width.
Figure 3.2 Revised histogram of income .
The shape of the histogram is determined in part by the number of bins— too
few and the plot hides interesting features of the data, too many and the
histogram is too rough, displaying spurious features of the data. The default
method for selecting the number of bins, together with the effort to locate nice
cut points between the bins, can produce too few bins. An alternative rule,
proposed by Freedman and Diaconis (1981), sets the target number of bins to

⌈ n^(1/3)(max − min) / (2(Q3 − Q1)) ⌉

where n is the number of observations, max − min is the range of the data, Q3 − Q1
is the interquartile range, and the ceiling brackets ⌈ ⌉ indicate rounding up to
the next integer. Applying this rule to income in the Canadian occupational-
prestige data produces the histogram in Figure 3.2:
> with(Prestige, hist(income, breaks="FD", col="gray"))
> box()
Setting col="gray" specifies the color of the histogram bars.2 The box function
draws a box around the histogram and could have been omitted. In this example,
both histograms suggest that the distribution of income has a single mode near
$5,000 and is skewed to the right, with several occupations that have relatively
large incomes.
As with most of the graphics functions in R, hist has a dizzying array of
arguments that can change the appearance of the graph:
> args(hist.default)
function (x, breaks = "Sturges", freq = NULL, probability = !freq, include.lowest = TRUE,
right = TRUE, density = NULL, angle = 45, col = NULL, border = NULL, main =
paste("Histogram of", xname), xlim = range(breaks), ylim = NULL, xlab = xname, ylab,
axes = TRUE, plot = TRUE, labels = FALSE, nclass = NULL, …)
The args command (Section 1.1.2) displays all the arguments of a function. We
asked for the arguments for hist.default rather than just hist because hist is
a generic function and it is the default method that actually draws the graph.3
While a more complete description is available from ?hist , here are some of
the key arguments. The breaks argument is used to specify the edges of the bins.
We can choose these values ourselves [e.g., breaks=c(0, 5000, 10000,
15000, 20000, 25000) ], give the number of bins we want (e.g., breaks=10 ),
or set breaks equal to the name of a rule that will determine the number of
equal-size bins (e.g., breaks ="FD" ). The possible settings are given on the help
page for the function. The xlab , ylab , and main arguments are used, as in most
graphical functions in R, to label the horizontal axis, vertical axis, and plot title,
respectively. If we don’t set these arguments, then hist will construct labels that
are often reasonable. The remaining arguments generally change the appearance
of the graph.
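For example, to choose the bin boundaries and axis labels ourselves, we could use a command along the following lines (the particular cut points and labels here are only illustrative):

> with(Prestige, hist(income, breaks=c(0, 5000, 10000, 15000, 20000, 30000),
+     xlab="Average income, dollars", main="Histogram of income"))

Because these bins are not all of equal width, hist will draw a density histogram rather than a frequency histogram.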
You may be familiar with stem-and-leaf displays, which are histograms that
encode the numeric data directly in their bars. We believe that stem-and-leaf
displays, as opposed to more traditional histograms, are primarily useful for
what Tukey (1977) called scratching down numbers—that is, paper-and-pencil
methods for visualizing small data sets. That said, stem-and-leaf displays may be
constructed by the standard R function stem ; a more sophisticated version,
corresponding more closely to Tukey’s original stem-and-leaf display, is
provided by the stem.leaf function in the aplpack package.
If you are looking for fancy three-dimensional effects and other chart junk (an
apt term coined by Tufte, 1983) that are often added by graphics programs to
clutter up histograms and other standard graphs, you will have to look
elsewhere: The basic R graphics functions intentionally avoid chart junk.
3.1.2 DENSITY ESTIMATION
The kernel estimate of the density at the point x is

p̂(x) = (1/nh) Σᵢ K[(x − xᵢ)/h]

where p̂(x) is the estimated density at the point x, the xᵢ are the n observations on
the variable, and K is a kernel function—generally a symmetric, single-peaked
density function, such as the normal density. The quantity h is called the
bandwidth, and it controls the degree of smoothness of the density estimate: If h
is too large then the density estimate is smooth but biased as an estimator of the
true density, while if h is too small then bias is low but the estimate is too rough
—that is, the variance of the estimator is large.
The density function in R implements kernel-density estimation, by default
using a normal kernel and a reasonable method for selecting h to balance
variance and bias.4 Applying the density function to income in the Prestige
data:
> with(Prestige, {
+ hist(income, breaks="FD", freq=FALSE, ylab="Density")
+ lines(density(income), lwd=2)
+ lines(density(income, adjust=0.5), lwd=1)
+ rug(income)
+ box()
+ })
This example, which produces Figure 3.3, illustrates how an R graph can be built
up by successive calls to graphics functions. The hist function constructs the
histogram, with freq=FALSE to specify density scaling and ylab="Density"
furnishing the label for the vertical axis of the graph. The lines function draws
the density estimate on the graph, the coordinates of which are calculated by the
call to density . The argument lwd=2 draws a double-thick line. The second call
to density , with adjust=0.5 , specifies a bandwidth half the default value and
therefore produces a rougher density estimate, shown in the figure as a lighter
line, lwd=1 . The rug function is used to draw a one-dimensional scatterplot or
rug-plot at the bottom of the graph. The curly braces {} define a compound
command as the second argument to with , allowing us to specify several
commands that use the Prestige data. In this case, the default bandwidth
appears to do a good job of balancing detail against roughness, and using half
the default bandwidth produces a density estimate that is too rough.
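3.1.3 QUANTILE-COMPARISON PLOTS
A quantile-comparison (QQ) plot compares the ordered data with corresponding quantiles of a reference distribution, by default the normal distribution. The qqPlot command in the car package that drew Figure 3.4 is not reproduced above; judging from the description that follows, it was a call along these lines:

> with(Prestige, qqPlot(income, labels=rownames(Prestige), id.n=3))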
As do most of the graphics functions in the car package, qqPlot supports both
interactive and automatic marking of extreme points. The labels argument is
used to provide labels to mark the points, here the row names of the Prestige
data frame. By setting id.n=3 , automatic point marking is turned on, and the
three most extreme points are labeled on the plot. The labels of the marked
points are returned in the R console.5
The qqPlot function can be used more generally to plot the data against any
reference distribution for which there are quantile and density functions in R,
which includes just about any distribution that you may wish to use. Simply
specify the root word for the distribution. For example, the root for the normal
distribution is norm , with density function dnorm and quantile function qnorm .
The root for the chi-square distribution is chisq , with density and quantile
functions dchisq and qchisq , respectively. Root words for some other
commonly used distributions are binom and pois for the binomial and Poisson
distributions, respectively (which, as discrete distributions, have probability-mass
functions rather than density functions), f for the F distribution, t for the t
distribution, and unif for the uniform distribution.
Figure 3.4 Normal quantile-comparison plot for income . The broken lines give a pointwise 95%
confidence envelope around the fitted solid line. Three points were labeled automatically. Because many
points, especially at the right of the graph, are outside the confidence bounds, we have evidence that the
distribution of income is not like a sample from a normal population.
Table 3.1 Arguments for some standard probability functions in R. Most of the arguments are self-
explanatory. For the binomial distribution, size represents the number of binomial trials, while prob
represents the probability of success on each trial. For the Poisson distribution, lambda is the mean. Not all
arguments are shown for all functions; consult the R help pages for details. The columns of the
table are Distribution, Density or Mass Function, and Quantile Function.
In addition to density and quantile functions, R also provides cumulative
distribution functions, with the prefix p , and pseudo-random number generators,
with prefix r . For example, pnorm gives cumulative probabilities for the normal
distributions, while rnorm generates normal random variables. Table 3.1
summarizes the principal arguments to these probability functions.
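For example, for the standard-normal distribution (a brief illustration of the four prefixes; the particular values are arbitrary):

> dnorm(0)      # density at 0, approximately 0.399
> pnorm(1.96)   # cumulative probability below 1.96, approximately 0.975
> qnorm(0.975)  # the 0.975 quantile, approximately 1.96
> rnorm(3)      # three pseudo-random draws from N(0, 1)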
To illustrate, we use the rchisq function to generate a random sample from
the chi-square distribution with 3 df and then plot the sample against the
distribution from which it was drawn, producing Figure 3.5:
Figure 3.5 Quantile-comparison plot of a sample of size n = 100 from the χ²(3) distribution against the
distribution from which the sample was drawn.
> set.seed(124) # for reproducibility
> qqPlot(rchisq(100, 3), distribution="chisq", df=3)
The points should, and do, closely match the straight line on the graph, with the
fit a bit worse for the larger values in the sample. The confidence envelope
suggests that these deviations for large values are to be expected, as they reflect
the long right tail of the χ²(3) density function.
3.1.4 BOXPLOTS
The final univariate display that we describe is the boxplot. Although
boxplots are most commonly used to compare distributions among groups (as in
Section 3.2.2), they can also be drawn to summarize a single sample, providing a
quick check of symmetry and the presence of outliers. Figure 3.6 shows a
boxplot for income , produced by the Boxplot function in the car package:6
> Boxplot(~ income, data=Prestige)
The variable to be plotted is given in a one-sided formula: a tilde (~ ) followed
by the name of the variable. This variable is contained in the data frame
Prestige , and the data argument is used to tell the function where to find the
data. Most graphical functions that use a formula accept a data argument.
Figure 3.6 Boxplot of income . Several outlying observations were labeled automatically.
Figure 3.7 Simple scatterplot of prestige versus income for the Canadian occupational-prestige data.
3.2.1 SCATTERPLOTS
A scatterplot is the familiar graph of points with one quantitative variable on
the horizontal or x-axis and a second quantitative variable on the vertical or y-
axis. Understanding, and using, scatterplots is at the heart of regression analysis.
There is typically an asymmetric role of the two axes, with the y-axis reserved
for a response variable and the x-axis for a predictor.
The generic plot function is the primary tool in R for drawing graphs in two
dimensions. What this function produces depends on the values of its first one or
two arguments.7 If the first two arguments to plot are numeric vectors, then we
get a scatterplot, as in Figure 3.7:
> with(Prestige, plot(income, prestige))
The first argument to plot is the x-axis variable, and the second argument is
the y-axis variable. The scatterplot in Figure 3.7 is a summary graph for the
regression problem in which prestige is the response and income is the
predictor. As our eye moves from left to right across the graph, we see how the
distribution of prestige changes as income increases. In technical terms, we are
visualizing the conditional distributions of prestige given values of income .
The overall story here is that as income increases, so does prestige , at least up
to about $10,000, after which the value of prestige stays more or less fixed on
average at about 80.
We write E( prestige |income ) to represent the mean value of prestige as
the value of income varies and call this the conditional mean function or the
regression function. The qualitative statements in the previous paragraph
therefore concern the regression function. The variance function, Var( prestige
|income ), traces the conditional variability in prestige as income changes—that
is, the spread of y in vertical strips in the plot. As in Figure 3.7, when the tilt of a
scatterplot changes, it is difficult to judge changes in conditional variability from
a simple scatterplot.
Scatterplots are useful for studying the mean and variance functions in the
regression of the y-variable on the x-variable. In addition, scatterplots can help
us identify outliers—points that have values of the response far different from
the expected value—and leverage points—cases with extremely large or small
values of the predictor. How these ideas relate to multiple regression is a topic
discussed in Chapter 6.
PLOT ENHANCEMENTS
Scatterplots can be enhanced by adding curves to the graphs and by
identifying unusual points. A scatterplot smoother provides a visual estimate of
the regression function, either using a statistical model such as simple linear
regression or nonparametrically, without specifying the shape of the regression
curve explicitly.
The scatterplot function in the car package draws scatterplots with
smoothers, as in Figure 3.8:
> scatterplot(prestige ~ income, span=0.6, lwd=3,
+ id.n=4, data=Prestige)
Figure 3.8 Enhanced scatterplot of prestige by income . Several points were identified automatically.
The variables for the coded scatterplot are given in a formula as y~x | g ,
which we read as plotting y on the vertical axis and x on the horizontal axis and
marking points according to the value of g (or “y vs. x given g ”).
We selected a large span, span=0.75 , for the lowess smoothers because of the
small number of observations in the occupational groups. The legend for the
graph, automatically generated by the scatterplot function, can be suppressed
with legend.plot=FALSE . We set the colors for points and lines with the
argument col=gray(c(0, 0.5, 0.7)) , which generates three levels of gray for
the three occupational types. If we omit the col argument, scatterplot will
select the colors for us. The argument id.n=0 was included as a reminder that we
could have specified automatic point marking by setting id.n to the number of
points to be labeled; this argument was unnecessary because id.n=0 is the
default value.
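The scatterplot command that produced the coded plot in Figure 3.9 does not appear above; based on the description, it was a call of roughly the following form (the argument values are reconstructed from the text):

> scatterplot(prestige ~ income | type, data=Prestige, span=0.75,
+     id.n=0, col=gray(c(0, 0.5, 0.7)))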
Figure 3.9 allows us to examine three regression functions simultaneously: E(
prestige |income , type = bc ), E( prestige |income , type = wc ), and E(
prestige |income , type = prof ). The nonlinear relationship in Figure 3.8 has
disappeared, and we now have three reasonably linear regressions with different
slopes. The slope of the relationship between prestige and income looks
steepest for blue-collar occupations and looks least steep for professional and
managerial occupations.
Figure 3.9 Scatterplot of prestige by income , coded by type of occupation.
JITTERING SCATTERPLOTS
Discrete, quantitative variables typically result in uninformative scatterplots.
The example in Figure 3.10a was produced by the plot command:
> head(Vocab)
> nrow(Vocab)
[1] 21638
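The plot command itself is missing above; presumably it was simply a call such as

> plot(vocabulary ~ education, data=Vocab)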
The data for this illustration, from the Vocab data frame in the car package,
come from the U.S. General Social Surveys, 1972–2004, conducted by the
National Opinion Research Center. The two variables in the plot are education
in years and the respondent’s score on a 10-word vocabulary test.
Figure 3.10 Scatterplots of vocabulary by education: (a) unjittered, (b) default jittering, and (c) twice
default jittering, with least-squares and lowess lines.
Because education can take on only 21 distinct values and vocabulary only 11
distinct values, most of the nearly 22,000 observations in the data set are
overplotted; indeed, almost all the possible 21 × 11 = 231 plotting positions are
occupied, producing a meaningless rectangular grid of dots.
Jittering the data by adding a small random quantity to each coordinate serves
to separate the overplotted points. We can use the jitter function in R for this
purpose:
> plot(jitter(vocabulary) ~ jitter(education), data=Vocab)
The result is shown in Figure 3.10b. We can control the degree of jittering via
the argument factor ; for example, specifying factor=2 doubles the jitter,
yielding a more satisfactory result for the current example:
> plot(jitter(vocabulary, factor=2) ~ jitter(education, factor=2),
+ col="gray", cex=0.5, data=Vocab)
To render the individual points less prominent, we plot them in gray and use the
argument cex=0.5 to make the points half the default size. To complete the
picture, we add least-squares and nonparametric-regression lines, using the
original unjittered data for these computations, producing Figure 3.10c:
> with(Vocab, {
+ abline(lm(vocabulary ~ education), lwd=3, lty="dashed")
+ lines(lowess(education, vocabulary, f=0.2), lwd=3)
+ })
The least-squares line on the graph is computed by lm and drawn by abline ; the
argument lwd to abline sets the width of the regression line, while the line type
lty="dashed" specifies a broken line. The lowess function returns the
coordinates for the local-regression curve, which is drawn by lines ; the span of
the local regression is set by the argument f to lowess , and we take advantage
of the very large data set by using a small span. The relationship between
vocabulary and education appears nearly linear, and we can also discern other
features of the data that previously were hidden by overplotting, such as the
relatively large number of respondents with 12 years of education.
We could have more conveniently used the jitter argument to the
scatterplot function in the car package to make the graphs in Figures 3.10b
and c, but we wanted to demonstrate how to construct a simple plot from its
components (a topic described in detail in Chapter 7).
> nrow(Ornstein)
[1] 248
The variables in the data set include the assets of the corporation in millions of
dollars; the corporation’s sector of operation, a factor (i.e., categorical variable)
with 10 levels (categories); the nation in which the firm is controlled, CAN
(Canada), OTH (other), UK , and US ; and the number of interlocking directorate
and executive positions (interlocks ) maintained between each company and
others in the data set. Figure 3.11a shows a boxplot of the number of interlocks
for each level of nation :
Figure 3.11 (a) Parallel boxplots of interlocks by nation of control, for Ornstein’s interlocking-
directorate data. (b) A mean/standard deviation plot of the same data.
> Boxplot(interlocks ~ nation, data=Ornstein, main="(a)")
Because the names of the companies are not given in the original source, the
points are labeled by case numbers. The firms are in descending order by assets,
and thus, the identified points are among the largest companies.
A more common plot in the scientific literature is a graph of group means with
error bars showing ±1 SD around the means. This plot can be drawn
conveniently using the plotCI function in the plotrix package, as shown, for
example, in Figure 3.11b:9
> library(plotrix)
> means <- with(Ornstein, tapply(interlocks, nation, mean))
> sds <- with(Ornstein, tapply(interlocks, nation, sd))
> plotCI(1:4, means, sds, xaxt="n", xlab="Nation of Control",
+ ylab="interlocks", main="(b)", ylim=c(0, 100))
> lines(1:4, means)
> axis(1, at=1:4, labels = names(means))
The tapply function (described in Section 8.4) is used here to compute the
means and SDs for each level of nation . The graph is drawn with plotCI . The
first argument gives the coordinates on the horizontal axis, the second gives the
coordinates on the vertical axis, and the third is the vector of SDs. The standard
graphical argument xaxt="n" suppresses the x-axis tick marks and labels, and
the ylim argument is used here to match the vertical axis of Panel b with that of
Panel a. The lines function joins the means with lines, and the axis function
labels the horizontal axis with the names of the groups.
The parallel boxplots in Figure 3.11a and the mean/SD plot in Figure 3.11b
purport to provide similar information, but the impression one gets from the two
graphs is very different. The boxplots allow us to identify outliers and recognize
skewness, with a few larger values in each level of nation . The mean/SD graph
is misleading: Instead of showing the outliers, the graph inflates both the mean
and the SD for Canada and disguises the skewness that is obvious in the
boxplots. Both graphs, however, suggest that the variation among firms within
nations is greater than the differences between nations.
Figure 3.12 Three-dimensional scatterplot for Duncan’s occupational-prestige data, showing the least-
squares regression plane. Three unusual points were labeled automatically.
Figure 3.13 Scatterplot matrix for the Canadian occupational-prestige data, with density estimates on the
diagonal.
3.4.1 LOGARITHMS
The single most important transformation of a strictly positive variable is the
logarithmic transformation.12 You will encounter natural logs to the base e ≈
2.718, common logs to the base 10, and sometimes logs to the base 2, but the
choice of base is inconsequential for statistical applications because logs to
different bases differ only by multiplication by a constant. For example, logₑ(x)
= log₂(x)/log₂(e) ≈ 0.693 log₂(x). That said, common logs and logs to the base
2 can simplify the interpretation of results: For example, increasing the common
log by 1 multiplies the original quantity by 10, and increasing log2 by one
multiplies the original quantity by 2.
The R functions log , log10 , and log2 compute the natural, base-10, and
base-2 logarithms, respectively:
> log(7)  # natural logarithm
[1] 1.946
> log10(7)  # base-10 logarithm
[1] 0.8451
> log2(7)  # base-2 logarithm
[1] 2.807
Figure 3.14 Distribution of assets in the Ornstein data set (a) before and (b) after log transformation.
The exp (exponential) function computes powers of e, and thus, exp(1) = e. The
log and logb functions can also be used to compute logs to any base:
> log(7, base=10)  # equivalent to log10(7)
[1] 0.8451
> logb(7, base=10)
[1] 0.8451
The log functions in R work on numeric vectors, matrices, and data frames. They
return NA for missing values, NaN (not a number) for negative values, and -Inf
(−∞) for zeros.
The Ornstein data set (introduced in Section 3.2.2) includes measurements of
the assets of n = 248 large Canadian companies. Density plots of assets and their
logs are shown in Figure 3.14:
> par(mfrow=c(1, 2))
> with(Ornstein, plot(density(assets), xlab="assets", main="(a)"))
> with(Ornstein, plot(density(log10(assets)),
+ xlab="base−10 log of assets", main="(b)"))
The command par(mfrow=c(1, 2)) produces two panels in the graph, displayed
horizontally (see Section 7.1). Figure 3.14a is a typical distribution of a variable
that represents the size of objects—what Tukey (1977) calls an amount. In this
case, size is measured in dollars. Most of the data values are reasonably similar,
but a few values are very large, and the distribution is consequently positively
skewed. Logarithms spread out the small values and compress the large ones,
producing the more symmetric distribution seen in Figure 3.14b. The log
transformation does not achieve perfect symmetry, however, and there is a
suggestion that the distribution of the log-transformed variable has more than
one mode, a property of the data that is disguised by the skew in Figure 3.14a.
Nevertheless the log-transformed data are far better behaved than the
untransformed data.
Figure 3.15 Infant mortality rate and gross domestic product per capita, from the United Nations data set:
(a) untransformed data and (b) both variables log-transformed.
Logarithms are sufficiently important in data analysis to have a rule: For any
strictly positive variable with no fixed upper bound whose values cover two or
more orders of magnitude (i.e., powers of 10), replacement of the variable by its
logarithm is likely helpful. Conversely, if the range of a variable is considerably
less than an order of magnitude, then transformation by logarithms, or indeed
any simple transformation, is unlikely to make much of a difference.
The data frame UN in the car package, with data obtained from the United
Nations, contains the infant.mortality rate (infant deaths per 1,000 live births)
and gdp [per capita gross domestic product (GDP), in U.S. dollars] for 207
countries in 1998:13
> scatterplot(infant.mortality ~ gdp, data=UN, xlab="GDP per Capita",
+ ylab="Infant Mortality Rate (per 1000 births)", main="(a)",
+ boxplot=FALSE)
The graph in Figure 3.15a simply plots the data as provided. In Panel b, we used
the argument log="xy" , which also works with plot , to draw both axes on log
scales but to label the axes in the units of the original variables.
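The command for Panel b is not shown; assuming that it repeats the call for Panel a with the log="xy" argument added, it would be approximately

> scatterplot(infant.mortality ~ gdp, data=UN, xlab="GDP per Capita",
+     ylab="Infant Mortality Rate (per 1000 births)", main="(b)",
+     log="xy", boxplot=FALSE)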
The dominant feature of the plot in Figure 3.15a is that many of the countries
are very poor, and the poorer countries have highly variable infant mortality
rates. There is very little visual resolution, as nearly all the data points
congregate at the far left of the graph. The lowess smooth, however, suggests
that average infant.mortality decreases with gdp , steeply at first and then at a
decreasing rate. The log-scale plot tells a clearer story, as the points now fall
close to a straight line with a negative slope, also suggesting that increasing gdp
corresponds to decreasing infant.mortality . A few of the points, however, are
relatively far from the least-squares line, with Tonga and (perhaps surprisingly)
Bosnia having relatively low infant mortality rates for their GDP, and Iraq and
Afghanistan having relatively high infant mortality. The transformations to log
scales achieve visual interpretability, near-linearity, and constant variance across
the plot.
From Figure 3.15b, we can posit a regression model of the form

log(infant.mortality) = β0 + β1 log(gdp) + ε

Fitting this model by least squares:

> lm(log(infant.mortality) ~ log(gdp), data=UN)

Call:
lm(formula = log(infant.mortality) ~ log(gdp), data = UN)
Coefficients:
(Intercept) log(gdp)
7.045 −0.493
For the estimated slope b1 = −0.493, increasing gdp by 1% multiplies
infant.mortality by 1.01^−0.493 ≈ 0.995, and so the estimated
infant.mortality would be 0.5% smaller—a substantial amount. Put
another way, if we compare pairs of countries that differ by 1% in their gdp , on
average the country with the 1% higher gdp will have a 0.5% lower
infant.mortality , a percentage approximately equal to the estimated
regression coefficient, b1 = −0.493. Economists call a coefficient such as β1 in a
log-log regression an elasticity.
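The arithmetic is easily checked in R:

> 1.01^(-0.493)   # approximately 0.995, a 0.5% decrease in infant.mortality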
For λ ≠ 0, the scaled power transformations, TBC(x, λ) = (x^λ − 1)/λ, are
essentially x^λ, because the scaled-power family only subtracts 1 and divides by
the constant λ. One can show that as λ approaches 0, TBC(x, λ) approaches
logₑ(x), and so the scaled-
power family includes the invaluable log transformation as a special case,
whereas the basic power family doesn’t. Also, the scaled power TBC( x, λ)
preserves the order of the x values, while the basic powers preserve order only
when λ is positive and reverse the order of x when λ is negative.
The family of scaled-power transformations was first used in a seminal paper
by Box and Cox (1964), and this family is often called the Box-Cox (BC)
transformations in their honor.15 Box and Cox used the scaled powers to help
determine the transformation of the response in linear regression (a topic that we
will discuss in Section 6.4.1).
The scaled-power family can be computed using the bcPower function in the
car package; for example, for λ = 0.5,
> bcPower(1:5, 0.5)
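The output of bcPower is omitted above, but the values are easy to verify directly from the definition of the scaled-power transformation with λ = 0.5:

> (sqrt(1:5) - 1)/0.5   # (x^0.5 - 1)/0.5: roughly 0.000 0.828 1.464 2.000 2.472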
The families of basic and Box-Cox powers are applicable only when the data
to be transformed are all positive and generally not when the data contain
negative values or 0s: Some of the power transformations—for example, square
root and log—are undefined for negative values, and others—for example,
square—won’t preserve the order of the data when both positive and negative
values are present. A simple solution is to add a sufficiently large positive
constant, called a start by Mosteller and Tukey (1977), to the data to make all
the data positive prior to transformation.
Transformations are effective only when the data span a sufficiently large
range. When the ratio of the largest to the smallest data value is not much higher
than 1, the transformations are nearly linear and thus don’t bend the data. A
negative start can be used to move the data closer to 0, increasing the ratio of the
largest to the smallest value and making the transformations more effective.
The car package also includes a function for a second family of power
transformations, TYJ( x, λ), due to Yeo and Johnson (2000), which can be used
when the variable to be transformed is not strictly positive. The Yeo-Johnson
family is defined as TBC( x + 1, λ) for nonnegative values of x and TBC( −x + 1, 2
− λ) for negative values of x. Yeo-Johnson powers are computed by the yjPower
function:
> yjPower(-5:5, 0.5)
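The symbox command in the car package that produced Figure 3.16 is not shown above; it would be a call such as

> symbox(~ gdp, data=UN)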
The resulting graph is shown in Figure 3.16. By default, symbox uses the
function bcPower and displays boxplots of the transformed variable for several
transformations down the ladder of powers; here, the log transformation of gdp
does the best job of making the distribution of the variable symmetric.
Figure 3.16 Boxplots of various power transformations of gdp in the United Nations data.
The arcsine square-root transformation of a proportion x is Tasinsqrt(x) = sin⁻¹(√x),
and is computed in R as, for example,
> asin(sqrt(seq(0, 1, length=11)))
The logit transformation is not defined for sample proportions exactly equal to 0
or 1, however. We can get around this limitation by remapping proportions from
the interval [0, 1] to [.025, .975], for example, taking the logit of .025 + .95×p
rather than the logit of p. The logit function in the car package takes care of
remapping proportions or percentages when there are 0s or 1s, or 0% or 100%
for percentage data, printing a warning if remapping is required:
> logit(seq(0, 1, 0.1))
Warning message:
In logit(seq(0, 1, 0.1)) : Proportions remapped to (0.025,0.975)
Even better, if we have access to the original data from which the proportions are
calculated, we can avoid proportions of 0 or 1 by computing empirical
logits, logit[(x + 1/2)/(n + 1)], where x is the number of successes in n trials.
We apply the logit and arcsine square-root transformations to the distribution
of the gender composition of occupations in the Canadian occupational-prestige
data:
> par(mfrow=c(1, 3))
> with(Prestige, {
+ plot(density(women, from=0, to=100),
+ main="(a) Untransformed")
+ plot(density(logit(women), adjust=0.75),
+ main="(b) Logit")
+ plot(density(asin(sqrt(women/100)),
+ adjust=0.75), main="(c) Arcsine square-root")
+ })
The resulting density plots are shown in Figure 3.17. The density plot for the
untransformed percentages is confined to the domain 0 to 100. The
untransformed data, in Panel a, stack up near the boundaries, especially near 0.
The logit-transformed data, in Panel b, appear better behaved, and the density
plot reveals three apparent concentrations or groups of occupations. The arcsine
square-root transformed data, in Panel c, are similar. We adjusted the bandwidth
of the density estimators for the transformed data to resolve the third peak in the
distribution. When there are multiple modes in a distribution, the default
bandwidth is often too large.
Figure 3.17 Distribution of women in the Canadian occupational-prestige data: (a) untransformed, (b) logit
transformed, and (c) arcsine square-root transformed.
1. People’s height and weight are combined in the body mass index, BMI =
weight/height², which is intended to measure body composition.
2. A variable like month.number can be replaced by sin( month.number /12)
and cos( month.number /12) to model seasonal time trends.
3. In a study of highway accident rates, a variable giving the number of
signals in a highway segment is converted to the number of signals per mile
by dividing by the length of the segment. More generally, measures of total
size often need to be converted to size per unit—for example, converting
GDP to a per capita basis by dividing by population.
The upshot of these examples is that one should think carefully about how
variables are expressed in the substantive context in which they are used in
research.
Figure 3.18 Spread-level plot for the relationship between number of interlocks and nation of control in
Ornstein’s interlocking-directorate data.
The spreads in the transformed data for the four groups are much more similar
than the spreads in the untransformed data shown in Figure 3.11 (p. 122).
We used two arguments in the call to Boxplot : a formula with the base-10
logarithm of interlocks + 1 on the left-hand side and the factor nation on the
right-hand side; and the data frame in which these variables reside, Ornstein .
The remaining commands make the graph more elaborate, by first increasing the
right-side margin of the plot and then adding a second axis on the right, labeled
in the original units of interlocks , produced by the basicPowerAxis function
in the car package.16 The argument power=0 to basicPowerAxis specifies the
log transformation; base=10 , the base used for the logs; at=c(1, 3, 6, 11,
21, 51, 101) , where the tick marks are to appear on the interlocks + 1 scale;
and start=1 , the start that was used, so that the tick labels can be adjusted to the
original interlocks scale. The functions bcPowerAxis , yjPowerAxis , and
probabilityAxis in the car package may be used similarly to produce axes on
the untransformed scale corresponding to Box-Cox, Yeo-Johnson, and logit
transformations. Finally, we restored the graphical settings in par to their
original values so that future graphs will not have extra space on the right. An
alternative would have been simply to close the graphics device window,
because a subsequently opened graphics device would revert to the default
settings of graphical parameters such as mar .
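The commands described in this paragraph are not reproduced above. A sketch consistent with the description—the margin widths, the side argument, and the axis title are our guesses—might look like this:

> old.par <- par(mar=c(5.1, 4.1, 4.1, 4.1))   # widen the right-side margin
> Boxplot(log10(interlocks + 1) ~ nation, data=Ornstein)
> basicPowerAxis(power=0, base=10, at=c(1, 3, 6, 11, 21, 51, 101),
+     start=1, side="right", axis.title="interlocks")
> par(old.par)                                # restore the original margins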
A variance-stabilizing (or spread-stabilizing) transformation can also be
selected formally by estimating a transformation parameter (described in Section
3.4.7).
Figure 3.20 Mosteller and Tukey’s bulging rule for finding linearizing transformations: When the bulge
points down, transform y down the ladder of powers; when the bulge points up, transform y up; when the
bulge points left, transform x down; when the bulge points right, transform x up.
Figure 3.21 (a) Inverse transformation plot for the relationship of prestige to income in the Prestige data,
and (b) scatterplot of prestige versus income 1/3.
More generally, imagine fitting a regression for generic y and x variables that
is made linear by a yet-to-be-determined Box-Cox power transformation of x,

y = β0 + β1 TBC(x, λ) + ε    (3.2)
In Figure 3.21a, we explicitly set the label for the horizontal axis of the graphs,
and set col.lines to be shades of gray for printing in this book. The
invTranPlot function also produces printed output, displaying the values of λ
that it used and the corresponding residual sums of squares for the regression
specified by Equation 3.2. Small values of the residual sum of squares indicate
better agreement between the fitted line and the data.
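The invTranPlot command that produced Figure 3.21a is not shown above; going by the description, it was a call along the lines of the following (the particular shades of gray are a guess):

> invTranPlot(prestige ~ income, data=Prestige, xlab="income",
+     col.lines=gray(c(0, 0.25, 0.5, 0.7)))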
In the code to draw Figure 3.21b, we used the I (identity) function so that
income^(1/3) produces the cube root of income , because the ^ operator otherwise has a
special meaning in a model formula (see Section 4.8.1); and we use the
expression function to typeset the superscript 1/3 in the x-axis label.
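A command consistent with that description—our reconstruction, not necessarily the one used for the figure—is

> plot(prestige ~ I(income^(1/3)), data=Prestige,
+     xlab=expression(income^{1/3}))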
As shown in Figure 3.21a, none of the curves produced by λ in ( −1, 0, 1),
with the possible exception of 0 (i.e., log), match the data well, but λ close to 1/3
does a good job, except perhaps at the highest levels of income . Figure 3.21b
shows the result of taking the cube root of income.
In light of our earlier discussion, a cube-root transformation rather than a log
transformation of income is unexpected. We saw, however, in Section 3.2.1 that
the relationship between income and prestige is different for each level of type
, so choosing a transformation ignoring type , as we have done here, can be
misleading.
Because the UN data frame has only two numeric columns, we can simply use the
data frame as the first argument to powerTransform . An equivalent command
would be
> with(UN, summary(powerTransform(cbind(infant.mortality, gdp))))
The cbind function turns the two variables into a two-column matrix, and a
scaled-power transformation parameter is estimated for each column. Were there
more variables in the data frame, we could also select the variables of interest by
indexing the data frame (see Section 2.3.4), as in
> summary(powerTransform(UN[ , c("infant.mortality", "gdp")]))
The estimated powers for both variables are near 0, and λ = 0 (the log
transformation) is within the marginal confidence interval for each
transformation parameter. Also given in the output are two likelihood ratio tests,
the first that logs are appropriate for all the variables simultaneously, and the
second that no transformation, λ = 1, is necessary for any variable, against the
alternative that at least one variable requires transformation. The first test has a
large p value, suggesting that logs will work well here, as we have seen in Figure
3.15 (p. 129), and the second has a tiny p value, suggesting that using the
untransformed data is a bad idea.
The powerTransform function also allows for estimating transformations of
variables after conditioning on a set of predictors that are not transformed. The
most common example of this situation is the transformation of one or more
numeric variables after adjusting for a grouping factor or factors. For example,
we used a transformation of number of interlocks in Ornstein’s Canadian
interlocking-directorate data (in Section 3.4.5) in an attempt to equalize the
variance within each national group. The powerTransform function can be used
to find a variance-stabilizing transformation:
> summary(powerTransform(interlocks ~ nation,
+ data=Ornstein, family="yjPower"))
The variables to the left of the ~ in the formula—in this case, just interlocks —
will be transformed, while the variables to the right are the conditioning
variables—in this case, just the single predictor nation . This command suggests
a Yeo-Johnson power transformation of interlocks so that the data are as close
as possible to normally distributed within each nation and so that the variances
in the several nations are as similar as possible. By using family="yjPower" ,
we avoid explicitly having to add 1 to the variable, because the Yeo-Johnson
family permits 0 values.
The indicated power is about 0.14, which is close to the value of 0 for a log
transformation, although 0 is excluded by both the Wald interval and the
likelihood ratio test. Whether one would actually use an unusual power such as λ
= 0.14 in preference to the similar log transformation, however, is partly a matter
of taste. We will return to Ornstein’s data in Chapters 5 and 6.
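We next estimate transformations of income and education in the Prestige data simultaneously. The command that created the object p1 referred to below is not shown above; it would have been something like

> p1 <- with(Prestige, powerTransform(cbind(income, education)))
> summary(p1)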
Looking at the likelihood ratio tests, untransformed variables, λ = (1, 1)′, and all
logarithms, λ = (0, 0)′, both have small significance levels, suggesting that neither
of these choices is appropriate. The estimated transformation for income is λ̂1 =
0.26, and the cube root (λ1 = 1/3) is in the Wald confidence interval, while the
log transformation, λ1 = 0, is just outside the interval. In contrast, the value λ2 =
1, representing no transformation of education , is inside the marginal
confidence interval for λ2, which is very broad. Because rounding to nice powers
is standard practice, we test the hypothesis that λ = (1/3, 1)′:
> testTransform(p1, lambda=c(0.33, 1))
The p-value is about 0.30. This test is also shown in the summary output for
powerTransform . Because we favor logarithms, let’s also try λ =(0,1)':
> testTransform(p1, lambda=c(0, 1))
The p value is about 0.01, suggesting that the log transformation of income is not
adequate. Nevertheless, we suggest using log(income) as a starting point for
analysis because this is the standard transformation for a variable such as income
. A prudent approach would be to repeat the analysis employing the cube root of
income to see if any differences result. In Figure 3.22, we have used base-2
logarithms:
> scatterplotMatrix(~ prestige + log2(income) + education + women,
+ span=0.7, data=Prestige)
The panels in this scatterplot matrix show little nonlinearity, although in the plot
of prestige versus log2(income) , the two occupations with the lowest income
appear not to fit the trend in the rest of the data.
We can retrieve the estimated transformations using the coef function:
> coef(p1)
   income education
   0.2617    0.4242
> coef(p1, round=TRUE)
   income education
     0.33      1.00
where the rounded value is the first element of (1, 0, 0.5, 0.33, −0.5, −0.33, 2,
−2) in the confidence interval for a transformation parameter. If none of these
values are in the confidence interval for the transformation parameter, the
unrounded estimate is provided.
We can add transformed variables to the data using the transform function (as
discussed in Section 2.2.5):
Figure 3.22 Scatterplot matrix with transformed predictors for the Prestige data set.
> Prestige <- transform(Prestige, log2income = log2(income))
The variable type is a factor that divides occupations into white collar, blue
collar, and professional/managerial categories. We can also seek transformations
that will normalize the distributions of the numeric predictors within each level
of type , as would be desirable if type were to be included as a predictor:
> summary(p2 <- powerTransform(cbind(income, education) ~ type,
+ data=Prestige))
We set the argument by.groups=TRUE to get separate least-squares fits for each
group and specified gray-scale colors appropriate for this book. We suppressed
the lowess smooths with the argument smooth=FALSE to minimize clutter in the
graph. Approximate within-group linearity is apparent in most of the panels of
Figure 3.23. The plotted points are the same in Figures 3.22 and 3.23—only the
point marking and the fitted lines are different. In particular, the two points with
low income no longer appear exceptional when we control for type .
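The scatterplotMatrix call that produced Figure 3.23 is missing above; a command consistent with the description—with argument values that are our best guess—would be roughly

> scatterplotMatrix(~ prestige + log2(income) + education + women | type,
+     data=Prestige, by.groups=TRUE, smooth=FALSE, id.n=0,
+     col=gray(c(0, 0.5, 0.7)))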
Identifying extreme points can be particularly valuable in graphs used for model
building and diagnostics. Standard R includes one function for this purpose,
identify , which allows interactive point identification. Many users find
identify inconvenient, and so the ability to mark extreme points automatically
can be helpful, even if it is not as desirable in general as interactive
identification. The graphical functions in the car package—including the
scatterplot , scatterplotMatrix , scatter3d , and invTranPlot functions
discussed in this chapter and others to be introduced subsequently— employ a
common strategy for identifying points. In this section, we describe both the
identify function and the point identification scheme used in the car package.
Except for x , y , and … , these arguments can also be set in car higher-level
plotting functions such as scatterplot or invTranPlot . When you call
showLabels directly, the default is to use the identify function to label as many
cases as you like using the mouse as described in the last section. Usually, you
will use this function by setting arguments to the car graphics functions, and for
these, the defaults are different. All car graphics functions have id.n=0 by
default for no point identification. The default id.method for car functions
depends on the function. To get any point identification, you need to set id.n to
a positive number or set id.method="identify" for interactive point identification
(see below). Here is a description of the various arguments to showLabels :
labels is a vector of labels to be used for the points. Most higher-level plotting
functions will find reasonable labels on their own, but sometimes we are
required to supply labels. In these instances, if no labels are provided, then
case numbers are used.
id.method selects the method for identifying points. There are several built-in
methods selected by setting id.method to a quoted string, or you can make
up your own method. The options "x" and "y" will select extreme values on
the x-axis and the y-axis, respectively. If you want both types of extremes
identified, set id.method=list("x", "y") . For use with the scatter3d
function, there is also an argument "z" . The argument id.method="mahal"
labels points with the largest Mahalanobis distance from the point (x̄, ȳ).
This is most appropriate when studying the joint distribution of the two
plotted variables. Labeling by Mahalanobis distance is the default for
scatterplot , scatterplotMatrix , and (employing three-dimensional
Mahalanobis distances) scatter3d ; other functions may use different
defaults.
If m1 is a regression model, then id.method=abs(residuals(m1)) will
select the cases with the largest absolute residuals. Finally, as long as id.n
is a positive value, setting id.method=c(1, 4, 7) would label cases
1, 4, and 7 only, regardless of the value of id.n .
id.n controls the number of points marked. The value id.n=0 turns off point
marking.
id.cex , id.col determine the relative size and the color of the point labels,
respectively.
… allows additional arguments that can be passed to either the identify function
or the text function.
Mosteller and Tukey (1977) and Tukey (1977) were very influential in
convincing applied statisticians of the necessity of looking at their data
through a variety of graphs.
Most of the material in this chapter on examining data is covered in Fox
(2008, chap. 3). Power transformations (Section 3.4.2) are discussed in Fox
(2008, chap. 4).
Plotting in general and scatterplot matrices in particular are also covered in
Weisberg (2005, chap. 1). Much of the material in Section 3.4 is covered in
Weisberg (2005, chaps. 7–8), although some of the discussion in this
section on transformations to multivariate normality conditioning on other
variables is new material.
There is a vast literature on density estimation, including Bowman and
Azzalini (1997) and Silverman (1986).
The lowess smoother used in the scatterplot function and elsewhere in
the car package was introduced by Cleveland (1979) and is but one of
many scatterplot smoothers, including regression and smoothing splines,
kernel regression, and others (see, e.g., Fox, 2000b). Jittering was
apparently proposed by Chambers et al. (1983).
The OpenGL graphics standard is discussed at http://www.opengl.org .
Cook and Weisberg (1999) includes an extended treatment of three-
dimensional plots in regression.
The Wikipedia article on logarithms (at
http://en.wikipedia.org/wiki/Logarithms ) provides a nice
introduction to the topic. Fox (2000a) describes logarithms, powers, and
other mathematics for social statistics.
____________________________
1The Canadian occupational-prestige data set, on which this example is based, was introduced in Section
2.1.2.
2The general use of color in R is discussed in Section 7.1.4.
3See Sections 1.4 and 8.7 on object-oriented programming in R for a detailed explanation of generic
functions and their methods.
4Several freely available R packages provide more sophisticated facilities for density estimation. See, in
particular, the sm package (Bowman and Azzalini, 1997) and the locfit package (Loader, 1999).
5See Section 3.5 for further discussion of point labeling.
6The standard R boxplot function can also be used to draw boxplots, but Boxplot is more convenient,
automatically identifying outliers, for example; indeed, Boxplot is simply a front-end to boxplot .
7The behavior of generic functions such as plot is discussed in Sections 1.4 and 8.7, and more information
about the plot function is provided in Section 3.2.3 and in Chapter 7 on R graphics.
8We use a smoother here and in most of this book as a plot enhancement, designed to help us derive
information from a graph. Nonparametric regression, in which smoothers are substituted for more
traditional regression models, is described in the online appendix to the book. Kernel regression, which is
similar to lowess, is described in Section 7.2.
9These plots are sometimes drawn with intervals of ±1 standard error rather than ±1 SD, and sometimes
these error bars are added to bar charts rather than to a scatterplot of means. We discourage the use of bar
charts for means because interpretation of the length of the bars, and therefore the visual metaphor of the
graph, depends on whether or not a meaningful origin exists for the measured variable and whether or not
the origin is included in the graph. The error bars can also lead to misinterpretation because neither the
standard-error bars nor the standard-deviation bars are the appropriate measure of variation for comparing
means between groups, because they make no allowance or correction for multiple testing, among other
potential problems.
10The class-based object-oriented programming system in R and its implementation through generic
functions such as plot are explained in Sections 1.4 and 8.7.
11Unlike simple Euclidean distance, which is inappropriate when the variables are scaled in different units,
the Mahalanobis distance takes into account the variation and correlational structure of the data.
12If you are unfamiliar with logarithms, see the complementary readings cited at the end of the chapter.
13Some data are missing, however, and only 193 of the 207 countries have valid data for both variables.
14The exp( ) notation represents raising the constant e to a power: Thus, exp(β0) = e^β0 .
15Box and Cox is also the title of an operetta by Gilbert and Sullivan and an 1847 farce by John Maddison
Morton, although the operetta and the play have nothing whatever to do with regression or statistics.
16Many global graphics parameters in R are set or queried with the par function. The mar setting is for the
plot margins; see ?par for details. Graphics parameters are also discussed in Section 7.1.
4 Fitting Linear Models
4.1 Introduction
The linear-regression model expresses the response y as a linear function of
regressors x1, … , xk plus an error,

y = β0 + β1x1 + β2x2 + ··· + βkxk + ε    (4.1)

The quantity ε is called an error, with E(ε|x1, … , xk) = 0 and Var(ε|x1, … , xk) =
σ².
Changing assumptions changes the model. For example, it is common to add a
normality assumption,

ε|x1, … , xk ∼ N(0, σ²)

producing the normal linear model. The normal linear model provides more
structure than is required for fitting linear models by least squares, although it
furnishes a strong justification for doing so.
Another common extension to the linear model is to modify the constant
variance assumption to

Var(ε|x1, … , xk) = σ²/w

for known positive weights w, producing the weighted linear model. There are
myriad other changes that might be made to the basic assumptions of the linear
model, each possibly requiring a modification in methodology.
The basic R function for fitting linear-regression models by ordinary least
squares (OLS) or weighted least squares (WLS) is the lm function, which is the
primary focus of this chapter.
> nrow(Davis)
[1] 200
The variables weight and repwt are in kilograms, and height and repht are in
centimeters. One of the goals of the researcher who collected these data (Davis,
1990) was to determine whether the reports of height and weight are sufficiently
accurate to replace the actual measurements, which suggests regressing each
measurement on the corresponding report. We focus here on measured weight
(weight ) and reported weight (repwt ).
This problem has response y = weight and one predictor, repwt , from which
we obtain the regressor variable x1 = repwt . The simple linear-regression model
is a special case of Equation 4.1 with k = 1. Simple linear regression is fit in R
via OLS, using the lm function:
> davis.mod <- lm(weight ~ repwt, data=Davis)
The formula argument describes the response and the linear predictor, and is the
only required argument. The data argument optionally gives a data frame that
includes the variables to be used in fitting the model.3
The formula syntax was originally proposed by Wilkinson and Rogers (1973)
specifically for use with linear models. Formulas are used more generally in R,
but their application is clearest for regression models with linear predictors, such
as the linear-regression models discussed in this chapter and the generalized
linear models taken up in the next.
A model formula consists of three parts: the left-hand side, the ~ (tilde), and
the right-hand side. The left-hand side of the formula specifies the response
variable; it is usually a variable name (weight , in the example) but may be an
expression that evaluates to the response (e.g., sqrt(weight) , log(income) , or
income/hours.worked ). The tilde is a separator. The right-hand side of the
formula is a special expression including the names of the predictors that R
evaluates to produce the regressors for the model. As we will see later in this
chapter, the arithmetic operators, + , - , * , / , and ^ , have special meaning on
the right-hand side of a model formula; they retain their ordinary meaning,
however, on the left-hand side of the formula.
R will use any numeric predictor on the right-hand side of the model formula
as a regressor variable, as is desired here for simple regression. The intercept is
included in the model without being specified directly. We can put the intercept
in explicitly using weight ~ 1 + repwt , however, or force the regression
through the origin using weight ~ -1 + repwt or weight ~ repwt - 1 . A
minus sign explicitly removes a term—here the intercept—from the linear
predictor. Using 0 in a formula also suppresses the intercept: weight ~ 0 +
repwt . As subsequent examples illustrate, model formulas can be much more
elaborate (and are described in detail in Section 4.8.1).
The lm function returned a linear-model object, which we saved in davis.mod
. We call other functions with davis.mod as an argument to produce and display
useful results. As with any R object, we can print davis.mod by typing its name
at the R command prompt:
> davis.mod
Call:
lm(formula = weight ~ repwt, data = Davis)
Coefficients:
(Intercept) repwt
5.336 0.928
> summary(davis.mod)
Call:
lm(formula = weight ~ repwt, data = Davis)
Residual standard error: 8.42 on 181 degrees of freedom
(17 observations deleted due to missingness)
Multiple R-squared: 0.699, Adjusted R-squared: 0.697
F-statistic: 420 on 1 and 181 DF, p-value: <2e−16
Figure 4.1 Scatterplot of measured weight (weight ) by reported weight (repwt ) for Davis’s data.
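The command that drew Figure 4.1 and labeled the most unusual point is not shown above; given the arguments mentioned below, it was presumably a call to the car scatterplot function such as

> scatterplot(weight ~ repwt, data=Davis, smooth=FALSE, id.n=1)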
[1] "12"
The graph, shown in Figure 4.1, reveals an extreme outlier, Observation 12,
flagged on the graph by the argument id.n=1 . The argument smooth= FALSE
suppresses the lowess smoother (see Section 3.2.1). It seems bizarre that an
individual who weighs more than 160 kg would report her weight as less than 60
kg, but there is a simple explanation: On data entry, Subject 12’s height in
centimeters and weight in kilograms were inadvertently exchanged.
The proper course of action would be to correct the data, but to extend the
example, we instead will use the update function to refit the model by removing
the 12th observation:
> davis.mod.2 <- update(davis.mod, subset=-12)
> summary(davis.mod.2)
Call:
lm(formula = weight ~ repwt, data = Davis, subset = -12)
The update function can be used in many circumstances to create a new model
object by changing one or more arguments. In this case, setting subset=−12
refits the model by omitting the 12th observation. (See Section 4.8.3 for more on
the subset argument.)
Extreme outliers such as this one have several effects on a fitted model, which
will be explored more fully in Chapter 6. They can sometimes determine the
value of the estimated coefficients:
> cbind(Original=coef(davis.mod), NoCase12=coef(davis.mod.2))
The cbind (column bind ) function binds the two vectors of coefficient estimates
into a two-column matrix. The first column gives the coefficient estimates using
all the data and the second after deleting Case 12. Only the intercept changes,
and not dramatically at that, because Case 12 is a relatively low-leverage point,
meaning that its value for the predictor is near the center of the distribution of
predictor values (see Section 6.3). In contrast, there is a major change in the
residual standard error, reduced from an unacceptable 8.4 kg to a possibly
acceptable 2.2 kg. Also, the value of R2 is greatly increased. Finally, the F and t
tests discussed previously are not reliable when outliers are present.
4.2.2 MULTIPLE REGRESSION
Multiple regression extends simple regression to allow for more than one
regressor. To provide an illustration, we return to the Canadian occupational-
prestige data (introduced in Chapter 2):
> head(Prestige)
> nrow(Prestige)
[1] 102
Just as simple regression should start with a graph of the response versus the
predictor, multiple regression should start with the examination of appropriate
graphs, such as a scatterplot matrix. In Section 3.4, we constructed a scatterplot
matrix for the predictors education (the average number of years of education
of the occupational incumbents), income (their average income), and women (the
percentage of women in the occupation), and the response variable prestige
(Figure 3.13, p. 126); based on this graph, we suggested replacing income by its
logarithm. The resulting scatterplot matrix (Figure 3.22, p. 144), in which little
or no curvature is observed in any of the panels of the plot, suggests that this is a
good place to start regression modeling.
We fit Equation 4.1 for the response variable y = prestige , and from the
three predictors we derive k = 3 regressors, x1 = education , x2 =
log2(income ), and x3 = women . Thus, two of the three predictors are directly
represented in the model as regressors, and the other regressor is derived from
the remaining predictor, income . As in simple regression, we fit the model with
the lm function:
> prestige.mod <- lm(prestige ~ education + log2(income) + women,
+ data=Prestige)
The only difference between fitting a simple and multiple linear regression in R
is in the model formula: In a multiple regression, there are several predictors,
and their names are separated in the formula by + signs. R recognizes education
, log2(income) , and women as numeric variables and uses them as the three
regressors.
> summary(prestige.mod)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
FACTORS
The values of qualitative variables are category labels rather than
measurements. Examples of qualitative variables are gender, treatment in a
clinical trial, country of origin, and job title. Qualitative variables can have as
few as two categories or a very large number of categories. An ordinal
categorical variable has categories that have a natural ordering, such as age class,
highest degree attained, or response on a 5-point scale with values from strongly
disagree to strongly agree.
We (and R) call qualitative variables factors and their categories, levels. In
some statistical packages, including SAS and SPSS, factors are called class
variables. Regardless of what they are called, factors are very common, and
statistical software should make some provision for including them in regression
models.
As explained in Section 2.1.2, when the read.table function reads a column
of a data file that includes at least some values that are neither numbers nor
missing-value indicators, it by default turns that column into a factor. An
example is the variable type (type of occupation) in the Prestige data frame:
> Prestige$type
> class(Prestige$type)
[1] "factor"
The three levels of the factor type represent blue-collar (bc ), professional and
managerial (prof ), and white-collar (wc ) occupations. The missing-value
symbol, NA , is not counted as a level of the factor. The levels were automatically
alphabetized when the factor was created, but the order of the factor levels can
be changed:
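One way to do so (a sketch; this particular command is ours, with the new order matching the description below):
> Prestige$type <- factor(Prestige$type, levels=c("bc", "wc", "prof"))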
> levels(Prestige$type)
The reordered levels are in a more natural order. We can also coerce a factor into
a numeric vector:5
> type.number <- as.numeric(Prestige$type)
> type.number[select]
[1] 3 3 2 2 1 1
> class(type.number)
[1] "numeric"
The as.numeric function replaces each level of the factor by its level number,
producing a "numeric" result. It is also possible to coerce type into a vector of
character strings:
> type.character <- as.character(Prestige$type)
> type.character[select]
> class(type.character)
[1] "character"
Finally, we can turn type.character back into a factor with naturally ordered
levels:
> type.factor <- factor(type.character, levels=c("bc", "wc", "prof"))
> type.factor[select]
Suppose that we have a data frame named Drug with a discrete variable called
dosage , whose values are numeric—say 1, 2, 4, 8—indicating the dose of a
drug. When the data are read from a file, read.table by default will not make
this variable a factor, because all its values are numbers.6 We can turn dosage
into a factor using any of the following commands:
> Drug$dosage <- factor(Drug$dosage)
> Drug$dosage <- factor(Drug$dosage,
+ levels=c("8", "4", "2", "1"))
> Drug$dosage <- factor(Drug$dosage,
+ labels=c("D1", "D2", "D4", "D8"))
The first of these commands creates a factor with the levels "1", "2", "4",
"8" , treating the numbers as text. The second command orders the levels
differently. The third command keeps the default ordering but assigns
customized labels to the levels of the factor via the labels argument.
When a factor is included in a model formula, R automatically creates
regressors, called contrasts, to represent the levels of the factor. The default
contrast coding in R is produced by the function contr.treatment: If a factor
has m distinct levels, then contr.treatment creates m − 1 regressors, each of
which is dummy-coded, with values consisting only of 0s and 1s. For example,
suppose we have a factor z with four levels:
> (z <- factor(rep(c("a", "b", "c", "d"), c(3, 2, 4, 1))))
[1] a a a b b c c c c d
Levels: a b c d
We can see the dummy variables that will be created using the model.matrix
function, specifying a model formula with z on the right-hand side:
> model.matrix(~ z)
The first column of the model matrix is a column of 1s, representing the
intercept. The remaining three columns are the dummy regressors constructed
from the factor z . To put it more compactly,
> contrasts(z)
When z = "a" , all the dummy variables are equal to 0; when z = "b" , the
dummy variable zb = 1; when z = "c" , zc = 1; and, finally, when z = "d" , zd =
1. Each level of z is therefore uniquely identified by a combination of the
dummy variables.
R has several other functions apart from contr.treatment for coding
contrasts, and if you don’t like the standard contrast codings, you can make up
your own. The results of an analysis generally don’t depend on the choice of
contrasts to define a factor, except in some testing situations that we will
describe later. Section 4.6 provides an extended discussion of contrast selection.
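For instance, switching a factor to one of the built-in alternative codings is just an assignment to contrasts (a small illustration of ours, using the factor z from above):
> contrasts(z) <- "contr.sum"   # sum-to-zero (deviation) coding
> contrasts(z)                  # inspect the new coding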
> nrow(Baumann)
[1] 66
Like the head function, the some function in the car package prints rows of its
argument: head , the first few rows, and some , a random few rows.
The researchers were interested in whether the new methods produce better
results than the standard method, and whether the new methods differ in their
effectiveness.
> xtabs(~ group, data=Baumann)
Figure 4.2 Post-test reading score by condition, for Baumann and Jones’s data.
When xtabs is used with a one-sided formula, as it is here, it will count the
number of cases in the data set at each level or combination of levels of the
right-hand-side variables, in this case just the factor group . The experimental
design has the same number, 22, of subjects in each group.
The tapply function (described in Section 8.4) computes the means and SDs for
each level of group . The means appear to differ, while the within-group SDs are
similar. Plotting a numeric variable against a factor produces a parallel boxplot,
as in Figure 4.2:
> plot(post.test.3 ~ group, data=Baumann, xlab="Group",
+ ylab="Reading Score")
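The group means and standard deviations mentioned above can be computed with tapply; a minimal sketch:
> with(Baumann, tapply(post.test.3, group, mean))
> with(Baumann, tapply(post.test.3, group, sd))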
The means and boxplots suggest that there may be differences among the groups,
particularly between the new methods and the standard one.
The one-way ANOVA model can be fit with lm :
> baum.mod.1 <− lm(post.test.3 ~ group, data=Baumann)
> summary(baum.mod.1)
Call:
lm(formula = post.test.3 ~ group, data = Baumann)
Residual standard error: 6.31 on 63 degrees of freedom
Multiple R-squared: 0.125, Adjusted R-squared: 0.0967
F-statistic: 4.48 on 2 and 63 DF, p-value: 0.0152
We specify the predictor group in the formula for the linear model. Because R
recognizes that group is a factor, it is replaced with dummy regressors. The
coefficients for the two regressors are the estimates of the difference in means
between the groups shown and the baseline group. For example, the estimated
mean difference between DRTA and basal is 5.68. The t value is the Wald test
that the corresponding population mean difference is equal to 0. The intercept is
the estimated mean for the baseline basal group. No test is directly available in
the summary output to compare the groups DRTA and Strat , but such a test
could be easily generated by refitting the model with a releveled factor, using the
relevel function to set "DRTA" as the baseline (or reference) level of group :
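A call along the following lines (a sketch reconstructed from the description of the formula below) produces the output shown next:
> summary(update(baum.mod.1, . ~ . - group + relevel(group, ref="DRTA")))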
Call:
lm(formula = post.test.3 ~ relevel(group, ref = "DRTA"), data = Baumann)
The periods (“.”) in the updated model formula represent the previous left- and
right-hand sides of the formula; we, therefore, updated the model by removing
the factor group and replacing it with its releveled version. The t test for the
comparison between DRTA and Strat has a p value close to .2.7
Call:
lm(formula = prestige ~ education + log2(income) + type, data = Prestige)
We once again used the update function rather than typing in a model from
scratch. The output indicates that four observations were deleted because of
missing data—that is, four of the occupations have the value NA for type . This
model has three predictors (education , income , and type ), which produce four
regressors plus an intercept—effectively one intercept for each level of type and
one slope for each of the numeric predictors (called covariates).
The estimate for typewc is the difference in intercepts between wc and the
baseline bc . The corresponding t value tests the hypothesis that these two levels
of type have the same intercept. Similarly, the typeprof coefficient is the
difference in intercepts between prof and bc . A test of the natural hypothesis
that all the group intercepts are equal will be given in Section 4.4.4.
There is a slope for each of the two numeric regressors, and according to this
model, the slopes are the same for each level of the factor type , producing
parallel regressions, or an additive model with no interactions. Because the
regression planes for the three levels of type are parallel, the typewc and
typeprof coefficients represent not only the differences in intercepts among the
groups but also the constant separation of the regression planes at fixed levels of
income and education .
It is common to compute adjusted means for the levels of a factor in an
analysis of covariance: Adjusted means are simply fitted values at the various
levels of a factor, when the means of the covariates are substituted into the linear
predictor of the estimated model. The adjusted means can be computed by the
effect function in the effects package:8
> library(effects)
> effect("type", prestige.mod.1)
Call:
lm(formula = prestige ~ education + log2(income) + type + log2(income):type +
education:type, data = Prestige)
For the baseline level bc, the intercept is −120.05, the slope for education is
2.34, and the slope for log2(income) is 11.08.
For level wc, the intercept is −120.05 + 30.24 = −89.81, the slope for
log2(income) is 11.08 + (−5.65) = 5.43, and the slope for education is 2.34 + 3.64 = 5.98.
For level prof, the intercept is −120.05 + 85.16 = −34.89, the slope for
log2(income) is 11.08 + (−6.54) = 4.54, and the slope for education is 2.34 + 0.70 = 3.04.
is not equivalent to
prestige ~ education*type + log2(income)*type
[1] 45
The 45 subjects in the experiment interacted with a partner who was of either
relatively low or relatively high status as recorded in the factor partner.status
. In the course of the experiment, the subjects made intrinsically ambiguous
judgments and shared these judgments with their partners. The partners were
confederates of the experimenter, and their judgments were manipulated so that
they disagreed with the subjects on 40 critical trials.
After exchanging initial judgments, the subjects were given the opportunity to
change their minds. The variable conformity records the number of times in
these 40 trials that each subject deferred to his or her partner’s judgment. The
variable fscore is a measure of the subject’s authoritarianism, and fcategory is
a categorized version of this variable, dissecting fscore into thirds, labeled low ,
medium , and high .
Using partner.status and fcategory as factors, Moore and Krupat
performed a two-way ANOVA of conformity .10 We start by reordering the
levels of the factors because the default alphabetical order is not what we want:
> Moore$fcategory <- factor(Moore$fcategory,
+ levels=c("low", "medium", "high"))
> Moore$partner.status <- relevel(Moore$partner.status, ref="low")
Figure 4.3 Conformity by authoritarianism and partner’s status, for Moore and Krupat’s data.
Because we fit the six cell means with six parameters, the reader can verify that
the sums shown in the table are identical to the directly computed cell means,
which were shown previously.
An additive (i.e., main effects only) model can be conveniently produced from
the full, two-way ANOVA model by removing the interactions:
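For example (a sketch; for completeness the full two-way model is refit here under the assumed name mod.moore.1, while mod.moore.2 is the name used for the additive model later in the chapter):
> mod.moore.1 <- lm(conformity ~ fcategory*partner.status, data=Moore)
> mod.moore.2 <- update(mod.moore.1, . ~ . - fcategory:partner.status)
> summary(mod.moore.2)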
Residual standard error: 4.92 on 41 degrees of freedom
Multiple R-squared: 0.179, Adjusted R-squared: 0.118
F-statistic: 2.97 on 3 and 41 DF, p-value: 0.0428
The fitted cell means for the no-interaction model are not in general equal to the
directly calculated sample cell means, because the fitted means are based on
only four parameters:
In problems in which all the predictors are factors and there are no numeric predictors, it is usual for the
analysis to center on examination of the estimated cell means under various
models and on tests summarized in an ANOVA table. We will discuss testing in
Section 4.4.
Plots of cell means in R can be conveniently constructed by the
interaction.plot function. In the graph for Moore and Krupat’s data in Figure
4.4, we also show the data values themselves, and so this graph provides an
alternative visualization of the data to the boxplots in Figure 4.3.
Figure 4.4 Conformity by authoritarianism and partner’s status, for Moore and Krupat’s data.
The interaction.plot command creates the basic graph, with the means
for low partner’s status given by empty circles, pch=1 (where the argument
pch specifies the plotting character to be used), and those for high partner’s
status given by filled circles, pch=16 . Setting ylim=range(conformity)
leaves room in the graph for the data. Setting leg.bty="o" puts a box
around the legend.
The points function adds the data points to the graph, using "L" and "H"
for low- and high-status partners, respectively; jitter adds a small random
quantity to the horizontal coordinates of the points, to avoid overplotting.
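Assembled from this description, the commands might look roughly as follows (a sketch; the exact argument values are assumptions based on the description above):
> with(Moore, {
+     interaction.plot(fcategory, partner.status, conformity, type="b",
+         pch=c(1, 16), ylim=range(conformity), leg.bty="o")
+     points(jitter(as.numeric(fcategory), factor=0.5), conformity,
+         pch=ifelse(partner.status == "low", "L", "H"))
+ })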
Two points in the upper right of Figure 4.3, in the low-status, high-
authoritarianism group, have conformity scores that are much higher than those
of the other six members of their group; we could use the identify function to
find their case numbers. These points were not identified as outliers in the
boxplot of the data, but they jointly increase both the mean and the standard
deviation of conformity in their group.
We can use the cbind function to show the coefficient estimates and confidence
limits simultaneously:
> cbind(Estimate=coef(prestige.mod.1), confint(prestige.mod.1))
4.3.3 EFFECT DISPLAYS FOR LINEAR MODELS
In models with no interactions or transformations, coefficient estimates can
provide a good summary of the dependence of a response on predictors.
Transformations and interactions can make parameters much more difficult to
interpret. Moreover, certain kinds of models, such as those using regression
splines (introduced later in this section), are essentially impossible to understand
directly from the coefficient estimates.
Effect displays (Fox, 1987, 2003) are tables or, more commonly, graphs of
fitted values computed from an estimated model that allow us to see how the
expected response changes with the predictors in the model. Effect displays are
an alternative to interpreting linear models directly from the estimated
coefficients, which is often a difficult task when the structure of the model is
complicated.
We typically use the following strategy to construct effect displays: Identify
the high-order terms in the model—that is, terms that are not marginal to others.
For example, in a model of the form y ~ x*a + x*b , which includes the terms 1
for the regression constant, x , a , b , x:a , and x:b , the high-order terms are the
interactions x:a and x:b . You can imagine for this example that x is a numeric
predictor and a and b are factors, but the essential idea is more general.
To form an effect display, we allow the predictors in a high-order term to range
over the combinations of their values, while the other regressors in the model are
held to typical values. The default for the effect function in the effects package
is to replace each regressor by its mean in the data, which is equivalent to setting
each numerical predictor to its mean and each transformed predictor to the mean
of its transformed values. For factors, this choice of a typical value averages
each of the contrasts that represent the factor, which in a linear model is
equivalent to setting the proportional distribution of the factor to the distribution
observed in the data; this approach gives the same results regardless of the
contrasts used to represent the factor. We proceed to compute the fitted value of
the response for each such combination of regressor values. Because effect
displays are collections of fitted values, it is straightforward to estimate the
standard errors of the effects.
The effect function can produce numeric and graphical effect displays for
many kinds of models, including linear models. For example, we have seen (p.
164) how the effect function can be used to calculate adjusted means for a
factor in an analysis of covariance. Applied to a one-way ANOVA model,
effect recovers the group means. We illustrate with the one-way ANOVA fit to
Baumann and Jones’s data on methods of teaching reading (described in Section
4.2.3), producing Figure 4.5:
> plot(effect("group", baum.mod.1))
The first argument to effect is the quoted name of the effect that we wish to
compute, here "group" ; the second argument is the fitted model object,
baum.mod.1 . The effect function returns an effect object, for which the plot
function has a suitable method. It is also possible to print or summarize effect
objects to produce tables of effects, as we have seen in the case of adjusted
means (p. 164). The broken lines in the effect plot show 95% confidence
intervals around the means, computed using the estimated error variance from
the fitted model—that is, assuming equal within-group variances. Because the
numbers of subjects in the three levels of group are equal, the confidence
intervals are all the same size.
Figure 4.5 The effect display for the one-way ANOVA model fit to the Baumann data set simply recovers
the group means. The broken lines give 95% confidence intervals around the means.
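The next display, in Figure 4.6, shows effect plots for all three predictors in the multiple regression prestige.mod fit in Section 4.2.2; a command along these lines (a sketch reconstructed from the argument descriptions below) produces it:
> plot(allEffects(prestige.mod, default.levels=50), ask=FALSE)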
The argument ask=FALSE causes the plot function to show the graphs for all
three predictors in the same display, while the argument default.levels=50
asks the effect function, which is called by allEffects , to evaluate the effects
at 50 evenly spaced values across the range of each numeric predictor, rather
than the default 10, producing a smoother fitted curve for income . The broken
lines in the panels of the display represent pointwise 95% confidence intervals
around the fits. The rug-plot at the bottom of each panel represents the marginal
distribution of the corresponding predictor.
The graphs in Figure 4.6 are for the predictors in the model, not for the
corresponding regressors. The second plot, therefore, has the predictor income
rather than log2(income) on the horizontal axis, displaying a fitted curve in the
original income scale rather than a straight line in the transformed scale. Both
income and education influence prestige considerably, as is clear from the
wide ranges of the vertical scales of the effect plots, and the small pointwise
confidence envelopes around the fitted lines reflect the relative precision of
estimation of these effects. The response prestige , on the other hand, is not
strongly related to women , as is similarly clear from the vertical axis of the effect
display; the pointwise confidence envelope around the fitted line for women is
wide compared to the slope of the line, and indeed, a horizontal line would fit
inside the confidence band, reflecting the fact that the coefficient for women may
not differ from 0.
Figure 4.6 Effect displays for the predictors in the multiple linear regression of prestige on education,
log income, and women .
In a two-way ANOVA model with interactions, the effect display for the
interaction term recovers the cell means. For an additive model, however, we
obtain adjusted means for each factor. We use the additive model fit to Moore
and Krupat’s conformity data in Section 4.2.3 to illustrate, producing Figure 4.7:
> plot(allEffects(mod.moore.2), ask=FALSE)
Figure 4.8 Effect plot for fcategory:partner.status in the additive model fit to the Moore data.
> plot(effect("fcategory:partner.status", mod.moore.2))
Warning message:
In analyze.model(term, mod, xlevels, default.levels) : fcategory:partner.status does not
appear in the model
The profiles of the means at the two levels of partner.status are parallel,
reflecting the additivity of the two predictors in the model. The effect function
reports a warning because the fcategory:partner.status interaction is not in
the model.
Figure 4.9 Effect displays for the interactions between education and type and between income and type
for the Prestige data.
Consider the more complex model fit to the Prestige data in Section 4.2.3,
with the numeric predictors log2(income) and education and the factor type of
occupation, in which there are interactions between income and type and
between education and type . Here, the high-order terms are the two sets of
interactions (see Figure 4.9):
> plot(allEffects(prestige.mod.2, default.levels=50), ask=FALSE,
+ layout=c(1, 3))
The effect display for the income -by-type interaction, for example, which
appears at the left of Figure 4.9, is computed at the average level of education .
The lattice-graphics layout argument is used to arrange the panels in each effect
display into one column and three rows.13
REGRESSION SPLINES
Displaying fitted models graphically is especially important when the
coefficients of the model don’t have straightforward interpretations. This issue
arises in models with complex interactions, for example, as it does in models
with polynomial regressors. The problem is even more acute in models that use
regression splines.
Although polynomials may be used to model a variety of nonlinear
relationships, polynomial regression is highly nonlocal: Data in one region of the
predictor space can strongly influence the fit in remote regions, preventing the
fitted regression from flexibly tracing trends through the data. One solution to
this problem is nonparametric regression, which doesn’t make strong
assumptions about how the mean of y changes with the xs.14 Another solution is
to use regression splines.
Regression splines are piecewise polynomials fit to nonoverlapping regions
dissecting the range of a numeric predictor. The polynomials—typically cubic
functions—are constrained to join smoothly at the points where the regions
meet, called knots, and further constraints may be placed on the polynomials at
the boundaries of the data. Regression splines often closely reproduce what one
would obtain from a nonparametric regression, but they are fully parametric. The
coefficients of the regression-spline regressors, however, do not have easily
understood interpretations.
The standard splines package in R provides two functions for fitting
regression splines, bs , for so-called B-splines, and ns for natural splines. We
will illustrate the use of regression splines with data from the 1994 Canadian
Survey of Labour and Income Dynamics (the SLID) for the province of Ontario;
the data are in the data frame SLID in the car package:
> some(SLID)
> nrow(SLID)
[1] 7425
The missing values for wages primarily represent respondents to the survey who
have no earnings. We perform a regression of the log of individuals’ composite
hourly wages on their years of education , age , and sex , restricting our
attention to those between age 18 and 65 with at least 6 years of education. The
numeric predictors education and age are modeled using natural splines with 6
df each. Natural splines with 6 df have four knots, dividing each predictor into
five regions.
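A model along these lines could be fit as follows (a sketch; the object name mod.slid and the exact form of the subset expression are assumptions based on the description above):
> library(splines)
> mod.slid <- lm(log(wages) ~ sex + ns(education, df=6) + ns(age, df=6),
+     data=SLID, subset = age >= 18 & age <= 65 & education >= 6)
> summary(mod.slid)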
It is unsurprising in a sample this large that all the terms in the model are highly
statistically significant; the model as a whole accounts for a respectable 37% of
the variation in log(wages) .15
There are two obstacles to interpreting the coefficients in the model summary:
(1) the response variable is expressed on the log scale and (2) the B-spline
regressors do not have a simple description. Instead, we construct effect plots for
the terms in the model (shown in Figure 4.10):
Figure 4.10 Effect displays for the predictors in the SLID regression of log(wages) on sex , education ,
and age , using 6-df B-splines for education and age .
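The display can be produced by a command roughly like the following (a sketch reconstructed from the argument descriptions below; the dummy-regressor name sexMale and the particular xlevels values are assumptions):
> plot(allEffects(mod.slid,
+         xlevels=list(education=6:20, age=18:65),
+         given.values=c(sexMale=0.5),
+         transformation=list(link=log, inverse=exp)),
+     ylab="Composite Hourly Wages", ask=FALSE)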
The xlevels argument to allEffects sets the values of the predictors at which
effects are computed; the default is to use 10 evenly spaced values across the
range of each predictor. The given.values argument fixes the sex dummy
regressor to 0.5 , representing a group composed equally of males and females;
the default is to use the mean of the dummy regressor, which is the proportion of
males in the data. The transformation argument serves to relabel the vertical
axis on the dollar rather than log-dollar scale. Finally, the ylab argument to the
plot function provides a title for the vertical axis. The effect function, called
by allEffects for each term in the model, reports warnings about the
computations. Regressors such as natural splines and orthogonal polynomials
have bases that depend on the data. In cases like this, there can be “slippage” in
computing the effects, as reflected in the warnings.
The effect plots in Figure 4.10 suggest that the partial relationship between
log(wages) and age is nearly quadratic, while the relationship between
log(wages) and education is not far from linear. We therefore fit an alternative
model to the data:
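A sketch of such a refit (the model name and the use of a quadratic polynomial in age are assumptions based on the description above):
> mod.slid.2 <- lm(log(wages) ~ sex + education + poly(age, 2),
+     data=SLID, subset = age >= 18 & age <= 65 & education >= 6)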
Figure 4.11 Effect displays for the predictors in the SLID regression of log(wages) on sex , education ,
and age , using a linear term for education and a quadratic for age .
[1] 261
This data set has two predictors, t1 and t2 , the number of transactions of two
types performed by the branches of a large bank, to account for the response
variable, time , the total minutes of labor in the bank.
For brevity, we skip the crucial step of drawing graphs of the data and begin
by regressing time on t1 and t2 :
> summary(trans.mod <- lm(time ~ t1 + t2, data=Transact))
Call:
lm(formula = time ~ t1 + t2, data = Transact)
In our example, the sandwich estimate of the coefficient covariance matrix has
substantially larger diagonal entries than the usual estimate.
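To see the difference, one can compare the two sets of coefficient standard errors directly (a sketch, with the car package loaded; hccm uses its default HC3 correction here):
> cbind(OLS=sqrt(diag(vcov(trans.mod))),
+     Sandwich=sqrt(diag(hccm(trans.mod))))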
Approximate Wald t tests for the coefficients can be obtained using the
coeftest function in the lmtest package:
> library(lmtest)
> coeftest(trans.mod, vcov=hccm)
t test of coefficients:
The values of the t statistics are reduced by about 40% for the slopes and 15%
for the intercept compared to the t values based on the constant variance
assumption. The sandwich estimate of the coefficient covariance matrix can be
used with functions in the car package including deltaMethod (described in
Section 4.4.6), linearHypothesis (Section 4.4.5), and Anova (Section 4.4.4).
1. Refit the regression with modified data, obtained by sampling from the rows
of the original data with replacement. As a consequence, some of the rows
in the data will be sampled several times and some not at all. Compute and
save summary statistics of interest from this bootstrap sample.
2. Repeat Step 1 a large number of times, say B = 999.
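A plain-loop sketch of this case-resampling scheme, written out for illustration with the transaction regression:
> set.seed(12345)                      # arbitrary seed, for reproducibility
> B <- 999
> boot.coefs <- matrix(0, B, length(coef(trans.mod)))
> colnames(boot.coefs) <- names(coef(trans.mod))
> for (b in 1:B) {
+     rows <- sample(nrow(Transact), replace=TRUE)   # Step 1: resample cases
+     boot.coefs[b, ] <- coef(update(trans.mod, data=Transact[rows, ]))
+ }
> apply(boot.coefs, 2, sd)             # bootstrap standard errors of the coefficients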
The outer ellipse in this graph is a 95% joint confidence region for the
population regression coefficients β1 and β2: With repeated sampling, 95% of
such ellipses will simultaneously include β1 and β2, if the fitted model is correct
and normality holds. The orientation of the ellipse reflects the negative
correlation between the estimates. Contrast the 95% confidence ellipse with the
marginal 95% confidence intervals, also shown on the plot. Some points within
the marginal intervals—with larger values for both of the coefficients, for
example—are implausible according to the joint region. Similarly, the joint
region includes values of the coefficient for income , for example, that are
excluded from the marginal interval. The inner ellipse, generated with a
confidence level of 85% and termed the confidence-interval-generating ellipse,
has perpendicular shadows on the parameter axes that correspond to the marginal
95% confidence intervals for the coefficients.
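A joint confidence ellipse like the one in Figure 4.12a can be drawn with the confidenceEllipse function in the car package; a sketch (the model is refit here under an assumed name, and the confidence levels follow the description above):
> mod.duncan <- lm(prestige ~ income + education, data=Duncan)
> confidenceEllipse(mod.duncan, levels=c(0.95, 0.85))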
In addition to confidence ellipses for estimates, we can also plot data ellipses.
When variables have a multivariate-normal distribution, data ellipses and
ellipsoids represent estimated probability contours, containing expected fractions
of the data. The dataEllipse function in the car package draws data ellipses for
a pair of variables. We illustrate with income and education in Duncan ’s
occupational-prestige data:
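A sketch of such a call (the particular levels follow the description below; interactive identification of points is optional):
> with(Duncan, dataEllipse(income, education,
+     levels=c(0.5, 0.75, 0.9, 0.95)))
> with(Duncan, identify(income, education, rownames(Duncan)))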
The result can be seen in Figure 4.12b. The contours are set to enclose 50%,
75%, 90%, and 95% of bivariate-normal data. Three observations identified with
the mouse—representing ministers, railroad conductors, and railroad engineers
—are outside the 95% normal contour. Recall that to exit from the identify
function in Windows you must right-click and select Stop; in Mac OS X, press
the esc key.
The 95% ellipses in the two panels of Figure 4.12 differ in shape by only a 90°
rotation, because the data ellipse is based on the sample covariance matrix of the
predictors, while the confidence ellipse is based on the sample covariance matrix
of the slope coefficients, which is proportional to the inverse of the sample
covariance matrix of the predictors.
Figure 4.12 (a) 95% (outer) and 85% (inner) joint confidence ellipses for the coefficients of income and
education in Duncan’s regression of prestige on these predictors, augmented by the marginal 95%
confidence intervals (broken lines). (b) 50%, 75%, 90%, and 95% data ellipses for income and education .
In (b), three observations were identified interactively with the mouse.
tval.repwt pval.repwt
−3.4280342 0.0007535
Once again, the null hypothesis is clearly rejected. For a test such as this of a
single coefficient, the F statistic is just t2 from the corresponding Wald test.
The standard anova function is a generic function, with methods for many
classes of statistical models.21 In the next chapter, we will use the anova function
to obtain likelihood ratio chi-square tests for GLMs.
The first two of these hypothesis tests are generally not of interest, because the
test for education fails to control for log2(income) and type , while the test for
log2(income) fails to control for type . The third test does correspond to a
sensible hypothesis, for equality of intercepts of the three levels of type —that
is, no effect of type controlling for education and log2(income) .
A more subtle difference between the sequential tests produced by the anova
function applied to a single model and the incremental F test produced when
anova is applied to a pair of nested models has to do with the manner in which
the error variance σ2 in the denominator of the F statistics is estimated: When we
apply anova to a pair of nested models, the error variance is estimated from the
residual sum of squares and degrees of freedom for the larger of the two models,
which corresponds to the alternative hypothesis. In contrast, all the F statistics in
the sequential table produced when anova is applied to a single model are based
on the estimated error variance from the full model—prestige ~ education +
log2(income) + type in our example—which is the largest model in the
sequence. Although this model includes terms that are extraneous to (i.e.,
excluded from the alternative hypothesis for) all but the last test in the sequence,
it still provides an unbiased estimate of the error variance for all the tests. It is
traditional in formulating an ANOVA table to base the estimated error variance
on the largest model fit to the data.
Response: prestige
In each case, we get a test for adding one of the predictors to a model that
includes all the others. The last of the three tests is identical to the third test from
the sequential ANOVA, but the other two tests are different and are more likely
to be sensible. Because the Type II tests for log2(income) and education are
tests of terms that are each represented by a single coefficient, the reported F
statistics are equal to the squares of the corresponding t statistics in the model
summary.
Tests for models with factors and interactions are generally summarized in an
ANOVA table. We recommend using the Type II tests computed by default by
the Anova function: for example,
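a call of this form (a sketch; prestige.mod.2 is the interaction model for the Prestige data fit earlier, and the car package is assumed to be loaded):
> Anova(prestige.mod.2)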
Type II ANOVA obeys the marginality principle (Nelder, 1977), summarized in
Table 4.1 for the example. All the tests compare two models. For example, the
test for log2(income) compares a smaller model consisting of all terms that do
not involve log2(income) with a model that includes all of these plus
log2(income). The general principle is that the test for a lower-order term, for
example, a main effect (i.e., a separate, partial effect), such as log2(income) , is
never computed after fitting a higher-order term, such as an interaction, that
includes the lower-order term—as log2(income):type includes log2(income).
Table 4.1 Type II tests performed by the Anova function for the model prestige ~ education*type +
log2(income)*type.
The error variance for all the tests is estimated from the full model—that is,
the largest model fit to the data. As we mentioned in connection with the anova function, although it would
also be correct to estimate the error variance for each test from the larger model for that test, using the
largest model produces an unbiased estimate of σ2 even when it includes extraneous terms.
If the regressors for different terms in a linear model are mutually orthogonal
(i.e., uncorrelated), then Type I and Type II tests are identical. When the
regressors are not orthogonal, then Type II tests address more generally sensible
hypotheses than Type I tests.
UNBALANCED ANOVA
How to formulate hypotheses, contrasts, and sums of squares in unbalanced
two-way and higher-way ANOVA is the subject of a great deal of controversy
and confusion. For balanced data, with all cell counts equal, none of these
difficulties arise. This is not the place to disentangle the issue, but we will
nevertheless make the following brief points:
To get sensible Type III tests, we should change the factor coding, which
requires refitting the model; one approach is as follows:
> contrasts(Moore$fcategory) <-
+ contrasts(Moore$partner.status) <- "contr.sum"
> moore.mod.1 <- update(moore.mod)
> Anova(moore.mod.1, type="III")
In this instance, the Type II and Type III tests produce similar results.23 The
reader may wish to verify that repeating the analysis with Helmert contrasts
(contr.helmert ) gives the same Type III results, while the output using the
default contr.treatment is different.
Before proceeding, we return the contrasts for fcategory and
partner.status to their default values:
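One way to do so is simply to assign the default coding back (a sketch):
> contrasts(Moore$fcategory) <-
+     contrasts(Moore$partner.status) <- "contr.treatment"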
E(y|X) = Xβ
or, equivalently,
y = Xβ + ε
where y is an n × 1 vector containing the response; X is an n × (k + 1) model
matrix, the first column of which usually contains 1s; β is a (k + 1) × 1 parameter
vector including the intercept; and ε is an n × 1 vector of errors. Assuming that X
is of full column rank, the least-squares regression coefficients are
b = (X′X)⁻¹X′y
All the hypotheses described in this chapter, and others that we have not
discussed, can be tested as general linear hypotheses of the form H0: Lβ = c,
where L is a q × (k + 1) hypothesis matrix of rank q containing prespecified
constants and c is a prespecified q × 1 vector, most often containing 0s. Under
H0, the test statistic
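is, written out from the definitions above (a standard form, stated here for reference),

$$ F_0 = \frac{(Lb - c)'\,[\,L(X'X)^{-1}L'\,]^{-1}\,(Lb - c)}{q\,s^2} $$

where s² is the estimated error variance from the fitted model; when the hypothesis and the model are correct, F0 follows an F distribution with q and n − k − 1 degrees of freedom. In R, such tests are carried out by the linearHypothesis function in the car package.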
Hypothesis:
t1 − t2 = 0
Although the F statistic for the test is now much smaller, the hypothesis is still
rejected, even without the assumption of constant variance.
If individuals are unbiased reporters of their weight, then the intercept should be
0 and the slope 1; we can test these values simultaneously as a linear hypothesis:
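Using the matrix form of linearHypothesis, a sketch of such a call is:
> linearHypothesis(davis.mod, diag(2), c(0, 1))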
The hypothesis matrix L is just an order-two identity matrix, constructed in R by
diag(2), while the right-hand-side vector is c = (0, 1)′. Even though the
regression coefficients are closer to 0 and 1 when Observation 12 is omitted, the
hypothesis of unbiased reporting is acceptable for the original data set but not for
the corrected data because the estimated error variance is much smaller when the
outlying Observation 12 is deleted, producing a more powerful test.
The linearHypothesis function also has a more convenient interface for
specifying hypotheses using the names of coefficients. Here are alternative but
equivalent ways of specifying the hypotheses considered above for the
transaction and Davis regressions:
> linearHypothesis(trans.mod, "t1 = t2")
> linearHypothesis(davis.mod, c("(Intercept) = 0", "repwt = 1"))
For a hypothesis such as this one, which includes the intercept, we must write
(Intercept) (i.e., within parentheses), which is the name of the intercept in the
coefficient vector.
The confidence interval for the ratio, and the justification for using the delta
method in general, depend on the large-sample normality of the estimated
regression coefficients. The method, therefore, may not be reliable.
We can compare the solution from the delta method with the solution from the
bootstrap for this example by reusing the bootstrap samples that we computed
earlier for the transaction regression (in Section 4.3.7):
The delta method is optimistic in this example, giving an estimated standard
error that is too small and therefore a confidence interval that is too short. The
problem here is that the assumption of constant error variance is doubtful;
consequently, the estimated variances of the regression coefficients are too
small, and so the estimated variance of their ratio is also too small.
The deltaMethod function has an optional argument, vcov , which can be set
either to a function that, when applied to the model object, returns a coefficient
covariance matrix or to a covariance matrix itself. If we use a heteroscedasticity-
consistent estimate of the coefficient covariance matrix (provided by the hccm
function in the car package, and discussed in Section 4.3.6), we get the
following:
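A sketch of such a call, assuming the quantity of interest is the ratio of the two transaction coefficients:
> deltaMethod(trans.mod, "t1/t2", vcov=hccm)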
which gives an estimated standard error much closer to the bootstrap solution.
The second argument to deltaMethod is a quoted expression giving the
function of parameters that is to be estimated. In all the examples in this section,
we have not, strictly speaking, used the names of the parameters, but rather, we
have used the names of the corresponding regressors. The convenience of using
regressor names in place of parameter names extends to GLMs (discussed in the
next chapter). We can, however, use the parameter names, if we wish—where b0
is the name of the intercept and the remaining parameters are named b1 , b2 , …
—in the same order as the parameters appear in the model. Thus, the call to
deltaMethod for the transactions data could have been written as
deltaMethod(trans.mod, "b1/b2") . This latter definition of parameter names
is more general in that it applies to nonlinear models and other models in R
where there isn’t a simple one-to-one correspondence between parameters and
regressors.
[1] 397
Before looking at the salaries, we examine the numbers of male and female
faculty in the college by discipline and rank :
> ftable(x1 <- xtabs(~ discipline + rank + sex, data=Salaries))
A formula determines the horizontal and vertical axes of the graph. The variables
to the right of the | (vertical bar) define the different panels of the plot, and so
we will have six panels, one for each combination of discipline and rank . The
groups argument gives a grouping variable within each panel, using different
symbols for males and females. The type argument specifies printing a grid ("g"
), showing the individual points ("p" ), and displaying a least-squares regression
line ("r" ) for each group in each panel of the plot. Finally, the auto.key
argument prints the legend at the top of the plot.
Figure 4.13 The college salary data.
The specification rot=90 in the scales argument rotates the tick-mark labels for
the horizontal and vertical axes by 90°. The panels in the graph are organized so
that the comparisons of interest, between males and females of the same
discipline and rank, are closest to each other. Relatively small differences
between males and females are apparent in the graphs, with males generally a bit
higher in salary than females. Discipline effects are clearer in this graph, as is the
relatively large variation in male professors’ salaries in both disciplines.
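Assembled from this description, the lattice call might look roughly like this (a sketch; the formula, the panel order, and several argument values are assumptions):
> library(lattice)
> xyplot(salary ~ yrs.since.phd | discipline + rank, groups=sex,
+     data=Salaries, type=c("g", "p", "r"), auto.key=TRUE,
+     scales=list(rot=90))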
We now turn to the problem of assessing the difference between males’ and
females’ salaries. In the area of wage discrimination, it is traditional to assess
differences using the following paradigm:
[1] −4551
We first define a selector variable that has the value TRUE for females and FALSE
for males. We then fit a regression to the males only, allowing for interactions
between rank and discipline and an additive yrs.since.phd effect, which we
expect will be small in light of the earlier graphical analysis. We use the generic
function predict to get predictions from the fitted regression. The newdata
argument tells the function to obtain predictions only for the females. If we had
omitted the second argument, then predictions would have been returned for the
data used in fitting the model, in this case the males only, which is not what we
want. We finally compute the mean difference between the observed and
predicted salaries for the females. The average female salary is therefore $4,551
less than the amount predicted from the fitted model for males.
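A sketch of this computation (the object names other than meanDiff, which is used below, are assumptions):
> female <- with(Salaries, sex == "Female")     # selector: TRUE for females
> mod.males <- lm(salary ~ rank*discipline + yrs.since.phd,
+     data=Salaries, subset=!female)            # fit to the males only
> pred.females <- predict(mod.males, newdata=Salaries[female, ])
> (meanDiff <- mean(Salaries$salary[female] - pred.females))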
A question of interest is whether or not this difference of −$4,551 is due to
chance alone. Many of the usual ideas of significance testing are not obviously
applicable here. First, the data form a complete population, not a sample, and so
we are not inferring from a sample to a larger population. Nevertheless, because
we are interested in drawing conclusions about the social process that generated
the data, statistical inference is at least arguably sensible. Second, a sex effect
can be added to the model in many ways, and these could all lead to different
conclusions. Third, the assumptions of the linear-regression model are unlikely
to hold; for example, we noted that professors are more variable than others in
salary .
Here is a way to judge statistical significance based on simulation: We will
compare the mean difference for the real minority group, the 39 females, with
the mean difference we would have obtained had we nominated 39 of the faculty
selected at random to be the “minority group.” We will compute the mean
difference between the actual and predicted salaries for these 39 “random
females” and then repeat the process a large number of times, say B = 999:
We first compute the number of females and the number of faculty members. We
employ a for loop (Section 8.3) to perform the calculation repeatedly. Within the
loop, the sample function is used to assign fnumber randomly selected faculty
members to be “females”; the regression is updated without the randomly chosen
“females”; the predictions are obtained; and the average difference between
actual and predicted salary is saved.
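A sketch of the simulation loop (the seed is arbitrary; the names fnumber and simDiff match those used in the text):
> set.seed(101)                          # arbitrary seed
> fnumber <- sum(female)                 # number of females
> n <- nrow(Salaries)                    # number of faculty members
> B <- 999
> simDiff <- numeric(B)
> for (i in 1:B) {
+     sel <- sample(n, fnumber)          # indices of the random "females"
+     m <- update(mod.males, subset=-sel)   # refit without them
+     simDiff[i] <- mean(Salaries$salary[sel] -
+         predict(m, newdata=Salaries[sel, ]))
+ }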
There are many ways to summarize the simulation; we have used a histogram
(see Figure 4.15):
> (frac <- round(sum(meanDiff > simDiff)/(1 + B), 3))
0.102
Figure 4.15 Histogram of simulated sex differences for the college salary data. The vertical line is drawn at
the observed mean difference of −4551.
If this paradigm is accepted, then all that remains is, first, to choose the subsets
to examine, and, second, to choose a criterion to optimize.
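The backward-elimination results discussed next come from a call of roughly this form (a sketch; highway.mod is the full regression model for the Highway data fit earlier, and the exact scope used is an assumption):
> highway.backward <- step(highway.mod, direction="backward")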
This first segment of the output shows the result of dropping each predictor in
turn from the current regression model. If the predictor is a factor, then step will
drop as a group all the regressors created from the factor.25 The output provides
the name of the term that is dropped, the change in degrees of freedom, the sum
of squares explained by the dropped term, the residual sum of squares for each
subset-model, and finally the value of a criterion statistic to be used to compare
models. The default criterion is the Akaike information criterion, or AIC.26
Provided that at least one of the subset models has a value of AIC less than that
for the model with all the predictors, marked <none> in the output, the predictor
that corresponds to the smallest value of AIC is dropped, in this case shld .
The next section of output from step is similar to the first section, except that
the now-current model has shld removed:
In the second step, therefore, itg will be removed.
We continue in this fashion, at each step either deleting the predictor that
corresponds to the model with the smallest AIC or stopping if all prospective
deletions increase AIC. We omit several steps, showing only the final two:
The step function stops at this last step because deleting any of the four
remaining predictors would increase the AIC. The backward-elimination
algorithm examined only 49 of the possible 1,023 subset models before
stopping. Because we made an assignment in the call to step shown above,
highway.backward is the final model, selected by backward elimination to
minimize the AIC:
We can compare the fit of the full and subset models by looking at the
corresponding fitted values, as in Figure 4.16:
Figure 4.16 Comparison of fitted values for the Highway data with all the regressors and with the
regressors selected by backward stepwise fitting.
> plot(fitted(highway.mod) ~ fitted(highway.backward))
> abline(0, 1)
> cor(fitted(highway.mod), fitted(highway.backward))
[1] 0.9898
The change in the fitted values is relatively small, and the two sets of fitted
values have correlation .99; we conclude that the subset model and the full
model provide essentially the same information about the value of the response
given the predictors.
With forward selection, specifying direction="forward" in the step
command, we start with the smallest model that we are willing to entertain:
> highway.mod0 <- update(highway.mod, . ~ log2(len))
and then tell step which terms to consider adding to the model using the scope
argument. The step function employs add1 to add predictors to the model one at
a time:
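A sketch of such a call (the scope specification is an assumption):
> highway.forward <- step(highway.mod0, direction="forward", trace=0,
+     scope=list(lower=formula(highway.mod0), upper=formula(highway.mod)))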
We used the argument trace=0 to suppress printing the steps in full. The forward
and backward algorithms suggest different subset models, but an examination of
the fitted values for these models demonstrates their general predictive
similarity: for example,
> AIC(highway.mod)
[1] 47.07
> AIC(highway.backward)
[1] 37.96
> AIC(highway.forward)
[1] 43.73
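The current settings are stored in the global contrasts option; a sketch of how to display them:
> getOption("contrasts")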
unordered ordered
"contr.treatment" "contr.poly"
Two values are provided by the contrasts option, one for unordered factors
and the other for ordered factors, which have levels that are ordered from
smallest to largest. Each entry corresponds to the name of a function that
converts a factor into an appropriate set of contrasts; thus, the default function
for defining contrasts for an unordered factor is contr.treatment .
We can see how the contrasts for a factor are coded by using the contrasts
function:
> with(Prestige, contrasts(type))
The result shows compactly the coding discussed in Section 4.6.1: The first level
of type is coded 0 for both dummy regressors; the second level is coded 1 for
the first and 0 for the second regressor; and the third level is coded 1 for the
second and 0 for the first regressor. As previously explained, it is possible to
change the baseline level using the relevel function:
> Prestige$type1 <- relevel(Prestige$type, ref="prof")
This command creates a new factor with prof as the baseline level:
> with(Prestige, contrasts(type1))
Because dosage has three distinct values—1, 4, and 8—we could treat it as a
factor, coding regressors to represent it in a linear model. It wouldn’t be
appropriate to use contr.poly here to generate the regressors, however, because
the levels of dosage aren’t equally spaced; moreover, because the data are
unbalanced, the regressors created by contr.poly would be correlated. The poly
function will generate orthogonal-polynomial regressors for dosage :
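A sketch of what such a call might look like, assuming dosage is numeric here (not converted to a factor as in the earlier hypothetical):
> with(Drug, poly(dosage, degree=2))   # orthogonal linear and quadratic regressors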
A very simple way to proceed (though not the only way) is to make the
columns of C mutually orthogonal. Then, the rows of [1, C]⁻¹ will be
proportional to the corresponding columns of [1, C], and we can directly code
the contrasts of interest among the means in the columns of C.
None of this requires that the factor have equal numbers of observations at its
several levels, but if these counts are equal, as in the Baumann data set, then not
only are the columns of C orthogonal, but the columns of the model matrix X
constructed from C are orthogonal as well. Under these circumstances, we can
partition the regression sum of squares for the model into 1-df components due
to each contrast.
For the Baumann and Jones data, the two contrasts of interest are (1) Basal
versus the average of DRTA and Strat and (2) DRTA versus Strat :
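One way to set these up (a sketch; the particular contrast matrix is ours, with columns corresponding to the two comparisons just described, and the refit reuses the model from earlier in the chapter):
> C <- cbind(c(2, -1, -1), c(0, 1, -1))   # rows follow the levels of group (Basal, DRTA, Strat)
> colnames(C) <- c("Basal.vs.New", "DRTA.vs.Strat")
> contrasts(Baumann$group) <- C
> summary(update(baum.mod.1))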
Call:
lm(formula = post.test.3 ~ group, data = Baumann)
Residual standard error: 6.31 on 63 degrees of freedom
Multiple R-squared: 0.125, Adjusted R-squared: 0.0967
F-statistic: 4.48 on 2 and 63 DF, p-value: 0.0152
The t statistics for the contrasts test the two hypotheses of interest, and so we
have strong evidence that the new methods are superior to the old but little
evidence of a difference in efficacy between the two new methods. The overall F
test, and other similar summaries, are unaffected by the choice of contrasts as
long as we use d − 1 of them and the C matrix has full column rank d − 1.
User-specified contrasts may also be used for factors in more complex linear
models, including multi-factor models with interactions.
4.7.1 NO INTERCEPT
The rule that a factor with d levels requires d − 1 regressors may not hold if
the model formula does not include an intercept:
> summary(prestige.mod.4 <- update(prestige.mod.1, . ~ . - 1))
Call:
lm(formula = prestige ~ education + log2(income) + type - 1, data = na.omit(Prestige))
In this parametrization, the test for the factor type has 3 df, rather than 2 df, and tests the probably uninteresting hypothesis that all
three intercepts are equal to 0 rather than the more interesting hypothesis that
they are all equal to each other. We suggest that you generally avoid leaving off
the intercept in a linear model unless you have a specific reason for doing so,
and then are very careful to interpret the coefficients and the results of the
statistical tests correctly.
4.7.2 SINGULARITY
In some instances, we may end up trying to fit regressors that are exact linear
combinations of each other in a model. This can happen by accident or due to
confusion, for example, by including scores on subtests and the total score of all
the subtests as predictors. In other instances, two predictors that are really the
same quantity, such as height in centimeters and height in inches, might be
included accidentally.
Probably the most common situation producing singularity that isn’t simply an
error in specifying the model is fitting a model with two or more factors,
including main effects and all interactions among the factors, but with at least
one of the cells in the cross-classification of the factors empty. Suppose, for
example, that we have two factors, A and B , with a and b levels, respectively.
Then the model y ~ A + B + A:B creates one regressor for the intercept, a − 1
regressors for the main effects of A , b − 1 regressors for the main effects of B ,
and ( a − 1) ( b − 1) regressors for the A:B interactions—that is, a × b regressors
in all. Furthermore, suppose that one cell in the A ×B table is empty, so that there
are only ( a × b) −1 rather than a × b observed cell means. The model now has
one too many regressors—one more than there are observed cell means. It is
only the last regressor that will cause a problem, however, and we should still be
able to contrast the additive model y ~ A + B with the model that includes
interactions, y ~ A + B + A:B , as long as we remove the redundant interaction
regressor.
As an example, we return to the Moore and Krupat data in Moore , discussed
in Section 4.2.3. We remove all seven observations with fcategory = high and
partner.status = high to create an empty cell:
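A sketch of one way to do this (mod.moore.3 is the name used below; the name of the full two-way model and the exact subsetting are assumptions):
> mod.moore.3 <- update(mod.moore.1,
+     subset = !(fcategory == "high" & partner.status == "high"))
> summary(mod.moore.3)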
With the empty cell, there is only 1 df remaining for the interaction, rather than 2
df , although the output still shows two coefficients. The second interaction
coefficient, however, is given a value of NA, as lm recognizes that this coefficient,
which is said to be aliased, cannot be estimated. Interpretation of the remaining
coefficients with empty cells present depends on the contrasts used to define the
factors and the order of terms in the model, and is therefore not straightforward.
Some functions, such as coef , applied to a model with a singularity return a
coefficient vector with the NA s included. Others, such as anova and Anova ,
correctly recognize the singularity and do the right thing, adjusting degrees of
freedom for the redundant coefficients:
> Anova(mod.moore.3)
These are the correct Type II tests. With empty cells, Type III tests, however, are
very hard to justify, and they are consequently not produced by Anova .
The lm function has several additional useful arguments, and some of the
arguments that we discussed have uses that were not mentioned. The args
function prints out the arguments to lm :
> args(lm)
function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x =
FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)
NULL
4.8.1 formula
A formula for lm consists of a left-hand side, specifying the response variable,
and a right-hand side, specifying the terms in the model; the two sides of the
formula are separated by a tilde (~ ). We read the formula a ~ b as “a is modeled
as b ,” or “a is regressed on b .”
The left-hand side of the formula can be any valid R expression that evaluates
to a numeric vector of the appropriate length. On the left side of the formula, the
arithmetic operators, - , + , * , / , and ^ , have their usual meanings, and we can
call whatever functions are appropriate to our purpose. For example, with
reference to Moore and Krupat’s data, we could replace the number of
conforming responses by the percentage of conforming responses,
> lm(100*conformity/40 ~ partner.status*fcategory, data=Moore)
or (using the logit function in the car package) by the log-odds of conformity,
> lm(logit(conformity/40) ~ partner.status*fcategory, data=Moore)
The right-hand side of the model formula may include factors and expressions
that evaluate to numeric vectors and matrices. Because several operators have
special meaning in formulas, arithmetic expressions that use them have to be
protected. Expressions are protected automatically when they are inside function
calls: For example, the + in the term log(a + b) has its usual arithmetic
meaning on the right-hand side of a model formula, even though a + b does not
when it is unprotected.
The identity function I may be used to protect arithmetic expressions in model
formulas. For example, to regress prestige on the sum of education and
income in Duncan’s data set, thus implicitly forcing the coefficients of these two
predictors to be equal, we may write,
> lm(prestige ~ I(income + education), data=Duncan)
We have already described most of the special operators that appear on the
right of linear-model formulas. In Table 4.2 (adapted from Chambers, 1992, p.
29), A and B represent elements in a linear model: numeric vectors, matrices,
factors, or expressions (such as a + b or a*b ) composed from these.
Table 4.2 Expressions used in R formulas.
Thus, a/x fits an intercept; two dummy variables for a , with a = "A" as the
baseline level; and a separate slope for x within each of the levels of a . In
contrast, x:a , or equivalently x %in% a (or, indeed, a:x or a %in% x ), fits
a common intercept and a separate slope for x within each level of a .
The ^ operator builds crossed effects up to the order given in the exponent.
Thus, the example in the table, (a + b + c)^2 , expands to all main effects
and pairwise interactions among a , b , and c : that is, a + b + c + a:b +
a:c + b:c . This is equivalent to another example in the table, a*b*c −
a:b:c .
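Such expansions can be checked directly; for example (a quick illustration of ours, using a bare formula so that no data are needed):
> attr(terms(~ (a + b + c)^2), "term.labels")
[1] "a"   "b"   "c"   "a:b" "a:c" "b:c"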
The intercept, represented by 1 in model formulas, is included in the model
unless it is explicitly excluded, by specifying -1 or 0 in the formula.
4.8.2 data
The data argument ordinarily specifies a data frame for use in fitting the
model. If the data argument is omitted, then data are retrieved from the global
environment, and so objects will be found in the normal manner along the search
path, such as in an attached data frame. We find explicitly setting the data
argument to be a sound general practice (as explained in Section 2.2.2).
4.8.3 subset
The subset argument is used to fit a model to a subset of observations.
Several forms are possible: most commonly a logical vector that selects the
cases to retain, but a vector of case indices to include, or of negative
indices to exclude, may also be supplied. A logical vector is used as in the
sketch below.
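A minimal sketch, using Duncan's occupational-prestige data from the car package; the particular model and the excluded indices are illustrative only:
> lm(prestige ~ income + education, data=Duncan, subset = type != "prof")
> lm(prestige ~ income + education, data=Duncan, subset = -c(6, 16))   # drop cases 6 and 16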
4.8.4 weights
If we loosen the constant-variance assumption in Equation 4.2 (p. 150) to allow
the error variance to differ by case, say Var(εi) = σ2/wi for known positive
weights wi, then supplying the wi through the weights argument produces a
weighted-least-squares (WLS) fit.
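A small simulated sketch of a WLS fit; all names and values here are illustrative:
> set.seed(123)
> x <- 1:50
> w <- runif(50, 0.5, 2)                    # known positive weights
> y <- 2 + 3*x + rnorm(50, sd=1/sqrt(w))    # error SD proportional to 1/sqrt(w)
> lm(y ~ x, weights=w)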
4.8.7 singular.ok*
Under normal circumstances, R builds a full-rank model matrix, removing
redundant dummy regressors, for example. Under some circumstances, however
—perfect collinearity, for example, or when there is an empty cell in an ANOVA
—the model matrix may be of deficient rank, and not all the coefficients in the
linear model will be estimable. If singular.ok is TRUE , which is the default,
then R will fit the model anyway, setting the aliased coefficients of the redundant
regressors to NA (see Section 4.7.2).
4.8.8 contrasts
This argument allows us to specify contrasts for factors in a linear model, in
the form of a list with named elements: for example,
> lm(conformity ~ partner.status*fcategory, data=Moore,
+ contrasts=list(partner.status=contr.sum, fcategory=contr.poly))
4.8.9 offset
An offset is a term added to the right-hand side of a model with no associated
parameter to be estimated—it implicitly has a fixed coefficient of 1. In a linear
model, specifying a variable as an offset is equivalent to subtracting that variable
from the response. Offsets are more useful in GLMs (discussed in Chapter 5)
than in linear models.
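A brief simulated sketch of the equivalence just noted; the names and values are illustrative:
> set.seed(456)
> x <- rnorm(100); z <- rnorm(100)
> y <- 1 + 2*x + z + rnorm(100)
> m1 <- lm(y ~ x, offset=z)       # coefficient of z fixed at 1
> m2 <- lm(I(y - z) ~ x)          # subtract the offset from the response
> all.equal(coef(m1), coef(m2))   # the two fits have identical coefficients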
The material in this chapter is covered in detail in Fox (2008, Part II) and in
Weisberg (2005, chaps. 2–6).
Model selection, briefly introduced in Section 4.5, is discussed in Fox
(2008, sec. 13.2.2) and Weisberg (2005, chap. 10). The stepwise methods
presented in the Companion may be viewed as rather old-fashioned, with
newer methodology such as penalized fitting and tree methods now
common, particularly in the related field of data mining. For these latter
methods, see Berk (2008) and Hastie et al. (2009).
1How missing data are represented in R is described in Section 2.2.3. When there is more than a small
proportion of missing data, simply ignoring missing values can be seriously problematic. One principled
approach to missing data, multiple imputation, is described in the online appendix to the text.
2If you are unfamiliar with vector notation, simply think of x = (x1, … , xk) as the collection of regressors.
3Alternatives to using the data argument are discussed in Section 2.2. The full set of arguments for lm is
described in Section 4.8.
4In this Companion, we indicate an estimated regression coefficient by replacing a Greek letter with the
corresponding Roman letter, as in b1 for β1.
5See Section 2.6 for a general discussion of coercion from one type of data to another.
6As explained in Section 2.5.2, we can alter the default behavior of read.table via its colClasses
argument.
7There are many other ways of producing this test: See, for example, Section 4.4.5 on testing linear
hypotheses.
8The more general use of the effects package for displaying fitted linear models is developed in Section
4.3.3.
9The model prestige ~ education:type + log2(income):type fits a common intercept for the three
occupational groups and different education and log2(income) slopes for the groups, not a generally
sensible specification; see Section 4.8.1.
10Actually, Moore and Krupat categorized authoritarianism separately within each level of partner’s status.
The results we present here are similar to theirs, but our procedure is more defensible.
The reader may want to consider variations on the analysis: Using fscore , the quantitative version of
authoritarianism, in place of the factor fcategory produces a dummy-variable regression. Because
conformity is a disguised proportion (i.e., the number of conforming responses out of 40), a logit or similar
transformation of conformity /40 might be tried. See Section 3.4.3 on transforming restricted-range
variables.
11This graph can be drawn more elegantly using the bwplot function in the lattice package; see Section
7.3.1.
12See ?interaction.plot , ?points , and ?jitter for more information on each of the functions used in
this example. The construction of graphs in R is described in detail in Chapter 7.
13The layout argument reverses the usual convention for the order of rows and columns. There is more
information about lattice graphics in Section 7.3.1.
14Nonparametric regression in R is described in the online appendix to the text.
15The interactions between sex and education and between sex and age are also statistically significant
but relatively small, increasing the R2 for the model by only about 1%. The general use of the Anova
function for testing linear models is described in Section 4.4.4.
16hccm is an acronym for heteroscedasticity-corrected covariance matrix.
17The use of the boot package is described in the online appendix to the text.
18The bootCase function can be used with most regression models in R, but it may not be appropriate in all
instances; see ?bootCase .
19More complicated rules for computing bootstrap confidence intervals that are somewhat better behaved
in practice are discussed in the online appendix to the text.
20See Section 2.2.3 on handling missing data in R.
21Generic functions and their methods are discussed in Sections 1.4 and 8.7.
22* The contrasts produced by contr.sum , contr.helmert , and contr.poly for different factors are
orthogonal in the row-basis of the model matrix, but those produced by contr.treatment are not.
23The Type III ANOVA table produced by Anova also includes a test that the intercept is 0. This hypothesis
is not usually of interest.
24See Chapter 7 for more on graphics in R.
25The step function described in this section will not violate the marginality principle and thus will not
drop a low-order term if one of its higher-order relatives is included in the model. In the current example,
however, the starting model is additive and therefore does not include terms related by marginality.
26The criterion can be changed with the argument k to step ; for example, to use the Bayesian information
criterion (BIC), we would set k to loge( n). See the help page for step and the references in the
Complementary Reading for details.
27See ?subsets for an example and the Complementary Reading for further discussion.
28The observant reader will notice, however, that the R2s for the two models differ: When the intercept is
suppressed, R calculates R2 based on variation around 0 rather than around the mean of the response,
producing a statistic that does not generally have a sensible interpretation.
5 Fitting Generalized Linear Models
Table 5.2 Canonical or default link, response range, and conditional variance function for generalized-
linear-model families. φ is the dispersion parameter, equal to one in the cases where it is not shown; μ = μ(
x) is the conditional mean of y given the values of the predictors; in the binomial family, N is the number of
trials.
Standard link functions and their inverses are shown in Table 5.1. The identity
link function simply maps the linear predictor to itself, and so it is generally
appropriate only for a distribution like the normal that supports positive and
negative values. The next four link functions can be used only with responses
that are nonnegative, like the gamma distribution or the Poisson. The remaining
three links require the mean to be constrained to ( 0, 1), as is appropriate for
dichotomous or binomial data.
GLMs are typically fit to data by the method of maximum likelihood, using
the iteratively weighted least squares procedure outlined in Section 5.12. Denote
the maximum-likelihood estimates of the regression parameters as b0, b1, … , bk,
and the estimated value of the linear predictor as η̂(x) = b0 + b1x1 + … + bkxk.
The estimated mean of the response is μ̂(x) = g−1[η̂(x)].
The variance of distributions in an exponential family is the product of a
positive dispersion (or scale) parameter φ and a function of the mean given the
linear predictor:

Var(y|x) = φ V[μ(x)]
The variances for the several exponential families are shown in the last column
of Table 5.2. For the binomial and Poisson distributions, the dispersion
parameter φ = 1, and so the variance depends only on μ. For the Gaussian
distribution, V[μ( x)] = 1, and the variance depends only on the dispersion
parameter φ. For Gaussian data, it is usual to replace φ by σ2, as we have done for
linear models in Chapter 4. Only the Gaussian family has constant variance, and
in all other GLMs, the conditional variance of y at x depends on μ( x).
The deviance, based on the maximized value of the log-likelihood, provides a
measure of the fit of a GLM to the data, much as the residual sum of squares
does for a linear model. We write p [y; μ( x), φ] for the probability-mass or
probability-density function of a single response y given the predictors x. Then
the value of the log-likelihood evaluated at the maximum-likelihood estimates of
the regression coefficients for fixed dispersion is logL = Σi loge p[yi; μ̂(xi), φ].
The largest log-likelihood attainable for the data, logLmax, belongs to the
saturated model, which devotes one parameter to each observation and therefore
reproduces the observed responses exactly. The residual deviance is defined as
twice the difference between these two log-likelihoods,

D(y; μ̂) = 2(logLmax − logL)
Because the saturated model must fit the data at least as well as any other
model, the deviance is never negative. The larger the deviance, the less well the
model of interest matches the data. In families with a known value of the
dispersion parameter φ, such as the binomial and Poisson families, the deviance
provides a basis for testing lack of fit of the model and for other tests that
compare different specifications of the linear predictor. If φ is estimated from the
data, then the scaled deviance D(y; μ̂)/φ̂ is the basis for hypothesis tests. The
degrees of freedom associated with the residual deviance are equal to the number
of observations n minus the number of estimated regression parameters,
including the intercept if it is in the linear predictor.
Table 5.3 Family generators and corresponding link functions for the glm function. Default links are shown
as •, other possible links as ◦.
Most GLMs in R are fit with the glm function. The most important arguments of
glm are formula , family , data , and subset : As for the lm function discussed
in Chapter 4, the response variable and predictors are given in the model
formula , and the data and subset arguments determine the data to which the
model is fit. The family argument is new to glm , and it supplies a family-
generator function, which provides the random component of the model;
additional optional arguments (usually just a link argument— see below) to the
family-generator function specify the link function for the model.
The family-generator functions for the five standard exponential families are
given in Table 5.2. All family names start with lowercase letters, except for
Gamma , which is capitalized to avoid confusion with the gamma function in R.
Each family has its own canonical link, which is used by default if a link isn’t
given explicitly; in most cases, the canonical link is a reasonable choice. Also
shown in the table are the range of the response and the variance function for
each family.
Table 5.3 displays the links available for each family-generator function.
Nondefault links are selected via a link argument to the family-generator
functions: for example, binomial(link=probit) . The quasi , quasibinomial ,
and quasipoisson family generators do not correspond to exponential families;
these family generators are described in Section 5.10. If no family argument is
supplied to glm , then the gaussian family with the identity link is assumed,
resulting in a fit identical to that of lm , albeit computed less efficiently—like
using a sledgehammer to set a tack.
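To illustrate the point with simulated data (a sketch; the names are arbitrary), a default call to glm reproduces the least-squares fit from lm :
> set.seed(789)
> x <- rnorm(100); y <- 1 + 2*x + rnorm(100)
> coef(lm(y ~ x))
> coef(glm(y ~ x))                                    # gaussian family, identity link, by default
> coef(glm(y ~ x, family=gaussian(link="identity")))  # the same, written out explicitly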
We begin by considering data in which each case provides a binary response, say
“success” or “failure”; the cases are independent; and the probability of success
μ( x) is the same for all cases with the same values x of the regressors. In R, the
response may be either a variable or an R expression that evaluates to 0 (failure)
or 1 (success); a logical variable or expression (with TRUE representing success
and FALSE , failure); or a factor, in which case the first category is taken to
represent failure and the others success. We will consider the more general
binomial responses, where the response is the number of successes in one or
more trials, in the next section.
Figure 5.1 Comparison of logit, probit, and complementary log-log links. The probit link is rescaled to
match the variance of the logistic distribution, π2/3.
When the response is binary, we think of the mean function, μ( x), as the
conditional probability that the response is a success given the values x of the
regressors. The most common link function used with binary-response data is the
logit link, for which

loge{ μ(x) / [1 − μ(x)] } = η(x)     (5.1)

The quantity on the left of Equation 5.1 is called the logit or the log-odds, where
the odds is the probability of success divided by the probability of failure.
Solving for μ(x) gives the mean function,

μ(x) = 1 / {1 + exp[−η(x)]}
Other link functions, which are used less often, include the probit and the
complementary log-log links. These three links are drawn as functions of the
linear predictor η (x) in Figure 5.1. The logit and probit links are very similar,
except in the extreme tails, which aren’t well resolved in a graph of the link
functions, while the complementary log-log has a different shape and is
asymmetric, approaching μ( x) = 1 more abruptly than μ( x) = 0.
The binomial model with the logit link is often called the logistic-regression
model because the inverse of the logit link (see Table 5.1) is the logistic function.
The name logit regression or logit model is also used.
Table 5.4 Variables in the Mroz data set.
The only features that differentiate this command from fitting a linear model are
the change of function from lm to glm and the addition of the family argument.
The family argument is set to the family-generator function binomial . The first
argument to glm , the model formula , specifies the linear predictor for the
logistic regression, not the mean function directly, as it did in linear regression.
Because the link function is not given explicitly, the default logit link is used; the
command is therefore equivalent to
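a call of the following form, sketched here with the link given explicitly (the variable names are those of the Mroz data set in the car package, and the model name mroz.mod is an assumption carried through later sketches):
> library(car)   # for the Mroz data
> mroz.mod <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc,
+     family=binomial(link=logit), data=Mroz)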
The model summary for a logistic regression is very similar to that for a linear
regression:
The Wald tests, given by the ratio of the coefficient estimates to their standard
errors, are now labeled as z value s because the large-sample reference
distribution for the tests is the normal distribution, not the t distribution as in a
linear model. The dispersion parameter, φ = 1, for the binomial family is noted in
the output. Additional output includes the null deviance and degrees of freedom,
which are for a model with all parameters apart from the intercept set to 0; the
residual deviance and degrees of freedom for the model actually fit to the data;
and the AIC, an alternative measure of fit sometimes used for model selection
(see Section 4.5). Finally, the number of iterations required to obtain the
maximum-likelihood estimates is printed.3
Written on the odds scale, the estimated model has left-hand side μ̂(x) / [1 − μ̂(x)],
the fitted odds of success, that is, the fitted probability of success divided by the fitted
probability of failure. Exponentiating the model removes the logarithms and
changes it from a model that is additive in the log-odds scale to one that is
multiplicative in the odds scale. For example, increasing the age of a woman by
1 year, holding the other predictors constant, multiplies the odds of her being in
the workforce by exp( b3) = exp( −0.06287) = 0.9391—that is, reduces the odds
of her working by 6%. The exponentials of the coefficient estimates are
generally called risk factors (or odds ratios), and they can be viewed all at once,
along with their confidence intervals, by the command
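sketched here under the assumption that the fitted model object is the mroz.mod of the earlier sketch:
> round(exp(cbind(Estimate=coef(mroz.mod), confint(mroz.mod))), 2)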
Compared with a woman who did not attend college, for example, a college-
educated woman with all other predictors the same has odds of working about
2.24 times higher, with 95% confidence interval 1.43 to 3.54.
The confint function provides confidence intervals for GLMs based on
profiling the log-likelihood rather than on the Wald statistics used for linear
models (Venables and Ripley, 2002, sec. 8.4). Confidence intervals for GLMs
based on the log-likelihood take longer to compute but tend to be more accurate
than those based on the Wald statistic. Even before exponentiation, the log-
likelihood-based confidence intervals need not be symmetric about the estimated
coefficients.
The test statistic is the change in deviance between the two fitted models, and
the p value is computed by comparing this value with the chi-square distribution
with df equal to the change in the degrees of freedom for the two models. For the
example, the change in deviance is 66.5 with 2 df , reflecting the two regressors
removed from the model; when compared with the χ2( 2) distribution, we get a p
value that is effectively 0. That the probability of a woman’s participation in the
labor force depends on the number of children she has is, of course,
unsurprising. Because this test is based on deviances rather than variances, the
output is called an analysis of deviance table.
Applied to GLMs, the anova function does not by default compute any
significance tests. To obtain likelihood ratio chi-square tests for binary-
regression models and other GLMs, we have to include the argument test=
"Chisq" . For GLMs with a dispersion parameter estimated from the data
(discussed, for example, in Section 5.10.4), specifying test="F" produces F
tests in the analysis-of-deviance table.
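For example, the likelihood ratio test described above, which removes the two children regressors, might be computed as in this sketch (continuing with the hypothetical mroz.mod; the dropped variable names are those of the Mroz data set):
> mroz.mod.2 <- update(mroz.mod, . ~ . - k5 - k618)   # remove the children regressors
> anova(mroz.mod.2, mroz.mod, test="Chisq")           # analysis-of-deviance comparison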
As with linear models (see Section 4.4.3), the anova function can be used to
compute a Type I or sequential analysis-of-deviance table, and as in linear
models, these tables are rarely useful. We instead recommend using the Type II
tests described in the next section.
Each line of the analysis-of-deviance table provides a likelihood ratio test based
on the change in deviance when comparing the two models. These tests are
analogous to the corresponding Type II tests for a linear model (Section 4.4.4).
For an additive model in which all terms have 1 df —as is the case in the current
example—the Type II likelihood ratio statistics test the same hypotheses tested
by the Wald statistics in the summary output for the model. Unlike in linear
models, however, the Wald tests are not identical to the corresponding likelihood
ratio tests. Although they are asymptotically equivalent, in some circumstances
—but not in our example—the two approaches to testing can produce quite
different significance levels. The likelihood ratio test is generally more reliable
than the Wald test.
The Anova function is considerably more flexible than is described here,
including options to compute Type II Wald tests that are equivalent to the z tests
from the summary output; to compute F tests for models with an estimated
dispersion parameter; and to compute Type III tests (details are provided in
Section 5.10.1).
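A sketch of these options, again using the hypothetical mroz.mod:
> Anova(mroz.mod)                          # Type II likelihood ratio tests (the default)
> Anova(mroz.mod, test.statistic="Wald")   # Type II Wald tests, equivalent to the z tests
> Anova(mroz.mod, type="III")              # Type III tests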
Fitted values on the scale of the response can also be obtained with the fitted
function.
Table 5.5 Voter turnout by perceived closeness of the election and intensity of partisan preference, for the
1956 U.S. presidential election. Frequency counts are shown in the body of the table. Source: Campbell et
al. (1960, Table 5-3).
We use the rep command to simplify entering data that follow a pattern (see
Section 2.1.1) and then collect the data into a data frame. We verify that the data
are correct by comparing the data frame with Table 5.5:
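The data-entry commands themselves are not reproduced above; the following sketch merely illustrates the kind of patterned input that rep makes convenient (the level names are illustrative, and the frequency counts would be copied from Table 5.5):
> closeness  <- factor(rep(c("one.sided", "close"), each=3),
+     levels=c("one.sided", "close"))
> preference <- factor(rep(c("weak", "medium", "strong"), 2),
+     levels=c("weak", "medium", "strong"))
> data.frame(closeness, preference)   # the skeleton of the data frame, without the counts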
The final variable in the data frame, and the last column of Table 5.5, show
the sample logit, loge( Voted/Did Not Vote), for each combination of categories
of the predictors. These logits are computed and saved so that they can be
graphed in Figure 5.2, much as one would graph cell means in a two-way
ANOVA when there are two factors as predictors and the response variable is
quantitative. Voter turnout appears to increase with intensity of preference but
much more dramatically when the election is perceived to be close than when it
is perceived to be one-sided.
Figure 5.2 was drawn using the following commands:
We use the par function to set global graphical parameters, in this case allowing
extra room at the right for an additional axis. By assigning the result of par to a
variable, we can call par again to reset the margins to their original values. An
alternative is simply to close the plot device window before creating a new
graph.4 The interaction.plot function (which we previously encountered in
Section 4.2.3) does most of the work. It is designed to work with two factors:
The first is put on the horizontal axis, and the second defines a grouping
variable. The mean value of the third variable is plotted for each combination of
the two factors. In this case, there is no mean to compute— there is only one
observed logit per combination—so we changed the label on the y-axis to reflect
this fact. The argument type="b" asks interaction.plot to plot b oth lines and
points; cex=2 (c haracter ex pansion) draws the points double-size; and
pch=c(1, 16) specifies the p lotting ch aracters 1 (an open circle) and 16 (a
filled circle) for the two levels of closeness . We used the probabilityAxis
function from the car package to draw a right-side probability axis, which is a
nonlinear transformation of the scale of the linear predictor, the logit scale,
shown at the left.
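A sketch of commands along the lines described above; the data frame Campbell and the variables closeness and preference appear in the text, while the name of the logit variable and the exact plotting details are assumptions:
> oldpar <- par(mar=c(5.1, 4.1, 4.1, 4.1))   # extra room at the right for a second axis
> with(Campbell, interaction.plot(preference, closeness, logit,
+     type="b", pch=c(1, 16), cex=2, ylab="log(odds of voting)"))
> # a right-side probability axis can then be added with probabilityAxis() from the car package
> par(oldpar)                                # restore the original margins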
Figure 5.2 Voter turnout by perceived closeness of the election and intensity of partisan preference.
and all residuals equal to 0. Consequently, the residual deviance is also 0 within
rounding error.
The no-interaction model corresponds to parallel profiles of logits in the
population. We can test for interaction either with the anova function, comparing
the saturated and no-interaction models, or with the Anova function:
The test for interaction is the same for both commands. The alternative
hypothesis for this test is the saturated model with 0 residual deviance, and so
the likelihood ratio test statistic could also be computed as the residual deviance
for the fit of the no-interaction model:
This last approach doesn’t automatically provide a significance level for the test,
but this could be obtained from the pchisq function.
The Anova command also provides Type II tests for the main effects, but if we
judge the interaction to be different from zero, then these tests, which ignore the
interaction, should not be interpreted.5
Rather than fitting a binomial logit model to the contingency table, we could
alternatively have fit a binary logit model to the 1,275 individual observations
comprising Campbell et al.’s data. Let’s generate the data in that form.
Manipulating data frames in this manner is often complicated, even in relatively
small examples like this one:
We abbreviated the names of the original predictors to distinguish the variables
in the new data set from those in Campbell , and now check that we have
correctly generated the data by rebuilding the contingency table for the three
variables:
The xtabs function creates the three-way contingency table, and then ftable
flattens the table for printing.
We proceed to fit a binary logistic-regression model to the newly generated
Campbell.long data:
As is apparent from this example, the two approaches give identical
coefficient estimates and standard errors, and identical values of tests based on
differences in residual deviance. The residual deviance itself, however, is
different for the two models, because the binomial-logit model campbell.mod is
fit to six cells while the binary-logit model campbell.mod.long is fit to 1,275
individual observations. The saturated model fit to the data summarized in the
six cells is not the saturated model fit to the 1,275 individual observations.
When, as here, the predictors are discrete and thus divide the data into groups,
there is an advantage in fitting the binomial-logit model: The residual deviance
for any unsaturated model, such as the no-interaction model in the example, can
be interpreted as a test of lack of fit. We can equivalently test lack of fit for a
binary-logit model with discrete predictors by formulating a model with all the
interactions among the predictors treated as factors—in the example, the model
with the closeness:preference interactions and the main effects marginal to
them—and using that as a point of comparison.
5.5 Poisson GLMs for Count Data
Poisson GLMs arise in two distinct contexts. The first, covered in this section, is
more straightforward, in which the conditional distribution of the response
variable given the predictors follows a Poisson distribution. The second,
presented in the next section, is the use of loglinear models for analyzing the
associations in contingency tables. In most instances, the cell counts in
contingency tables have multinomial, not Poisson, conditional distributions, but
it turns out that with appropriate interpretation of parameters, the multinomial
maximum-likelihood estimators can be obtained as if the counts were Poisson
random variables. Thus, the same Poisson-GLM approach can be used for fitting
Poisson-regression models and loglinear models for contingency tables.
The default link for the poisson family generator is the log link; all the
models discussed in this section and the next use the log link.
Recall Ornstein’s data (Ornstein, 1976) on interlocking director and top
executive positions in 248 major Canadian firms (introduced in Chapter 3):
Figure 5.3 Distribution of number of interlocks maintained by 248 large Canadian corporations.
[1] 248
The numbers on top of the frequencies are the different values of interlocks :
Thus, there are 28 firms with 0 interlocks, 19 with 1 interlock, 14 with 2
interlocks, and so on. The graph is produced by plotting the counts in tab against
the category labels converted into numbers:
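A sketch of commands along these lines; tab is the name used in the text, and the plotting details are assumptions:
> tab <- xtabs(~ interlocks, data=Ornstein)   # frequency of each value of interlocks
> x <- as.numeric(names(tab))                 # category labels converted into numbers
> plot(x, tab, type="h", xlab="Number of Interlocks", ylab="Frequency")
> points(x, tab, pch=16)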
The Type II analysis of deviance, produced by the Anova function in the car
package, shows that all three predictors have highly statistically significant
effects.
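The commands that fit the model and compute these tests are not reproduced above; a sketch consistent with the model summary shown later in the chapter:
> library(car)   # for the Ornstein data and the Anova function
> mod.ornstein <- glm(interlocks ~ log2(assets) + nation + sector,
+     family=poisson, data=Ornstein)
> Anova(mod.ornstein)   # Type II analysis of deviance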
The coefficients of the model are interpreted as effects on the log-count scale
(the scale of the linear predictor), and consequently exponentiating the
coefficients produces multiplicative effects on the count scale (the scale of the
response):
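For example, continuing with mod.ornstein:
> round(exp(coef(mod.ornstein)), 3)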
Figure 5.4 Effect plots for the terms in the Poisson-regression model fit to Ornstein’s interlocking-
directorate data. The broken lines show point-wise 95% confidence envelopes around the fitted effects.
We thus estimate, for example, that doubling assets (i.e., increasing the log2 of
assets by 1), holding the other predictors constant, multiplies the expected
number of interlocks by 1.37 (i.e., increases expected interlocks by 37%).
Likewise, when compared to a similar Canadian firm, which is the baseline level
for the factor nation , a U.S. firm on average maintains only 46% as many
interlocks.
We can also use the effects package (introduced for linear models in Section
4.3.3) to visualize the terms in a GLM, such as a Poisson regression. For the
model fit to Ornstein’s data:
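A sketch of the effect display described below; allEffects is supplied by the effects package, and the default.levels argument is the one mentioned in the text:
> library(effects)
> plot(allEffects(mod.ornstein, default.levels=50))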
By default, the vertical axes in these graphs are on the scale of the linear
predictor, which is the log-count scale for Ornstein’s Poisson regression. The
axis tick marks are labeled on the scale of the response, however—number of
interlocks , in our example. Also by default, the range of the vertical axis is
different in the several graphs, a feature of the graphs to which we should attend
in assessing the relative impact of the predictors on interlocks .7 The argument
default.levels=50 produces a smoother plot, setting assets to 50 distinct
values across its range rather than the default 10; the rug-plot at the bottom of
the effect display for assets shows the marginal distribution of the predictor.
The levels of the factor sector are ordered alphabetically; the effect plot for this
factor would be easier to read if we rearranged the levels so that they were in the
same order as the effects, as happened accidentally for the factor nation . The
plot for assets is curved because the model used log2(assets) as the regressor.
About 95% of the values of assets are less than 20,000 (corresponding to $20
billion), so the right portion of this plot may be suspect.
type of institution, a factor in which Levels I(Pu) and I(Pr) are math
departments in high-quality public and private universities, Levels II and
III are math departments in progressively lower-quality universities, Level
IV represents statistics and biostatistics departments, and Level Va refers to
applied mathematics departments.
sex of the degree recipient, with levels Female and Male .
citizen , the citizenship of the degree recipient, a factor with levels US and
Non-US .
count , the number of individuals for each combination of type , sex , and
citizen —and, consequently, for each combination of type and class .
The data in AMSsurvey have one row for each of the cells in the table, thus 6
× 2 × 2 = 24 rows in all.
For example, 260 female PhD recipients were non-U.S. citizens in this period.
The usual analysis of a two-dimensional table such as this consists of a test of
independence of the row and column classifications:
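A sketch of this test; the construction of the two-way table from the sex , citizen , and count variables is an assumption about the commands not reproduced above (the AMSsurvey data are in the car package):
> (tab.sex.citizen <- xtabs(count ~ sex + citizen, data=AMSsurvey))
> chisq.test(tab.sex.citizen, correct=FALSE)
> # chisq.test(tab.sex.citizen, simulate.p.value=TRUE) estimates the p value by simulation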
The test statistic is the uncorrected Pearson’s X2, defined by the familiar formula
X2 = Σ (O − E)2/E, where the sum is over all cells of the table, O are the observed cell counts, and E
are the estimated expected cell counts computed under the assumption that the
null hypothesis of independence is true. Approximate significance levels are
determined by comparing the value of X 2 to a χ2 distribution with degrees of
freedom depending on the number of cells and the number of parameters
estimated under the null hypothesis. For an r × c table, the degrees of freedom
for the test of independence are ( r − 1) ( c − 1)—here, ( 2 − 1)(2 − 1) = 1 df .
The chisq.test function can also estimate the significance level using a
simulation, an approach that is preferred when the cell counts are small. Using
the default argument correct=TRUE would produce a corrected version of X 2
that is also more accurate in small samples.
The p value close to .1 suggests at best weak evidence that the proportion of
women is different for citizens and non-citizens, or, equivalently, that the
proportion of non-citizens is different for males and females.
A loglinear model can also be fit to the two-way contingency table by
assuming that the cell counts are independent Poisson random variables. First,
we need to change the two-way table into a data frame:
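A sketch of this step and of the independence model; the object names are assumptions, and data.frame applied to the two-way table yields columns sex , citizen , and Freq :
> AMS.2way <- data.frame(tab.sex.citizen)
> ams.indep <- glm(Freq ~ sex + citizen, family=poisson, data=AMS.2way)
> summary(ams.indep)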
The test based on the deviance is a likelihood ratio test of the hypothesis of
independence, while Pearson’s X 2 is a score test of the same hypothesis. The
two tests are asymptotically equivalent, but in small samples, Pearson’s chi-
square can give more accurate inferences. In general, however, the change in
deviance is preferred because it is more useful for comparing models other than
the saturated model. To compute Pearson’s X 2 for any GLM fit, use
[1] 0.9236
Examining the coefficient estimates, we see that the fraction of male doctorates is similar
in Type I(Pu) and Type I(Pr), because the coefficient for Type I(Pr) that
compares these two groups is small with a relatively large p value. The other
coefficients in the type by sex interaction are all negative and large relative to
their standard errors, so male doctorates are less frequent in institutions other
than Type I(Pu) and Type I(Pr). Similarly, U.S. citizens are less frequent in Type
IV, statistics and biostatistics programs, than in the other types.
5.6.3 SAMPLING PLANS FOR LOGLINEAR MODELS
An interesting characteristic of the AMSsurvey data is that the 2008–2009
doctorate recipients were simply classified according to the three variables.
Assuming that the count in each of the cells has a Poisson distribution leads
immediately to the Poisson-regression model that we used.
Not all contingency tables are constructed in this way, however. Consider this
thought experiment: We will collect data to learn about the characteristics of
customers who use human tellers in banks to make transactions. We will study
the transactions between the tellers and the customers, classifying each
transaction by the factor age of the customer, A , with, say, three age groups; the
factor gender, B , either male or female; and the factor C , whether or not the
transaction was simple enough to be done at an ATM rather than with a human
teller. Each of the following sampling plans leads to a different distribution for
the counts, but Poisson regression can legitimately be used to estimate
parameters and perform tests in all these cases:
Poisson sampling Go to the bank for a fixed period of time, observe the
transactions, and classify them according to the three variables. In this case,
even the sample size n is random. Poisson-regression models are
appropriate here, and all parameters are potentially of interest.
Multinomial sampling Fix the sample size n in advance, and sample as long as
necessary to get n transactions. This scheme differs from the first sampling
plan only slightly: The counts are no longer independent Poisson random
variables, because their sum is constrained to equal the fixed sample size n,
and so the counts follow a multinomial distribution. Poisson regression can
be used, but the overall mean parameter (i.e., the intercept of the Poisson-
regression model) is determined by the sampling plan.
Fix two variables Sample a fixed number in each age × sex combination to get a
different multinomial sampling scheme. All models fit to the data must
include the terms 1 + A + B + A:B = A*B , because these are all fixed by
the sampling design.
To turn this data set into a form that can be used to fit a loglinear model, we need
to stack the variables voted and did.not.vote into one column to create the
response factor, an operation that can be performed conveniently with the melt
function in the reshape package (which, as its name implies, provides facilities
for rearranging data):
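The melt command itself is not reproduced above. The following sketch substitutes the newer reshape2 package for the original reshape package named in the text; the name of the resulting data frame is an assumption:
> library(reshape2)
> Campbell.long2 <- melt(Campbell, id.vars=c("closeness", "preference"),
+     measure.vars=c("voted", "did.not.vote"), variable.name="turnout")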
The first argument in the call to melt is the name of a data frame, and the second
argument gives the names of the ID variables that are duplicated in each row of
the new data frame. The measure variables are stacked into a column and given
the variable name value . The new data frame has one row for each of the 12
cells in the 2 × 3 × 2 contingency table.
Because we are interested in finding the important interactions with the
response, we start by fitting the saturated model and examining tests for various
terms:
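A sketch of the saturated loglinear model; the data frame follows the reshape2 sketch above, with the cell counts in the value column, and the model name is an assumption:
> campbell.loglin <- glm(value ~ closeness*preference*turnout,
+     family=poisson, data=Campbell.long2)
> Anova(campbell.loglin)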
The only terms of interest are those that include the response variable turnout ,
and all three of these terms, including the highest-order term
closeness:preference:turnout , have small p values.
As long as a loglinear model with a dichotomous response variable includes
the highest-order interaction among the predictors, here closeness:preference
, and obeys the principle of marginality, the loglinear model is equivalent to a
logistic-regression model. All the parameter estimates in the logistic regression
also appear in the loglinear model, but they are labeled differently. For example,
the closeness main effect in the logistic regression corresponds to the
closeness:turnout term in the loglinear model; similarly, the
closeness:preference interaction in the logistic regression corresponds to
closeness:preference:turnout in the loglinear model. Likelihood ratio tests
for corresponding terms are identical for the logistic-regression and loglinear
models, as the reader can verify for the example (cf., page 243). The only
important difference is that the residual deviance for the loglinear model
provides a goodness-of-fit test for that model, but it is not a goodness-of-fit test
for the logistic regression.11
[1] 397
We created the three-way table with xtabs , saved the result in the variable tab1
, and as before applied ftable to flatten the table for printing.
We turn the table into a data frame with one row for each cell to use with glm :
> (Salaries1 <- data.frame(tab1))
The data frame Womenlf in the car package contains data drawn from a social
survey of the Canadian population conducted in 1977. The data are for n = 263
married women between the ages of 21 and 30:
> some(Womenlf)
> nrow(Womenlf)
[1] 263
Then, for j = 2, … , m,
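The command that fits the multinomial-logit model is not reproduced above; a sketch consistent with the output described below (multinom is provided by the nnet package, and the model name mod.multinom appears in the text):
> library(nnet)
> mod.multinom <- multinom(partic ~ hincome + children + region, data=Womenlf)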
As a side effect, the call to multinom prints a summary of the iteration history
required to find the estimates. A Type II analysis of deviance for this model,
computed by the Anova function in the car package, shows that husband’s
income and presence of children both have highly statistically significant effects
but that the region term is nonsignificant:
> Anova(mod.multinom)
Response: partic
We simplify the model by removing the region effects and summarize the
resulting model:
> mod.multinom.1 <- update(mod.multinom, . ~ . - region)
Call:
multinom(formula = partic ~ hincome + children, data = Womenlf)
Residual Deviance: 422.9
AIC: 434.9
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
hincome:children does not appear in the model
The probability of full-time work decreases with hincome and decreases when
children are present, while the pattern is the opposite for not.working . Part-
time work is intermediate and is not strongly related to the two predictors.
Figure 5.6 Fitted probabilities of working full-time, part-time, and not working outside the home by
husband’s income and presence of children, from the multinomial-logit model fit to the Canadian women’s
labor-force data.
The first of these new variables, working , simply compares the not.work
category to a combination of the two working categories:
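One way to construct this dichotomy (a sketch; the text's own recoding commands are not reproduced here, and the 0/1 coding, with 1 representing working, is a choice made for simplicity):
> Womenlf$working <- with(Womenlf, ifelse(partic == "not.work", 0, 1))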
> xtabs(~ partic + working, data=Womenlf)
The second new variable—comparing part-time with full-time work— excludes
those who are not working:
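A corresponding sketch for the second dichotomy, coded 1 for full-time, 0 for part-time, and NA for those not working, so that they are excluded from the second model:
> Womenlf$fulltime <- with(Womenlf,
+     ifelse(partic == "fulltime", 1, ifelse(partic == "parttime", 0, NA)))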
If the response had more than three levels, we would continue to subdivide
compound levels such as working until only elementary levels (here, not.work ,
parttime , and fulltime ) remain. In general, we require m − 1 nested
dichotomies to represent the m levels of the response factor. In the current
example, we therefore use two dichotomies for the three levels of partic . We
will fit two binary logistic-regression models, the first with working as the
response and the second with fulltime as the response. Because the second
model considers only a subset of the cases, the combined models for the nested
dichotomies are not equivalent to the multinomial model that we fit in the
preceding section.
Because of their method of construction, models fit to different dichotomies
are statistically independent. This means, for example, that we can add deviances
and residual df over the models and can combine the models to get fitted
probabilities for the several categories of the polytomy.
For our example, the two models for the nested dichotomies are as follows:
> mod.working <- glm(working ~ hincome + children + region,
+ family=binomial, data=Womenlf)
> summary(mod.working)
Call:
glm(formula = working ~ hincome + children + region,
family = binomial, data = Womenlf)
Deviance Residuals:
Min 1Q Median 3Q Max
−1.793 −0.883 −0.728 0.956 2.007
Coefficients:
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 356.15  on 262  degrees of freedom
Residual deviance: 317.30  on 256  degrees of freedom
AIC: 331.3
Call:
glm(formula = fulltime ~ hincome + children + region,
family = binomial, data = Womenlf)
    Null deviance: 144.34  on 107  degrees of freedom
Residual deviance: 101.84  on 101  degrees of freedom
  (155 observations deleted due to missingness)
AIC: 115.8
> Anova(mod.working)
> Anova(mod.fulltime)
The predictor hincome ranges from 1 to 45, and the predictor children has the
values absent or present . For drawing graphs, we need fitted values for
combinations of the predictors covering their ranges; we use the expand.grid
function to produce combinations of hincome and children :
> (Predictors <- expand.grid(hincome=1:45,
+ children=c("absent", "present")))
We then employ the predict function, with its newdata argument set to the data
frame Predictors , to obtain fitted values for all combinations of the two
predictors:
Recall that specifying the argument type="response" to predict yields fitted
values on the probability scale, and that the default, type="link" , produces
fitted values on the logit scale. The fitted values for the fulltime dichotomy are
conditional on working outside the home; we multiply by the probability of
working to produce unconditional fitted probabilities of working full-time. The
unconditional probability of working part-time is found similarly; and the
probability of not working outside the home is calculated as the complement of
the probability of working.
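A sketch of this computation, with model and data frame names as above; because Predictors contains only hincome and children , the region term is assumed to have been dropped from both models first:
> mod.working.1  <- update(mod.working,  . ~ . - region)
> mod.fulltime.1 <- update(mod.fulltime, . ~ . - region)
> p.work     <- predict(mod.working.1,  newdata=Predictors, type="response")
> p.fulltime <- predict(mod.fulltime.1, newdata=Predictors, type="response")
> p.full <- p.work * p.fulltime         # unconditional probability of full-time work
> p.part <- p.work * (1 - p.fulltime)   # unconditional probability of part-time work
> p.not  <- 1 - p.work                  # probability of not working outside the home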
Figure 5.7 Fitted probabilities from binary-logit models fit to nested dichotomies for the Canadian women’s
labor force data.
So as not to clutter the graph, we use the mfrow plot parameter to create two
panels: one for children absent and the other for children present. The result is
shown in Figure 5.7:13
The legend in the graph is positioned using the "topright" argument, and
inset=0.01 insets the legend by 1% from the edges of the plot.
Even though there are only three categories in the polytomy, there is more
than one way of forming nested dichotomies. For example, we could define the
alternative dichotomies: {full-time versus part-time and not working} and {part-
time versus not working}. Models for alternative sets of nested dichotomies are
not equivalent, and so this approach should only be used when there is a
substantively compelling resolution of the polytomy into a specific set of nested
dichotomies. In some areas, notably education, continuation-ratio logits (e.g.,
Agresti, 2002, sec. 7.4.3) are commonly used—for example, {less than high
school versus some high school or more}, {incomplete high school versus high
school graduate or more}, {high school graduate versus some post secondary or
more}, and so on.
If ξ were observable and we knew the distribution of the errors ε, then we could
estimate the regression coefficients using standard methods. Rather than
observing ξ, however, we instead observe another variable, y, derived from ξ. In
our imaginary example, the observable variable might be the grade the student
received in the course. Grade is an ordinal variable consisting of the ordered
categories F, D, C, B, and A, from lowest to highest, corresponding
consecutively to y = 1, 2, 3, 4, 5. In general, y is an ordinal variable with m levels
1, … , m, such that y = j when αj−1 < ξ ≤ αj (taking α0 = −∞ and αm = ∞),
where α1 < … < αm−1 are the thresholds between each level of y and the next.
Thus y is a crudely measured version of ξ. The dissection of ξ into m levels
introduces m − 1 additional parameters into the model.
The cumulative probability distribution of y is then given by Pr(y ≤ j) = Pr(ξ ≤ αj), for j = 1, … , m − 1.
Figure 5.8 The proportional-odds model: Cumulative probabilities, Pr(y > j), plotted against the linear
predictor, η, for a four-category ordered response.
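The command fitting the proportional-odds model is not shown above; a sketch consistent with the Call output that follows (polr is in the MASS package):
> library(MASS)
> mod.polr <- polr(partic ~ hincome + children, data=Womenlf)
> # partic should have its levels ordered from not.work through parttime to fulltime
> summary(mod.polr)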
Call:
polr(formula = partic ~ hincome + children, data = Womenlf)
The AIC for the proportional-odds model (449.7) is quite a bit larger than for the
multinomial-logit model fit earlier (434.9), casting doubt on the assumption of
proportional odds. A rough analysis of deviance yields a p value of .00008,
suggesting the inadequacy of the proportional-odds model:
> pchisq(deviance(mod.polr) − deviance(mod.multinom.1),
+ df = 6 − 4, lower.tail=FALSE)
[1] 8.351e−05
The fit of the proportional-odds model, shown in Figure 5.9, is also quite
different from that of the multinomial-logit model (Figure 5.6) and the nested-
dichotomies model (Figure 5.7):
> plot(effect("hincome*children", mod.polr))
Warning message:
In analyze.model(term, mod, xlevels, default.levels) :
hincome:children does not appear in the model
Alternative displays are shown in Figure 5.10, using stacked-area plots, and in
Figure 5.11, by plotting the estimated latent response:
> plot(effect("hincome*children", mod.polr), style="stacked")
5.10 Extensions
Figure 5.10 Estimated probabilities, shown as stacked areas, from the proportional-odds model fit to the
Canadian women’s labor-force data.
An example for which gamma errors are likely to be useful is the Transact
data, introduced in Section 4.3.6. Errors in these data probably increase with the
size of the response, and the gamma assumption of constant coefficient of
variation is reasonable, as suggested by Cunningham and Heathcote (1989) for
these data:
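The fitting command is not reproduced above; a sketch consistent with the discussion of the identity link below (the model name trans.gamma appears later in the text, and the variables are those of the Transact data set in the car package):
> trans.gamma <- glm(time ~ t1 + t2, family=Gamma(link="identity"), data=Transact)
> summary(trans.gamma)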
Figure 5.11 The estimated latent response from the proportional-odds model fit to the Canadian women’s
labor-force data. The dotted horizontal lines are the estimated thresholds between adjacent levels of the
observed response.
Figure 5.12 Gamma densities for various values of the shape parameter, α, with μ = 1 fixed.
    Null deviance: 92.603  on 260  degrees of freedom
Residual deviance:  7.477  on 258  degrees of freedom
AIC: 4322
The canonical link for the Gamma family is the inverse link (see Table 5.2, p.
231), and if we were to use that link, we would then model the mean for time as
E( time |t1 , t2 ) = 1/( β0 + β1t1 + β2t2 ). Using the identity link in this
example, the coefficients of the gamma-regression model will have the same
interpretations as the coefficients of the linear model that we fit by least squares
in Section 4.3.6: The intercept is the amount of time required independent of the
transactions; the coefficient of t1 is the typical time required for each Type 1
transaction; and the coefficient of t2 is the typical time required for each Type 2
transaction. The inverse scale loses this simple interpretation of the coefficients.
In general, however, using the identity link with the gamma family is
problematic because it can lead to negative fitted values for a strictly positive
response, a problem that doesn’t occur for our example.
The estimate of the shape parameter, which is the inverse of the dispersion
parameter in the GLM, is returned by the gamma.shape function:
> gamma.shape(trans.gamma)
Alpha: 35.073
SE: 3.056
The large estimated shape implies that the error distribution is nearly symmetric.
The first two problems are addressed in this section. The third problem will be
discussed in Chapter 6.
One approach to overdispersion is to estimate a scale parameter φ rather than
use the assumed value φ = 1 for the Poisson or binomial distributions. The usual
estimator of the dispersion parameter is φ̂ = X2/df, the value of Pearson’s X2
divided by the residual df . Estimating the dispersion has no effect on the
coefficient estimates, but it inflates all their standard errors by the factor √φ̂.
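A sketch of this estimate for the Ornstein Poisson regression; the name phihat is the one used in the text, and the computation simply follows the definition above:
> phihat <- sum(residuals(mod.ornstein, type="pearson")^2)/df.residual(mod.ornstein)
> phihat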
[1] 6.399
Standard errors and Wald tests adjusted for overdispersion may be obtained by
the command
> summary(mod.ornstein, dispersion=phihat)
Call:
glm(formula = interlocks ~ log2(assets) + nation +
sector, family = poisson, data = Ornstein)
    Null deviance: 3737.0  on 247  degrees of freedom
Residual deviance: 1547.1  on 234  degrees of freedom
AIC: 2473
The Anova function in the car package uses this estimate of φ to get tests if we
specify test="F" :
> Anova(mod.ornstein, test="F")
To select a value for θ, we could choose a grid of reasonable values and then
select the one that minimizes the AIC:
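A sketch of such a grid search, using the negative.binomial family generator from the MASS package; the grid of values is illustrative:
> library(MASS)
> thetas <- c(0.5, 1, 1.5, 2, 2.5)
> sapply(thetas, function(theta)
+     AIC(glm(interlocks ~ log2(assets) + nation + sector,
+         family=negative.binomial(theta), data=Ornstein)))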
    Null deviance: 487.78  on 247  degrees of freedom
Residual deviance: 279.50  on 234  degrees of freedom
AIC: 1674

    Null deviance: 521.58  on 247  degrees of freedom
Residual deviance: 296.52  on 234  degrees of freedom
AIC: 1675
The estimate θ = 1.639 is very close to the value 1.5 that we picked by grid
search.
We have already discussed in some detail the use of the formula and family
arguments. The data , subset , na.action , and contrasts arguments work as
in lm (see Section 4.8).
Here are a few comments on the other arguments to glm :
5.11.1 weights
The weights argument is used to specify prior weights, which are a vector of
n positive numbers. Leaving off this argument effectively sets all the prior
weights to 1. For Gaussian GLMs, the weights argument is used for weighted-
least-squares regression, and this argument can serve a similar purpose for
gamma GLMs. For a Poisson GLM, the weights have no obvious elementary
use. As we mentioned in Section 5.4, the weights argument may also be used to
specify binomial denominators (i.e., total counts) in a binomial GLM.
The IWLS computing algorithm used by glm (Section 5.12) produces a set of
working weights, which are different from the prior weights. The weights
function applied to a fitted glm object retrieves the prior weights.
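As mentioned above, binomial totals can be supplied through weights together with a proportion response, as in this sketch using the voted and did.not.vote counts of the Campbell data frame discussed earlier (the particular model formula is illustrative):
> glm(voted/(voted + did.not.vote) ~ closeness*preference,
+     family=binomial, weights=voted + did.not.vote, data=Campbell)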
5.11.2 start
This argument supplies start values for the coefficients in the linear predictor.
One of the remarkable features of GLMs is that the automatic algorithm that is
used to find starting values is almost always effective, and the user therefore
need not provide starting values.
5.11.3 offset
and we would therefore want to fit a Poisson regression with the linear predictor
η( x) + loge N. This is accomplished by setting the argument offset=log(N) .
5.11.4 control
This argument allows the user to set several technical options, in the form of a
list, controlling the IWLS fitting algorithm (described in the next section):
epsilon , the convergence criterion (which defaults to 0.0001 ), representing the
maximum relative change in the deviance before a solution is declared and
iteration stops; maxit , the maximum number of iterations (default, 10 ); and
trace (default, FALSE ), which if TRUE causes a record of the IWLS iterations to
be printed. These control options can also be specified directly as arguments to
glm . The ability to control the IWLS fitting process is sometimes useful—for
example, when convergence problems are encountered.
where the ci are fixed constants (e.g., in the binomial family, ci = 1/Ni). Then
we perform a weighted-least-squares regression of z(t) on the xs in the linear
predictor, minimizing the weighted sum of squares Σi wi(t)[zi(t) − xi′β]2, where xi′ is the
ith row of the model matrix X of the regressors, obtaining new estimates of the
regression parameters, β(t+1). This process is initiated with suitable starting
values β(0) and continues until the coefficients stabilize at the maximum-
likelihood estimates .
The estimated asymptotic covariance matrix of b is obtained from the last
iteration of the IWLS procedure as

V̂(b) = φ̂ (X′WX)−1

where W = diag{wi} and φ̂ (if φ is to be estimated from the data) is the Pearson
statistic divided by the residual df for the model.
Binomial logistic regression provides a relatively simple illustration; we have
(after algebraic manipulation)
where Ni is the number of trials associated with the ith observation, and φ = 1 for
the binomial.
generate contrasts for the factors (see Section 4.4.4), we could have interpreted
each of the main-effects tests as an average over the levels of the other factor.
These tests, however, would be of dubious interest in light of the interaction.
6Actually, Ornstein used both assets and the log of assets in a least-squares
regression. Poisson regression was essentially unknown to sociologists at the
time, and so we don’t mean to imply criticism of Ornstein’s work.
7We could set common scales for the vertical axes by, for example, specifying
the argument ylim=log(c(1, 80)) to plot , but then the graphs for nation and
sector would be considerably compressed.
8It is traditional in loglinear models to call terms such as sex:citizen
“interactions,” even though they represent not interaction in the usual sense of
the word—that is, a change in the partial relationship between two variables
across different levels of a third variable—but rather the association between a
pair of variables.
9As in the case of two-way tables, the terms main effect and interaction can be
misleading: The “main effects” pertain to the marginal distributions of the three
variables and the “two-way interactions” to the partial associations between pairs
of variables. The “three-way interaction” represents interaction in the more usual
sense, in that the presence of this term in the model implies that the partial
association between each pair of variables varies over the levels of the third
variable.
10Complex survey data can be properly analyzed in R using the survey package,
which, among its many facilities, has a function for fitting GLMs. We do not,
however, have the original data set from which our contingency table was
constructed.
11These conclusions extend to polytomous (i.e., multi-category) responses,
where loglinear models that fit the highest-order interaction among the
predictors are equivalent to multinomial-logit models (described in Section 5.7).
12There are other packages that also can fit multinomial-logit models. Of
particular note is the VGAM package, which fits a wide variety of regression
models for categorical and other data.
13General strategies for constructing complex graphs are described in Chapter 7.
14The vglm function in the VGAM package fits a wide variety of models,
including both the proportional-odds model and a similar model for cumulative
logits that doesn’t impose the assumption of parallel regressions. We had trouble,
however, using vglm to fit this model to the data in our example.
15The polr function can also be used to fit some other similar models for an
but doing so here would confuse the mean of the marginal distribution of the
random effect S with the mean function for the conditional distribution of y| x, S.
This notation is also consistent with the labeling of output in the glm.nb
function.
19The values
are called working residuals and play a role in diagnostics for GLMs (see
Section 6.6).
6 Diagnosing Problems in Linear and Generalized Linear Models
6.1 Residuals
Residuals of one sort or another are the basis of most diagnostic methods.
Suppose that we specify and fit a linear model assuming constant error variance
σ2. The ordinary residuals are given by the differences between the responses
and the fitted values, ei = yi − ŷi.
While the standardized residuals eSi have constant variance, they are no longer uncorrelated with the
fitted values or linear combinations of the regressors, so using standardized
residuals in plots is not an obvious improvement.
Studentized residuals, eTi, are defined like the standardized residuals, except
that the estimate of σ2 in the denominator is computed from the regression with the ith
observation removed. Like the standardized residuals, the Studentized residuals have
constant variance. In addition, if the original errors are normally distributed, then
eTi follows a t distribution with n − k − 2 df and can be used to test for outliers
(see Section 6.3). One can show that the Studentized residuals can be computed
from quantities available in the full-sample fit, and so obtaining them doesn’t really require refitting the
regression without the ith observation.
If the model is fit by WLS regression with known positive weights wi, then the
ordinary residuals are replaced by the Pearson residuals, ePi = √wi ei.
The car package includes a number of functions that produce plots of residuals
and related quantities. The variety of plots reflects the fact that no one diagnostic
graph is appropriate for all purposes.
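The residual plots discussed next are produced by the residualPlots function in the car package; a sketch of the call, assuming prestige.mod.2 is the model for the Prestige data fit earlier in the chapter (its fitting command is not reproduced here):
> library(car)
> residualPlots(prestige.mod.2)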
To help examine these residual plots, a lack-of-fit test is computed for each
numeric predictor, and a curve is added to the graph. The lack-of-fit test for
education , for example, is the t test for the regressor ( education )2 added to
the model, for which the corresponding p value rounds to .50, indicating no lack-
of-fit of this type. For income , the lack-of-fit test has the p value .005, clearly
confirming the nonlinear pattern visible in the graph. The lines shown on the plot
are the fitted quadratic regressions of the Pearson residuals on the numeric
predictors.
For the plot of residuals versus fitted values, the test—called Tukey’s test for
nonadditivity (Tukey, 1949)—is obtained by adding the squares of the fitted
values to the model and refitting. The significance level for Tukey’s test is
obtained by comparing the statistic with the standard-normal distribution. The
test confirms the visible impression of curvature in the residual plot, further
reinforcing the conclusion that the fitted model is not adequate.
The residualPlots function shares many arguments with other graphics
functions in the car package; see ?residualPlots for details. In residualPlots
, all arguments other than the first are optional. The argument id.n could be set
to a positive number to identify automatically the id.n most unusual cases,
which by default are the cases with the largest (absolute) residuals (see Section
3.5). There are additional arguments to control the layout of the plots and the
type of residual plotted. For example, setting type="rstudent" would plot
Studentized residuals rather than Pearson residuals. Setting smooth=TRUE,
quadratic=FALSE would display a lowess smooth rather than a quadratic curve
on each plot, although the test statistics always correspond to the fitted
quadratics.
If you want only the plot of residuals against fitted values, you can use
> residualPlots(prestige.mod.2, ~ 1, fitted=TRUE)
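Marginal model plots are drawn by the marginalModelPlots function in the car
package; a minimal call for the same model (a sketch) is
> marginalModelPlots(prestige.mod.2)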
These plots (shown in Figure 6.2) all have the response variable, in this case
prestige , on the vertical axis, while the horizontal axis is given in turn by each
of the numeric predictors in the model and the fitted values. The plots of the
response versus individual predictors display the conditional distribution of the
response given each predictor, ignoring the other predictors; these are marginal
plots in the sense that they show the marginal relationship between the response
and each predictor. The plot versus fitted values is a little different in that it
displays the conditional distribution of the response given the fit of the model.
Figure 6.2 Marginal-model plots for the regression of prestige on education , income , and type in the
Prestige data set.
We can estimate a regression function for each of the marginal plots by fitting
a smoother to the points in the plot. The marginalModelPlots function uses a
lowess smooth, as shown by the solid line on the plot.
Now imagine a second graph that replaces the vertical axis with the fitted
values from the model. If the model is appropriate for the data, then, under fairly
mild conditions, the smooth fit to this second plot should also estimate the
conditional expectation of the response given the predictor on the horizontal
axis. The second smooth is also drawn on the marginal model plot, as a dashed
line. If the model fits the data well, then the two smooths should match on each
of the marginal model plots; if any pair of smooths fails to match, then we have
evidence that the model does not fit the data well.
An interesting feature of the marginal model plots in Figure 6.2 is that even
though the model that we fit to the Prestige data specifies linear partial
relationships between prestige and each of education and income, it is able to
reproduce nonlinear marginal relationships for these two predictors. Indeed, the
model, as represented by the dashed lines, does a fairly good job of matching the
marginal relationships represented by the solid lines, although the systematic
failures discovered in the residual plots are discernable here as well.
Marginal model plots can be used with any fitting or modeling method that
produces fitted values, and so they can be applied to some problems where the
definition of residuals is unclear. In particular, marginal model plots generalize
nicely to GLMs.
The marginalModelPlots function has an SD argument, which if set to TRUE
adds estimated standard deviation lines to the graph. The plots can therefore be
used to check both the regression function, as illustrated here, and the
assumptions about variance. Other arguments to the marginalModelPlots
function are similar to those for residualPlots.
The added-variable plot for x1 is simply a scatterplot with the residuals from the
regression of the response on the other regressors (Step 1) on the vertical axis and
the residuals from the regression of x1 on the other regressors (Step 2) on the
horizontal axis.
The avPlots function in the car package works both for linear models and
GLMs. It has arguments for controlling which plots are drawn, point labeling,
and plot layout, and these arguments are the same as for the residualPlots
function (described in Section 6.2.1).
Added-variable plots for the Canadian occupational-prestige regression (in
Figure 6.3) are produced by the following command:
> avPlots(prestige.mod.2, id.n=2, id.cex=0.6)
Figure 6.3 Added-variable plots for the regression of prestige on education , income , and type in the
Prestige data set.
The argument id.n=2 will result in identifying up to four points in each graph,
the two that are farthest from the mean on the horizontal axis and the two with
the largest absolute residuals from the fitted line. Because the case labels in the
Prestige data set are very long, we used id.cex=0.6 to reduce the printed
labels to 60% of their default size.
The added-variable plot has several interesting and useful properties:
The least-squares line on the added-variable plot for the regressor xj has
slope bj, equal to the partial slope for xj in the full regression. Thus, for
example, the slope in the added-variable plot for education is b1 = 3.67,
and the slope in the added-variable plot for income is b2 = 0.00101. (The
income slope is small because the unit of income— $1 of annual income—
is small.)
The residuals from the least-squares line in the added-variable plot are the
same as the residuals ei from the regression of the response on all of the
regressors.
Because the positions on the horizontal axis of the added-variable plot show
values of xj conditional on the other regressors, points far to the left or right
represent observations for which the value of xj is unusual given the values
of the other regressors. Likewise, the variation of the variable on the
horizontal axis is the conditional variation of xj, and the added-variable plot
therefore allows us to visualize the precision of the estimation of bj.
For factors, an added-variable plot is produced for each of the contrasts that
are used to define the factor, and thus, if we change the way the contrasts
are coded for a factor, the corresponding added-variable plots will change
as well.
The added-variable plot allows us to visualize the effect of each regressor after
adjusting for all the other regressors in the model. In Figure 6.3, the plot for
income has a positive slope, but the slope appears to be influenced by two high-
income occupations (physicians and general managers), which pull down the
regression line at the right. There don’t seem to be any particularly noteworthy
points in the added-variable plots for the other regressors.
Although added-variable plots are useful for studying the impact of
observations on regression coefficients (see Section 6.3.3), they can prove
misleading when diagnosing other sorts of problems, such as nonlinearity. A
further disadvantage of the added-variable plot is that the variables on both axes
are sets of residuals, and so neither the response nor the regressors are displayed
directly.
Sall (1990) and Cook and Weisberg (1991) generalize added-variable plots to
terms with more than 1 df , such as a factor or polynomial regressors. Following
Sall, we call these graphs leverage plots. For terms with 1 df , the leverage plots
are very similar to added-variable plots, except that the slope in the plot is
always equal to 1, not to the corresponding regression coefficient. Although
leverage plots can be misleading in certain circumstances,4 they can be useful for
locating groups of cases that are jointly high-leverage or influential. Leverage,
influence, and related ideas are explored in the next section. There is a
leveragePlots function in the car package, which works only for linear models.
yi = α + β1xi1 + … + βkxik + γ di + εi
where di is a dummy regressor coded 1 for observation i and 0 for all other
observations. If γ ≠ 0, then the conditional expectation of the ith observation has
the same dependence on x1, … , xk as the other observations, but its intercept is
shifted from α to α + γ . The t statistic for testing the null hypothesis H0: γ = 0
against a two-sided alternative has n − k − 2 df if the errors are normally
distributed and is the appropriate test for a single mean-shift outlier at
observation i. Remarkably, this t statistic turns out to be identical to the ith
Studentized residual, eTi (Equation 6.4, p. 287), and so we can get the test
statistics for the n different null hypotheses, H0i: case i is not a mean-shift
outlier, i = 1, … , n, at minimal computational cost.
Our attention is generally drawn to the largest absolute Studentized residual,
and this presents a problem: Even if the Studentized residuals were independent,
which they are not, there would be an issue of simultaneous inference entailed
by picking the largest of n test statistics. The dependence of the Studentized
residuals complicates the issue. We can deal with this problem (a) by a
Bonferroni adjustment of the p value for the largest absolute Studentized
residual, multiplying the usual two-tail p by the sample size, n, or (b) by
constructing a quantile-comparison plot of the Studentized residuals with a
confidence envelope that takes their dependence into account.
We reconsider Duncan’s occupational-prestige data (introduced in Section
1.2), regressing prestige on occupational income and education levels:
> mod.duncan <- lm(prestige ~ income + education, data=Duncan)
The generic qqPlot function in the car package has a method for linear
models, plotting Studentized residuals against the corresponding quantiles of
t(n − k − 2). By default, qqPlot generates a 95% pointwise confidence envelope for
the Studentized residuals, using a parametric version of the bootstrap, as
suggested by Atkinson (1985):5
> qqPlot(mod.duncan, id.n=3)
The resulting plot is shown in Figure 6.4. Setting the argument id.n=3 , the
qqPlot function returns the names of the three observations with the largest
absolute Studentized residuals (see Section 3.5 on point identification); in this
case, only one observation, minister , strays slightly outside of the confidence
envelope. If you repeat this command, your plot may look a little different from
ours because the envelope is computed by simulation. The distribution of the
Studentized residuals looks heavy-tailed compared to the reference t distribution:
Perhaps a method of robust regression would be more appropriate for these
data.6
Figure 6.4 Quantile-comparison plot of Studentized residuals from Duncan’s occupational-prestige
regression, showing the pointwise 95% simulated confidence envelope.
The outlierTest function in the car package locates the largest Studentized
residual in absolute value and computes the Bonferroni-corrected t test:
> outlierTest(mod.duncan)
The hatvalues function works for both linear models and GLMs. One way of
examining the hat-values and other individual-observation diagnostic statistics is
to construct index plots, graphing the statistics against the corresponding
observation indices.
For example, the following command uses the car function
influenceIndexPlot to produce Figure 6.5, which includes index plots of
Studentized residuals, the corresponding Bonferroni p values for outlier testing,
the hat-values, and Cook’s distances (discussed in the next section) for Duncan’s
occupational-prestige regression:
> influenceIndexPlot(mod.duncan, id.n=3)
The occupations railroad engineer (RR.engineer ), conductor , and minister
stand out from the rest in the plot of hat-values, indicating that their regressor
values are unusual relative to the other occupations. In the plot of p values for
the outlier tests, cases for which the Bonferroni bound is bigger than 1 are set
equal to 1, and here only one case (minister ) has a Bonferroni p value much
less than 1.
COOK’S DISTANCE
It is convenient to summarize the size of the difference b(−i) − b by a single
number, and this can be done in several ways. The most common summary
measure of influence is Cook’s distance (Cook, 1977), Di, which is just a
weighted sum of squares of the differences between the individual elements of
the coefficient vectors.9 Interestingly, Cook’s distance can be computed from
diagnostic statistics that we have already encountered:
Di = [e²Si / (k + 1)] × [hi / (1 − hi)]
where e²Si is the squared standardized residual (Equation 6.3, p. 286) and hi is the
hat-value for observation i. The first factor may be thought of as a measure of
outlyingness and the second as a measure of leverage. Observations for which Di
is large are potentially influential cases. If any noteworthy Di are apparent, then
it is prudent to remove the corresponding cases temporarily from the data, refit
the regression, and see how the results change. Because an influential
observation can affect the fit of the model at other observations, it is best to
remove observations one at a time, refitting the model at each step and
reexamining the resulting Cook’s distances.
The generic function cooks.distance has methods for linear models and
GLMs. Cook’s distances are also plotted, along with Studentized residuals and
hat-values, by the influenceIndexPlot function, as illustrated for Duncan’s
regression in Figure 6.5. The occupation minister is the most influential
according to Cook’s distance, and we therefore see what happens when we
delete this case and refit the model:
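A sketch of the deletion and comparison follows; the object name mod.duncan.2
is our own, and compareCoefs is described immediately below:
> mod.duncan.2 <- update(mod.duncan,
+     subset = rownames(Duncan) != "minister")
> compareCoefs(mod.duncan, mod.duncan.2)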
Figure 6.6 Plot of hat-values, Studentized residuals, and Cook’s distances for Duncan’s occupational-
prestige regression. The size of the circles is proportional to Cook’s Di.
The compareCoefs function displays the estimates from one or more fitted
models in a compact table. Removing minister increases the coefficient for
income by about 20% and decreases the coefficient for education by about the
same amount. Standard errors are much less affected. In other problems,
removing an observation can change significant results to insignificant ones, and
vice-versa.
The influencePlot function in the car package provides an alternative to
index plots of diagnostic statistics:
> influencePlot(mod.duncan, id.n=3)
This command produces a bubble-plot, shown in Figure 6.6, which combines the
display of Studentized residuals, hat-values, and Cook’s distances, with the areas
of the circles proportional to Cook’s Di.10 As usual the id.n argument is used to
label points. In this case, the id.n points with the largest hat-values, Cook’s
distances, or absolute Studentized residuals will be flagged, so more than id.n
points in all may be labeled.
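Influence on the individual coefficients can be examined with dfbeta and dfbetas
values. A plot like Figure 6.8 can be drawn with the standard dfbetas function; a
sketch (the object name dfbs.duncan is ours, and the points in the figure were
labeled interactively with identify):
> dfbs.duncan <- dfbetas(mod.duncan)
> plot(dfbs.duncan[, c("income", "education")])  # scaled changes in the coefficients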
Figure 6.8 dfbetasij values for the income and education coefficients in Duncan’s occupational-prestige
regression. Three points were identified interactively.
The negative relationship between the dfbetasij values for the two regressors
reflects the positive correlation of the regressors themselves. Two pairs of values
stand out: Consistent with our earlier remarks, observations minister and
conductor make the income coefficient smaller and the education coefficient
larger. We also identified the occupation RR.engineer in the plot.
NONNORMAL ERRORS
Departures from the assumption of normally distributed errors are probably
the most difficult problem to diagnose. The only data available for studying the
error distribution are the residuals. Even for an otherwise correctly specified
model, the residuals can have substantially different variances, can be strongly
correlated, and tend to behave more like a normal sample than do the original
errors, a property that has been called supernormality (Gnanadesikan, 1977).
A quantile-comparison plot of Studentized residuals against the t distribution
(as described in Section 6.3.1) is useful in drawing our attention to the tail
behavior of the residuals, possibly revealing heavy-tailed or skewed
distributions. A nonparametric density estimate, however, does a better job of
conveying a general sense of the shape of the residual distribution.
In Section 5.5, we fit a Poisson regression to Ornstein’s data on interlocking
directorates among Canadian corporations, regressing the number of interlocks
maintained by each firm on the firm’s assets, nation of control, and sector of
operation. Because number of interlocks is a count, the Poisson model is a
natural starting point, but the original source used a least-squares regression
similar to the following:
> mod.ornstein <- lm(interlocks + 1 ~ log(assets) + nation + sector,
+ data=Ornstein)
We put interlocks + 1 on the left-hand side of the model formula because
there are some 0 values in interlocks and we will shortly consider power
transformations of the response variable.
Quantile-comparison and density plots of the Studentized residuals for
Ornstein’s regression are produced by the following R commands (Figure 6.9):
> par(mfrow=c(1,2))
> qqPlot(mod.ornstein, id.n=0)
> plot(density(rstudent(mod.ornstein)))
Both tails of the distribution of Studentized residuals are heavier than they
should be, but the upper tail is even heavier than the lower one, and
consequently, the distribution is positively skewed. A positive skew in the
distribution of the residuals can often be corrected by transforming y down the
ladder of powers. The next section describes a systematic method for selecting a
normalizing transformation of y.
BOX-COX TRANSFORMATIONS
The goal of fitting a model that exhibits linearity, constant variance, and
normality can in principle require three different response transformations, but
experience suggests that one transformation is often effective for all of these
tasks. The most common method for selecting a transformation of the response
in regression was introduced by Box and Cox (1964). If the response y is a
strictly positive variable, then the Box-Cox power transformations (introduced in
Section 3.4.2), implemented in the bcPower function in the car package, are
often effective:
TBC(y, λ) = (y^λ − 1)/λ for λ ≠ 0, and TBC(y, 0) = loge y
Figure 6.9 Quantile-comparison plot and nonparametric density estimate for the distribution of the
Studentized residuals from Ornstein’s interlocking-directorate regression.
If y is not strictly positive, then the Yeo-Johnson family, computed by the
yjPower function, can be used in place of the Box-Cox family; alternatively, we
can add a start to y to make all the values positive (as explained in Section
3.4.2).
Box and Cox proposed selecting the value of λ by analogy to the method of
maximum likelihood, so that the residuals from the linear regression of TBC( y, λ)
on the predictors are as close to normally distributed as possible.11 The car
package provides two functions for estimating λ. The first, boxCox , is a slight
generalization of the boxcox function in the MASS package (Venables and
Ripley, 2002).12 The second is the powerTransform function introduced in a
related context in Section 3.4.7.
For Ornstein’s least-squares regression, for example,
> boxCox(mod.ornstein, lambda = seq(0, 0.6, by=0.1))
This command produces the graph of the profile log-likelihood function for λ in
Figure 6.10. The best estimate of λ is the value that maximizes the profile
likelihood, which in this example is λ ≈ 0.2. An approximate 95% confidence
interval for λ is the set of all λs for which the value of the profile log-likelihood
is within 1.92 of the maximum—from about 0.1 to 0.3.13 It is usual to round the
estimate of λ to a familiar value, such as −1, −1/2, 0, 1/3, 1/2, 1, or 2. In this
case, we would round to the cube-root transformation, λ = 1/3. Because the
response variable interlocks is a count, however, we might prefer the log
transformation (λ = 0) or the square-root transformation (λ = 1/2).
Figure 6.10 Profile log-likelihood for the transformation parameter λ in the Box-Cox model applied to
Ornstein’s interlocking-directorate regression.
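One way to save the transformed response, with λ rounded to a convenient value,
is sketched below; the object names p1 and interlocks.tran are our own:
> p1 <- powerTransform(mod.ornstein)
> Ornstein$interlocks.tran <- bcPower(Ornstein$interlocks + 1,
+     coef(p1, round=TRUE))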
This command saves the transformed values with λ rounded to the convenient
value in the confidence interval that is closest to the point estimate. If none of
the convenient values are in the interval, then no rounding is done.
The constructed variable is added as a regressor, and the t statistic for this
variable is the approximate score statistic for the transformation. Although the
score test isn’t terribly interesting in light of the ready availability of likelihood
ratio tests for the transformation parameter, an added-variable plot for the
constructed variable in the auxiliary regression—called a constructed-variable
plot—shows leverage and influence on the decision to transform y.
The boxCoxVariable function in the car package facilitates the computation
of the constructed variable. Thus, for Ornstein’s regression:
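A sketch of the computation (the intermediate model name mod.ornstein.cv is
ours):
> mod.ornstein.cv <- update(mod.ornstein,
+     . ~ . + boxCoxVariable(interlocks + 1))
> summary(mod.ornstein.cv)$coef["boxCoxVariable(interlocks + 1)", ,
+     drop=FALSE]
> avPlots(mod.ornstein.cv, "boxCoxVariable(interlocks + 1)")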
Figure 6.11 Constructed-variable plot for the Box-Cox transformation of y in Ornstein’s interlocking-
directorate regression.
We are only interested in the t test and added-variable plot for the constructed
variable, so we printed only the row of the coefficient table for that variable. The
argument drop=FALSE told R to print the result as a matrix, keeping the labels,
rather than as a vector (see Section 2.3.4). The constructed-variable plot is
obtained using avPlots , with the second argument specifying the constructed
variable. The resulting constructed-variable plot is shown in Figure 6.11. The t
statistic for the constructed variable demonstrates that there is very strong
evidence of the need to transform y, agreeing with the preferred likelihood ratio
test. The constructed-variable plot suggests that this evidence is spread through
the data rather than being dependent on a small fraction of the observations.
We fit a linear model to the wool data from Box and Cox (1964):
> wool.mod <- lm(cycles ~ len + amp + load, data=Wool)
The inverse response plot for the model is drawn by the following command
(and appears in Figure 6.12):
> inverseResponsePlot(wool.mod, id.n=4)
Four lines are shown on the inverse response plot, each of which is from the
nonlinear regression of the fitted values ŷ on TBC(y, λ), for λ = −1, 0, 1 and for the
value of λ that best fits the points in the plot. A linearizing transformation of the response
would correspond to a value of λ that matches the data well. In the example, the
linearizing transformation producing the smallest residual sum of squares, λ =
−0.06, is essentially the log-transform. As can be seen on the graph, the optimal
transformation and log-transform produce essentially the same fitted line, while
the other default choices are quite a bit worse. The printed output from the
function gives the residual sums of squares for the four fitted lines.
As an alternative approach, the Box-Cox method can be used to find a
normalizing transformation, as in the original analysis of these data by Box and
Cox (1964):
> summary(powerTransform(wool.mod))
Both methods therefore suggest that the log-transform is appropriate here. The
reader is invited to explore these data further. Without transformation, inclusion
of higher-order terms in the predictors is required, but in the log-transformed
scale, there is a very simple model that closely matches the data.
One advantage of the inverse response plot is that we can visualize the
leverage and influence of individual observations on the choice of a
transformation; separated points tend to be influential. In Figure 6.12, we
marked the four points with the largest residuals from the line for λ = 1. All these
points are very well fit by the log-transformed curve and are in the same pattern
as the rest of the data; there are no observations that appear to be overly
influential in determining the transformation.
For the Ornstein data described earlier in this section, the inverse response
plot (not shown) is not successful in selecting a transformation of the response.
For these data, the problem is lack of normality, and the inverse response plots
transform for linearity, not directly for normality.
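Component-plus-residual plots are drawn by the crPlots function in the car
package; a minimal call for the working model for the Canadian occupational-
prestige data, called prestige.mod.3 below, is sketched here:
> crPlots(prestige.mod.3)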
Figure 6.13 Component-plus-residual plots of order=2 for the Canadian occupational-prestige regression.
The component-plus-residual plots for the three predictors appear in Figure 6.13.
The broken line on each panel is the partial fit, bjxj, assuming linearity in the
partial relationship between y and xj. The solid line is a lowess smooth, and it
should suggest a transformation if one is appropriate, for example, via the
bulging rule (see Section 3.4.6). Alternatively, the smooth might suggest a
quadratic or cubic partial regression or, in more complex cases, the use of a
regression spline.
For the Canadian occupational-prestige regression, the component-plus-
residual plot for income is the most clearly curved, and transforming this
variable first and refitting the model is therefore appropriate. In contrast, the
component-plus-residual plot for education is only slightly nonlinear, and the
partial relationship is not simple (in the sense of Section 3.4.6). Finally, the
component-plus-residual plot for women looks mildly quadratic (although the
lack-of-fit test computed by the residualPlots command does not suggest a
significant quadratic effect), with prestige first declining and then rising as
women increases.
Trial-and-error experimentation moving income down the ladder of powers
and roots suggests that a log transformation of this predictor produces a
reasonable fit to the data:
> prestige.mod.4 <- update(prestige.mod.3,
+     . ~ . + log2(income) - income)
which is the model we fit in Section 4.2.2. The component-plus-residual plot for
women in the revised model (not shown) is broadly similar to the plot for women in
Figure 6.13 (and the lack-of-fit test computed in residualPlots has a p value of
.025) and suggests a quadratic regression:
> prestige.mod.5 <- update(prestige.mod.4,
+     . ~ . - women + poly(women, 2))
> summary(prestige.mod.5)$coef
The quadratic term for women is statistically significant but not overwhelmingly
so.
If the regressions among the predictors are strongly nonlinear and not well
described by polynomials, then the component-plus-residual plots may not be
effective in recovering nonlinear partial relationships between the response and
the predictors. For this situation, Cook (1993) provides another generalization of
component-plus-residual plots, called CERES plots (for Combining conditional
Expectations and RESiduals). CERES plots use nonparametric-regression
smoothers rather than polynomial regressions to adjust for nonlinear
relationships among the predictors. The ceresPlots function in the car package
implements Cook’s approach.
Experience suggests that nonlinear relationships among the predictors create
problems for component-plus-residual plots only when these relationships are
very strong. In such cases, a component-plus-residual plot can appear nonlinear
even when the true partial regression is linear—a phenomenon termed leakage.
For the Canadian occupational-prestige regression, higher-order component-
plus-residual plots (in Figure 6.13) and CERES plots are nearly identical to the
standard component-plus-residual plots, as the reader may verify.
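The Box-Tidwell procedure is implemented in the boxTidwell function in the car
package. A sketch of a call consistent with the discussion that follows (the exact
specification used for the output described below is an assumption on our part):
> boxTidwell(prestige ~ income + education,
+     other.x = ~ poly(women, 2), data=Prestige)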
The one-sided formula for the argument other.x indicates the terms in the
model that are not to be transformed—here the quadratic in women . The score
tests for the power transformations of income and education suggest that both
predictors need to be transformed; the maximum-likelihood estimates of the
transformation parameters are λ̂ = −0.04 for income (effectively, the log
transformation of income ) and λ̂ = 2.2 for education (effectively, the square of
education ).
Constructed variables for the Box-Tidwell transformations of the predictors
are given by xj loge xj. These can be easily computed and added to the regression
model to produce approximate score tests and constructed-variable plots. Indeed,
these constructed variables are the basis for Box and Tidwell’s computational
approach to fitting the model and yield the score statistics printed by the
boxTidwell function.
Figure 6.14 Constructed-variable plots for the Box-Tidwell transformation of income and education in the
Canadian occupational-prestige regression.
To obtain constructed-variable plots (Figure 6.14) for income and education
in the Canadian occupational-prestige regression:16
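A sketch of one way to fit the constructed-variable regression and draw the plots
(the model name prestige.mod.bt and the inclusion of the quadratic in women are
our assumptions):
> prestige.mod.bt <- lm(prestige ~ income + education + poly(women, 2)
+     + I(income * log(income)) + I(education * log(education)),
+     data=Prestige)
> summary(prestige.mod.bt)$coef
> avPlots(prestige.mod.bt,
+     c("I(income * log(income))", "I(education * log(education))"))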
The identity function I() was used to protect the multiplication operator (* ),
which would otherwise be interpreted specially within a model formula,
inappropriately generating main effects and an interaction (see Section 4.8).
The constructed-variable plot for income reveals some high-leverage points in
determining the transformation of this predictor, but even when these points are
removed, there is still substantial evidence for the transformation in the rest of
the data.
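One way to produce a plot like Figure 6.15 is with residualPlots, restricted to the
plot against fitted values, following the pattern used earlier in this chapter:
> residualPlots(mod.ornstein, ~ 1, fitted=TRUE)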
Figure 6.15 Plot of Pearson residuals against fitted values for Ornstein’s interlocking-directorate regression.
The obvious fan-shaped array of points in this plot indicates that residual
variance appears to increase as a function of the fitted values—that is, with the
estimated magnitude of the response. In Section 5.5, we modeled these data
using Poisson regression, for which the variance does increase with the mean,
and so reproducing that pattern here is unsurprising. A less desirable alternative
to a regression model that is specifically designed for count data is to try to
stabilize the error variance in Ornstein’s least-squares regression by transforming
the response, as described in the next section.
Figure 6.16 Spread-level plot of Studentized residuals against fitted values, for Ornstein’s interlocking-
directorate regression.
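As the warning message below indicates, the plot is produced by the
spreadLevelPlot function in the car package:
> spreadLevelPlot(mod.ornstein)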
Warning message:
In spreadLevelPlot.lm(mod.ornstein) :
16 negative fitted values removed
The linear-regression model fit to Ornstein’s data doesn’t constrain the fitted
values to be positive, even though the response variable interlocks + 1 is
positive. The spreadLevelPlot function removes negative fitted values, as
indicated in the warning message, before computing logs. The spread-level plot,
shown in Figure 6.16, has an obvious tilt to it. The suggested transformation, λ =
0.55, is not quite as strong as the normalizing transformation estimated by the
Box-Cox method, λ̂ = 0.22 (Section 6.4.1).
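The two tests referred to below are computed by the ncvTest function in the car
package; a sketch of the calls (the variance formula in the second is our
assumption, based on the predictors in the model):
> ncvTest(mod.ornstein)
> ncvTest(mod.ornstein, ~ log(assets) + nation + sector,
+     data=Ornstein)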
Both tests are highly statistically significant, and the difference between the two
suggests that the relationship of spread to level does not entirely account for the
pattern of nonconstant error variance in these data. It was necessary to supply the
data argument in the second command because the ncvTest function does not
assume that the predictors of the error variance are included in the linear-model
object.
Formulas for Var( y| x) are given in the last column of Table 5.2 (p. 231).
This definition of ePi corresponds exactly to the Pearson residuals defined
in Equation 6.6 (p. 287) for WLS regression. These are a basic set of
residuals for use with a GLM because of their direct analogy to linear
models. For a model named m1 , the command residuals(m1,
type="pearson") returns the Pearson residuals.
Standardized Pearson residuals correct for conditional response variation
and for the leverage of the observations:
To compute the ePSi, we need to define the hat-values hi for GLMs. The hi
are taken from the final iteration of the IWLS procedure for fitting the
model and have the usual interpretation, except that, unlike in a linear
model, the hat-values in a GLM depend on y as well as on the configuration
of the xs.
Deviance residuals, eDi, are the square roots of the casewise components of
the residual deviance, attaching the sign of yi − μ̂i. In the linear model, the
deviance residuals reduce to the Pearson residuals. The deviance residuals
are often the preferred form of residual for GLMs, and are returned by the
command residuals(m1, type= "deviance") .
Standardized deviance residuals, eDSi, are the deviance residuals divided by
√(1 − hi) (and by the square root of the estimated dispersion, when one is
estimated).
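The residual plots in Figure 6.17 are for a binary logistic regression fit to the
Canadian women’s labor-force participation data; a sketch of how such plots
could be produced (the model specification and the object name mod.working are
our assumptions):
> Womenlf$working <- factor(Womenlf$partic != "not.work",
+     labels=c("no", "yes"))  # working outside the home?
> mod.working <- glm(working ~ hincome + children,
+     family=binomial, data=Womenlf)
> residualPlots(mod.working, layout=c(1, 3))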
We used the layout argument to reformat the graph to have one row and three
columns. The function plots Pearson residuals versus each of the predictors in
turn. Instead of plotting residuals against fitted values, however, residualPlots
plots residuals against the estimated linear predictor, η̂(x). Each panel in the
graph by default includes a smooth fit rather than a quadratic fit; a lack-of-fit test
is provided only for the numeric predictor hincome and not for the factor
children or for the estimated linear predictor.
In binary regression, the plots of Pearson residuals or deviance residuals are
strongly patterned—particularly the plot against the linear predictor, where the
residuals can take on only two values, depending on whether the response is
equal to 0 or 1. In the plot versus hincome , we have a little more variety in the
possible residuals: children can take on two values, and so the residuals can
take on four values for each value of hincome . Even in this extreme case,
however, a correct model requires that the conditional mean function in any
residual plot be constant as we move across the plot. The fitted smooth helps us
learn about the conditional mean function, and neither of the smooths shown is
especially curved. The lack-of-fit test for hincome has a large significance level,
confirming our view that this plot does not indicate lack of fit. The residuals for
children are shown as a boxplot because children is a factor. The boxplots for
children are difficult to interpret because of the discreteness in the distribution
of the residuals.
Figure 6.17 Residual plots for the binary logistic regression fit to the Canadian women’s labor-force
participation data.
6.6.2 INFLUENCE MEASURES
An approximation to Cook’s distance for GLMs is
The reader can verify that removing just one of the two observations does not
alter the results much, but removing both observations changes the coefficient of
husband’s income by more than 40%, about one standard error. Apparently, the
two cases mask each other, and removing them both is required to produce a
meaningful change in the coefficient for hincome . Cases 76 and 77 are women
working outside the home even though both have children and high-income
husbands.
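A sketch of the check, using case indices 76 and 77 and the model object named
as in our earlier sketch:
> compareCoefs(mod.working,
+     update(mod.working, subset = -c(76, 77)))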
Figure 6.19 Component-plus-residual plot for assets in the Poisson regression fit to Ornstein’s
interlocking-directorate data.
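A plot like the one in Figure 6.19 can be drawn with crPlots, in the same manner
as the follow-up call shown below; a sketch:
> crPlots(mod.ornstein.pois, "assets")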
The component-plus-residual plot for assets is shown in Figure 6.19. This plot
is difficult to interpret because of the extreme positive skew in assets , but it
appears as if the assets slope is a good deal steeper at the left than at the right.
The bulging rule, therefore, points toward transforming assets down the ladder
of powers, and indeed the log-rule in Section 3.4.1 suggests replacing assets by
its logarithm before fitting the regression in the first place (which, of course, is
what we did originally):
> mod.ornstein.pois.2 <- update(mod.ornstein.pois,
+     . ~ log2(assets) + nation + sector)
> crPlots(mod.ornstein.pois.2, "log2(assets)")
The linearity of the follow-up component-plus-residual plot in Figure 6.20
confirms that the log-transform is a much better scale for assets .
The other diagnostics described in Section 6.4 for selecting a predictor
transformation lead to the log-transform as well. For example, the Box-Tidwell
constructed-variable plot for the power transformation of a predictor (introduced
in Section 6.4.2) also extends directly to GLMs, augmenting the model with the
constructed variable xj loge xj. We can use this method with Ornstein’s Poisson
regression:
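A sketch of the augmented model and plots (the intermediate model name is
ours):
> mod.ornstein.pois.cv <- update(mod.ornstein.pois,
+     . ~ . + I(assets * log(assets)))
> summary(mod.ornstein.pois.cv)$coef
> avPlots(mod.ornstein.pois.cv, "I(assets * log(assets))")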
Figure 6.20 Component-plus-residual plot for the log of assets in the respecified Poisson regression for
Ornstein’s data.
Figure 6.21 Constructed-variable plot for the power transformation of assets in Ornstein’s interlocking-
directorate Poisson regression.
Only the z test statistic for the constructed variable I(assets * log(assets)) is
of interest, and it leaves little doubt about the need for transforming assets . The
constructed-variable plot in Figure 6.21 supports the transformation.
Figure 6.22 Component-plus-residual plot for lwg in the binary logistic regression for Mroz’s women’s
labor force participation data.
The peculiar split in the plot reflects the binary-response variable, with the lower
cluster of points corresponding to lfp = "no" and the upper cluster to lfp =
"yes" . It is apparent that lwg is much less variable when lfp = "no"
, inducing an artifactually curvilinear relationship between lwg and lfp : We
expect fitted values (such as the values of lwg when lfp = "no" ) to be more
homogeneous than observed values, because fitted values lack a residual
component of variation.
We leave it to the reader to construct component-plus-residual or CERES plots
for the other predictors in the model.
The estimated sampling variance of the coefficient bj can be written as
Var(bj) = {σ̂² / [(n − 1)s²j]} × [1 / (1 − R²j)]
where σ̂² is the estimated error variance, s²j is the sample variance of xj, and
1/(1 − R²j), called the variance inflation factor (VIFj) for bj, is a function of the
multiple correlation Rj from the regression of xj on the other xs. The VIF is the
simplest and most direct measure of the harm produced by collinearity: The
square root of the VIF indicates how much the confidence interval for βj is
expanded relative to similar uncorrelated data, were it possible for such data to
exist. If we wish to explicate the collinear relationships among the predictors,
then we can examine the coefficients from the regression of each predictor with
a large VIF on the other predictors.
The VIF is not applicable, however, to sets of related regressors for multiple-
degree-of-freedom effects, such as polynomial regressors or contrasts
constructed to represent a factor. Fox and Monette (1992) generalize the notion
of variance inflation by considering the relative size of the joint confidence
region for the coefficients associated with a related set of regressors. The
resulting measure is called a generalized variance inflation factor (or GVIF).18
If there are p regressors in a term, then GVIF^(1/(2p)) is a one-dimensional
expression of the decrease in the precision of estimation due to collinearity—
analogous to taking the square root of the usual VIF. When p = 1, the GVIF
reduces to the usual VIF.
The vif function in the car package calculates VIFs for the terms in a linear
model. When each term has one degree of freedom, the usual VIF is returned;
otherwise, the GVIF is calculated.
As a first example, consider the data on the 1980 U.S. Census undercount in
the data frame Ericksen (Ericksen et al., 1989):
These variables describe 66 areas of the United States, including 16 major cities,
the 38 states without major cities, and the remainders of the 12 states that
contain the 16 major cities. The following variables are included:
The vif function can also be applied to GLMs, such as the Poisson-regression
model fit to Ornstein’s data:19
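A minimal call (a sketch, using the Poisson model object from Section 5.5):
> vif(mod.ornstein.pois)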
Residuals and residual plotting for linear models are discussed in Weisberg
(2005, sec. 8.1–8.2). Marginal model plots, introduced in Section 6.2.2, are
described in Weisberg (2005, sec. 8.4). Added-variable plots are discussed
in Weisberg (2005, sec. 3.1). Outliers and influence are taken up in
Weisberg (2005, chap. 9).
Diagnostics for unusual and influential data are described in Fox (2008,
chap. 11); for nonnormality, nonconstant error variance, and nonlinearity in
Fox (2008, chap. 12); and for collinearity in Fox (2008, chap. 13). A
general treatment of residuals in models without additive errors, which
expands on the discussion in Section 6.6.1, is given by Cox and Snell
(1968). Diagnostics for GLMs are taken up in Fox (2008, sec. 15.4).
For further information on various aspects of regression diagnostics, see
Cook and Weisberg (1982, 1994, 1997, 1999), Fox (1991), Cook (1998),
and Atkinson (1985).
fitted values. The vector of fitted values is given by ŷ = Xb = X(X'X)−1X'y = Hy, where H = {hij} = X(X'X)
−1X', called the hat-matrix, projects y into the subspace spanned by the columns of the model matrix X.
Because H = H'H, the hat-values hi are simply the diagonal entries of the hat-matrix.
8 If vector notation is unfamiliar, simply think of b as the collection of estimated regression coefficients, b0,
b1, … , bk.
9* In matrix notation, Di = (b(−i) − b)'X'X(b(−i) − b) / [(k + 1)σ̂²].
10In Chapter 8, we describe how to write a similar function as a preliminary example of programming in R.
11* If TBC(y, λ0)|x is normally distributed, then TBC(y, λ1)|x cannot be normally distributed for λ1 ≠ λ0,
and so the distribution changes for every value of λ. The method Box and Cox proposed ignores this fact to
get a maximum-likelihood-like estimate that turns out to have properties similar to those of maximum-
likelihood estimates.
12boxCox adds the argument family . If set to the default family="bcPower" , then the function is identical
to the original boxcox . If set to family="yjPower" , then the Yeo-Johnson power transformations are used.
13 The value 1.92 is χ²(1, .95)/2 ≈ 3.84/2, half the .95 quantile of the chi-square distribution with 1 df.
14Nonlinear least squares is taken up in the online appendix to this Companion.
15The component-plus-residual plot for education in the preceding section reveals that the curvature of the
partial relationship of prestige to education , which is in any event small, appears to change direction—
that is, though monotone is not simple—and so a power transformation is not altogether appropriate here.
16The observant reader will notice that the t values for the constructed-value regression are the same as the
score statistics reported by boxTidwell but that there are small differences in the p values. These
differences occur because boxTidwell uses the standard-normal distribution for the score test, while the
standard summary for a linear model uses the t distribution.
17Essentially the same calculation is the basis of Box and Tidwell’s iterative procedure for finding
transformations in linear least-squares regression (Section 6.4.2).
18* Let R11 represent the correlation matrix among the regressors in the set in question; R22, the correlation
matrix among the other regressors in the model; and R, the correlation matrix among all the regressors in
the model. Fox and Monette show that the squared area, volume, or hyper-volume of the joint confidence
region for the coefficients in either set is expanded by the GVIF = det(R11) det(R22)/det(R),
relative to similar data in which the two sets of regressors are uncorrelated with each other. This measure
is independent of the bases selected to span the subspaces of the two sets of regressors and so is
independent, for example, of the contrast-coding scheme employed for a factor.
19Thanks to a contribution from Henric Nilsson.
7 Drawing Graphs
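The arguments of the default plot method can be listed with the args function;
the output follows:
> args(plot.default)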
function (x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "", main = NULL, sub =
NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE, frame.plot = axes,
panel.first = NULL, panel.last = NULL, asp = NA, …)
NULL
To see in full detail what the arguments mean, consult the documentation for
plot.default ;2 the following points are of immediate interest, however:
The first two arguments, x and y , can provide the horizontal and vertical
coordinates of the points or lines to be plotted, respectively, and also define
a data-coordinate system for the graph. The argument x is required. In
constructing a complex graph, a good initial step is often to use x and y to
establish the ranges for the axes. If we want horizontal coordinates to range
from xmin to xmax and vertical coordinates to range from ymin to ymax ,
then the initial command
> plot(c(xmin, xmax), c(ymin, ymax),
+ type="n", xlab="", ylab="")
is sufficient to set up the coordinate space for the plot, as we will explain in
more detail shortly.
The argument type , naturally enough, determines the type of graph to be
drawn, of which there are several: The default type, "p" , plots points at the
coordinates specified by x and y . The character used to draw the points is
given by the argument pch , which can designate a vector of characters of
the same length as x and y , which may therefore differ for different points.
Specifying type="l" (the letter “el”) produces a line graph, and specifying
type="n" , as in the command above, sets up the plotting region to
accommodate the data but plots nothing. Other types of graphs include "b" ,
both points and lines; "o" , points and lines overlaid; "h" , histogram-like
vertical lines; and "s" and "S" , stairstep-like lines, starting horizontally
and vertically, respectively.
The arguments xlim and ylim may be used to define the limits of the
horizontal and vertical axes; these arguments are usually unnecessary
because R will pick reasonable limits from x and y , but they provide an
additional measure of control over the graph. For example, extending the
limits of an axis can provide room for explanatory text, and contracting the
limits can cause some data to be omitted from the graph. If we are drawing
several graphs, we may want all of them to have the same range for one or
both axes, and this can also be accomplished with the arguments xlim and
ylim .
The log argument makes it easy to define logarithmic axes: log="x"
produces a logged horizontal axis; log="y" , a logged vertical axis; and
log="xy" (or log="yx" ), logged axes for both variables. Base−10
logarithms are used, and the conversion from data values to their logs is
automatic.
xlab and ylab take character string or expression arguments, which are
used to label the axes;3 similarly, the argument main may be used to place a
title above the plot, or the title function may be called subsequently to add
main or axis titles. The default axis label, NULL, is potentially misleading, in
that by default plot constructs labels from the arguments x and y . To
suppress the axis labels, either specify empty labels—e.g., xlab="" —or set
ann=FALSE .
Setting axes=FALSE and frame.plot=FALSE suppresses the drawing of axes
and a box, respectively, around the plotting region. A frame can
subsequently be added by the box function, and axes can be added using the
axis function.
The argument col may be used to specify the color (or colors) for the points
and lines drawn on the plot. (Color specification in R is described in
Section 7.1.4.)
cex (for character expansion) specifies the relative size of the points in the
graph; the default size is cex=1 ; cex may be a vector, indicating the size of
each point individually.
The arguments lty and lwd select the type and width of lines drawn on the
graph. (See Section 7.1.3 for more information on drawing lines.)
For example, the following command sets up the blank plot in Figure 7.1, with
axes and a frame but without axis labels:
> plot(c(0, 1), c(0, 1), type="n", xlab="", ylab="")
7.1.2 GRAPHICS PARAMETERS: par
Many of the arguments to plot , such as pch and col , get defaults from the
par function if they are not set directly in the call to plot . The par function is
used to set and retrieve a variety of graphics parameters and thus is similar to the
options function, which sets global defaults for R—for instance,
Figure 7.1 Empty plot, produced by plot(c(0, 1), c(0, 1), type="n", xlab="", ylab="")
> par("col")
[1] "black"
Consequently, unless their color is changed explicitly, all points and lines in a
standard R graph will be drawn in black. To change the general default plotting
color to red, for example, we could enter the command par(col="red") .
To print the current values of all the plotting parameters, call par with no
arguments. Here is a listing of all the graphics parameters:
> names(par())
Table 7.1 presents brief descriptions of some of the plotting parameters that
can be set by par ; many of these can also be used as arguments to plot and
other graphics functions, but some, in particular the parameters that concern the
layout of the plot window (e.g., mfrow ), can only be set using par , and others
(e.g., usr ) only return information and cannot be set by the user. For complete
details on the plotting parameters available in R, see ?par .
Table 7.1 Some plotting parameters set by par . Parameters marked with a * concern the layout of the
graphics window and can only be set using par , not as arguments to higher-level graphics functions such as
plot ; parameters marked with a + return information only.
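A typical use of par is to change a parameter temporarily and then restore it; a
sketch (the plotting command here is only illustrative):
> oldpar <- par(lwd=2)       # save the previous value of lwd, set lwd=2
> plot(rnorm(100), type="l")
> par(oldpar)                # restore the original setting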
This draws lines at twice their normal thickness. The original setting of par("lwd") is
saved in the variable oldpar , and after the plot is drawn, lwd is reset to its
original value. Alternatively, and usually more simply, closing the current
graphics device window returns the graphical parameters to their default values.
As you might expect, points and lines add points and lines to the current
plot; either function can be used to plot points, lines, or both, but their default
behavior is implied by their names. The argument pch is used to select the
plotting character (symbol), as the following example (which produces Figure
7.2) illustrates:
> plot(1:25, pch=1:25, xlab="Symbol Number", ylab="")
> lines(1:25, type="h", lty="dashed")
The plot command graphs the symbols numbered 1 through 25; because the y
argument to plot isn’t given, an index plot is produced, with the values of x on
the vertical axis plotted against their indices—in this case, also the numbers
from 1 through 25. Finally, the lines function draws broken vertical lines
(selected by lty="dashed" ; see Figure 7.2) up to the symbols; because lines is
given only one vector of coordinates, these too are interpreted as vertical
coordinates, to be plotted against their indices as horizontal coordinates.
Specifying type="h" draws spikes (or histogram-like lines) up to the points.
One can also plot arbitrary characters, as the following example (shown in
Figure 7.3) illustrates:
> head(letters) # first 6 lowercase letters
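A sketch of a plot of this kind (the specific coordinates and labels are our own):
> plot(1:26, rep(1, 26), pch=letters, xlab="", ylab="",
+     axes=FALSE, frame.plot=TRUE)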
Once again, ylab="" suppresses the vertical axis label, axes=FALSE suppresses
tick marks and axes, and frame.plot=TRUE adds a box around the plotting
region, which is equivalent to entering the separate command box() after the
plot command.
As shown in Figure 7.4, several different line types are available in R plots:
> plot(c(1, 7), c(0, 1), type="n", axes=FALSE,
+ xlab="Line Type (lty)", ylab="", frame.plot=TRUE)
> axis(1, at=1:6) # x-axis
> for (lty in 1:6)
+ lines(c(lty, lty, lty + 1), c(0, 0.5, 1), lty=lty)
The lines function connects the points whose coordinates are given by its first
two arguments, x and y . If a coordinate is NA , then the line drawn will be
discontinuous. Line type (lty ) may be specified by number (as here) or by
name, such as "solid" , "dashed" , and so on. Line width is similarly given by
the lwd parameter, which defaults to 1 . The exact effect varies according to the
graphics device used to display the plot, but the general unit seems to be pixels:
Thus, for example, lwd=2 specifies a line 2 pixels wide. We used a for loop (see
Section 8.3.2) to generate the six lines shown in Figure 7.4.
abline
The abline function can be used to add straight lines to a graph.4 We describe
several of its capabilities here; for details and other options, see ?abline .
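A few typical calls (illustrative; they assume the Prestige data from the car
package):
> plot(prestige ~ income, data=Prestige)
> abline(lm(prestige ~ income, data=Prestige), lwd=2)  # least-squares line
> abline(h=mean(Prestige$prestige), lty="dashed")      # horizontal line at the mean of prestige
> abline(v=mean(Prestige$income), lty="dashed")        # vertical line at the mean of income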
Figure 7.6 Grid of horizontal and vertical lines created by the grid function.
The grid function can be used to add a grid of horizontal and vertical lines,
typically at the default axis tick mark positions (see ?grid for details); for
example:
> library(car) # for data
> plot(prestige ~ income, type="n", data=Prestige)
> grid(lty="solid")
> with(Prestige, points(income, prestige, pch=16, cex=1.5))
The resulting graph is shown in Figure 7.6. In the call to grid , we specified
lty="solid" in preference to the default dotted lines. We were careful to plot
the points after the grid, suppressing the points in the initial call to plot . We
invite the reader to see what happens if the points are plotted before the grid.
We sometimes find it helpful to use the locator function along with text to
position text with the mouse; locator returns a list with vectors of x and y
coordinates corresponding to the position of the mouse cursor when the left
button is clicked. Figure 7.7b was constructed as follows:
> plot(c(0, 1), c(0, 1), axes=FALSE, type="n", xlab="", ylab="",
+ frame.plot=TRUE, main="(b)")
> text(locator(), c("one", "two", "three"))
To position each of the three text strings, we moved the mouse cursor to a point
in the plot and clicked the left button. Called with no arguments, locator()
returns pairs of coordinates corresponding to left clicks until the right mouse
button is pressed and Stop is selected from the resulting pop-up context menu
(under Windows) or the esc key is pressed (under Mac OS X). Alternatively, we
can indicate in advance the number of points to be returned as an argument to
locator —locator(3) in the current example—in which case, control returns to
the R command prompt after the specified number of left clicks.
Another useful argument to text , not used in these examples, is adj , which
controls the horizontal justification of text: 0 specifies left justification; 0.5 ,
centering (the initial default, given by par ); and 1 , right justification. If two
values are given, adj=c( x , y ) , then the second value controls vertical
justification.
Sometimes we want to add text outside the plotting area. The function mtext
can be used for this purpose; mtext is similar to text , except that it writes in the
margins of the plot. Alternatively, specifying the argument xpd=TRUE to text or
setting the global graphics option par(xpd=TRUE) also allows us to write outside
the normal plotting region.
The col argument, if specified, gives the color to be used in filling the polygon
(see the discussion of colors in Section 7.1.4).
legend
We used locator to position the legend. We find that this is often easier than
computing where the legend should be placed. Alternatively, we can place the
legend by specifying its location to be one of "topleft", "top", "topright",
"bottomleft", "bottom", or
"bottomright". If we use one of the corners, the argument inset=0.02 will
inset the legend by 2% of the size of the plot.
curve
The graph in the right-hand panel of Figure 7.11 results from the following
commands:
> curve(sin, 0, 2*pi, ann=FALSE, axes=FALSE, lwd=2)
> axis(1, pos=0, at=c(0, pi/2, pi, 3*pi/2, 2*pi),
+ labels=c(0, expression(pi/2), expression(pi),
+ expression(3*pi/2), expression(2*pi)))
> axis(2, pos=0)
> curve(cos, add=TRUE, lty="dashed", lwd=2)
> legend(pi, 1, lty=1:2, lwd=2, legend=c("sine", "cosine"), bty="n")
The pos argument to the axis function, set to 0 for both the horizontal and the
vertical axes, places the axes at the origin. The argument bty="n" to legend
suppresses the box that is normally drawn around a legend.
Similarly, the gray function creates gray levels from black [gray(0)] to white
[gray(1)]:
> gray(0:9/9)
The color codes are represented as hexadecimal (base 16) numbers, of the form
"# RRGGBB " or "# RRGGBBTT " , where each pair of hex digits RR , GG , and BB
encodes the intensity of one of the three additive primary colors— from 00 (i.e.,
0 in decimal) to FF (i.e., 255 in decimal).5 The hex digits TT , if present,
represent transparency, varying from 00 , completely transparent, to FF ,
completely opaque; if the TT digits are absent, then the value FF is implied.
Ignoring transparency, there are over 16 million possible colors.
Specifying colors by name is more convenient, and the names that R
recognizes are returned by the colors function:
> colors()[1:10]
We have shown only the first 10 of over 600 prespecified color names available.
The full set of color definitions appears in the editable file rgb.txt, which resides
in the R etc subdirectory.
The third and simplest way of specifying a color is by number. What the
numbers mean depends on the value returned by the palette function:6
> palette()
Thus, col=c(4, 2, 1) would first use "blue", then "red", and finally "black".
We can enter the following command to see the default palette (the result of
which is not shown because we are restricted to using monochrome graphs):
> pie(rep(1, 8), col=1:8)
R permits us to change the value returned by palette and, thus, to change the
meaning of the color numbers. For example, we used
> palette(rep("black", 8))
to write this Companion, so that all plots are rendered in black and white.7 If you
prefer the colors produced by rainbow , you could set
> palette(rainbow(10))
In this last example, we changed both the palette colors and the number of
colors. The choice
> library(colorspace)
> palette(rainbow_hcl(10))
Most of the analytic and presentation graphs that you will want to create are
easily produced in R. The principal aim of this chapter is to show you how to
construct the small proportion of graphs that require custom work. Such graphs
are diverse by definition, and it would be futile to try to cover their construction
exhaustively. Instead, we will develop an example that uses many of the
functions introduced in the preceding section.
We describe step by step how to construct the diagram in Figure 7.13, which
is designed to provide a graphical explanation of nearest-neighbor kernel
regression, a method of nonparametric regression. Nearest-neighbor
Figure 7.12 The 101 colors produced by gray(0:100/100) , starting with gray(0) (black) and ending with
gray(1) (white).
Figure 7.13 A four-panel diagram explaining nearest-neighbor kernel regression.
Select the grid of points at which to estimate the regression function, either
by selecting a number (say 100) of equally spaced values that cover the
range of x or by using the observed values of x. We follow the latter course
and let x0 be a value from among x1, x2, … , xn, at which we will compute
the corresponding fitted value ŷ(x0). The fitted regression simply joins the
points (xi, ŷ(xi)), after arranging the x values in ascending order.
Given x0, the estimate ŷ(x0) is computed as a weighted average of the yi
corresponding to the m closest xi to the focal value x0, called the nearest
neighbors of x0. We set m = [n × s] for a prespecified fraction s of the data,
called the span, where the square brackets represent rounding to the nearest
integer. The span is a tuning parameter that can be set by the user, with
larger values producing a smoother estimate of the regression function. To
draw Figure 7.13, we set s = 0.5, and thus, m = 0.5 × 190 = 95 points
contribute to each local average.9
The identification of the m nearest neighbors of x0 is illustrated in Figure
7.13a for x0 = x(150), the 150th ordered value of GDP, with the dashed
vertical lines in the graph defining a window centered on x(150) that
includes its 95 nearest neighbors. Selecting x0 = x(150) for this example is
entirely arbitrary, and we could have used any other x value in the grid. The
size of the window is potentially different for each choice of x0, but it
always includes the same fraction of the data. In contrast, fixed-bandwidth
kernel regression fixes the size of the window but lets the number of points
used in the average vary.
The scaled distances between each of the xs and the focal x0 are zi = | xi −
x0|/h0, where h0 is the distance between x0 and the most remote of its m
nearest neighbors. Then, the weights, wi, to be used depend on a kernel
function, as in kernel-density estimation (discussed in Section 3.1.2). We
use the tricube kernel function, setting wi = KT(zi), where KT(z) = (1 − z³)³ for 0 ≤ z < 1 and KT(z) = 0 for z ≥ 1.
The tricube weights, shown in Figure 7.13b, take on the maximum value of
1 at the focal x0 in the center of the window and fall to 0 at the boundaries
of the window.
The y values associated with the m nearest neighbors of x0 are then
averaged, using the tricube weights, to obtain the fitted value ŷ(x0) = Σ wiyi / Σ wi, with the sums taken over the m nearest neighbors of x0.
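The commands that compute the quantities referred to below do not appear in this excerpt. A minimal sketch, assuming the UN data frame from the car package with the variables gdp and infant.mortality (the variable names x0, dist, h, and pick follow the text), is:
> UN2 <- na.omit(UN[ , c("gdp", "infant.mortality")])
> UN2 <- UN2[order(UN2$gdp), ]      # arrange the data in ascending order of GDP
> gdp <- UN2$gdp
> infant <- UN2$infant.mortality
> n <- length(gdp)
> x0 <- gdp[150]                    # the focal value x(150)
> dist <- abs(gdp - x0)             # distances from the focal value
> m <- round(0.5*n)                 # span s = 0.5
> h <- sort(dist)[m]                # distance to the most remote nearest neighbor
> pick <- dist <= h                 # TRUE for observations within the window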
Thus, x0 holds the focal x value, x(150); dist is the vector of distances
between the xs and x0; h is the distance to the most remote point in the
neighborhood for span 0.5 and n = 190; and pick is a logical vector equal to
TRUE for observations within the window and FALSE otherwise.
We draw the first graph using the plot function to define the axes and a
coordinate space:
Figure 7.14 Building up Panel a of Figure 7.13 step by step.
> plot(gdp, infant, xlab="GDP per Capita",
+ ylab="Infant-Mortality Rate", type="n",
+ main="(a) Observations Within the Window\nspan = 0.5")
The \n in the main argument produces a new-line. The result of this command is
shown in the upper-left panel of Figure 7.14. In the upper-right panel, we add
points to the plot, using black for points within the window and light gray for
those outside the window:
> points(gdp[pick], infant[pick], col="black")
> points(gdp[!pick], infant[!pick], col=gray(0.75))
Next, in the lower-left panel, we add a solid vertical line at the focal x0 = x(150)
and broken lines at the boundaries of the window:
> abline(v=x0) # focal x
> abline(v=c(x0 - h, x0 + h), lty="dashed") # window
Finally, in the lower-right panel, we use the text function to display the focal
value x(150) at the top of the panel:
> text(x0, par("usr")[4] + 10, expression(x[(150)]), xpd=TRUE)
The second argument to text , giving the vertical coordinate, makes use of
par("usr") to find the user coordinates of the boundaries of the plotting region.
The command par("usr") returns a vector of the form c( x1 , x2 , y1 , y2 ) ,
and here we pick the fourth element, y2 , which is the maximum vertical
coordinate in the plotting region. Adding 10 —one fifth of the distance between
the vertical tick marks—to this value positions the text a bit above the plotting
region, which is our aim. The argument xpd=TRUE permits drawing outside the
normal plotting region. The text itself is given as an expression , allowing us to
incorporate mathematical notation in the graph, here the subscript (150) , to
typeset the text as x(150).
Panel b of Figure 7.13 is also built up step by step. We begin by setting up the
coordinate space and axes, drawing vertical lines at the focal x0 and at the
boundaries of the window, and horizontal gray lines at 0 and 1:
> plot(range(gdp), c(0, 1),
+ xlab="GDP per Capita", ylab="Tricube Kernel Weight",
+ type="n", main="(b) Tricube Weights")
> abline(v=x0)
> abline(v=c(x0 - h, x0 + h), lty="dashed")
> abline(h=c(0, 1), col="gray")
The first par command leaves room in the top outer margin for the graph title,
which is given in the title command at the end, and establishes the margins for
each panel. The order of margins both for oma (the outer margins) and for mar
(the margins for each panel) are c( bottom , left , top , right ) , and in each
case the units for the margins are lines of text. The fig argument to par
establishes the boundaries of each panel, expressed as fractions of the display
region of the plotting device, in the order c( x-minimum , x-maximum , y-
minimum , y-maximum ) , measured from the bottom-left of the device. Thus, the
first panel, defined by the command par(fig= c(0, 0.5, 0.5, 1)) , extends
from the left margin to the horizontal middle and from the vertical middle to the
top of the plotting device. Each subsequent panel begins with the command
par(new=TRUE) so as not to clear the plotting device, as would normally occur
when a high-level plotting function such as plot is invoked. We use the mtext
command to position the axis labels just where we want them in the margins of
each panel; in the mtext commands, side=1 refers to the bottom margin and side=2
to the left margin of the current panel.
The result is shown in Figure 7.16. The conditioning is a little different from
Figure 4.14, with each panel containing parallel boxplots by sex for each
combination of rank and discipline . The scales argument is a list that
specifies the characteristics of the scales on the axes. For the horizontal or x-
axis, we specified a list with one argument to rotate the level labels by 45°. For
the vertical or y-axis, we specified two arguments: to rotate the labels to
horizontal and to use a base-10 logarithmic scale. We also applied the
useOuterStrips function from the latticeExtra package (Sarkar and Andrews,
2010)12 to get the levels for the second conditioning variable, discipline ,
printed at the left. The strip.custom function allowed us to change the row
labels to Discipline:A and Discipline:B rather than the less informative A and
B.
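The command that produced Figure 7.16 does not appear in this excerpt. A sketch of the kind of call just described, assuming the Salaries data frame from the car package (the exact arguments are assumptions, not the authors' code), is:
> library(lattice)
> library(latticeExtra)
> useOuterStrips(
+     bwplot(salary ~ sex | rank + discipline, data=Salaries,
+         scales=list(x=list(rot=45), y=list(rot=0, log=10))),
+     strip.left=strip.custom(horizontal=FALSE, strip.names=TRUE,
+         var.name="Discipline", sep=":"))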
Graphs produced with lattice are based on a different metaphor from standard
R graphics, in that a plot is usually specified in a single call to a graphics
function, rather than by adding to a graph in a series of independently executed
commands. As a result, the command to create a customized lattice graph can be
very complex. The key arguments include those for panel functions, which
determine what goes into each panel; strip functions, as we used above, to
determine what goes into the labeling strips; and scale functions, which control
the axis scales. Both the lattice and the latticeExtra packages contain many
prewritten panel functions likely to suit the needs of most users, or you can write
your own panel functions.
In addition to scatterplots produced by xyplot and boxplots produced by
bwplot , as we have illustrated here, lattice includes 13 other high-level plotting
functions, for dot plots, histograms, various three-dimensional plots, and more,
and the latticeExtra package adds several more high-level functions. The book
by Sarkar (2008) is very helpful, providing dozens of examples of lattice graphs.
The lattice package is based on a lower-level, object-oriented graphics system
provided by the grid package, which is described by Murrell (2006, Part II).
Functions in the grid package create and manipulate editable graphics objects,
thus relaxing the indelible-ink-on-paper metaphor underlying basic R graphics
and permitting fine control over the layout and details of a graph. Its power
notwithstanding, it is fair to say that the learning curve for grid graphics is steep.
7.3.2 MAPS
R has several packages for drawing maps, including the maps package
(Becker et al., 2010).13 Predefined maps are available for the world and for
several countries, including the United States. Viewing data on maps can often
be illuminating. As a brief example, the data frame Depredations in the car
package contains data from Harper et al. (2008) on incidents of wolves killing
farm animals, called depredations, in Minnesota for the period 1979–1998:
> head(Depredations)
Figure 7.17 Wolf depredations in Minnesota. The areas of the dots are proportional to the number of
depredations.
The data include the longitude and latitude of the farms where depredations
occurred and the number of depredations at each farm for the whole period
(1979–1998), and separately for the earlier period (1991 or before) and for the
later period (after 1991). Management of wolf-livestock interactions is a
significant public policy question, and maps can help us understand the
geographic distribution of the incidents.
> library(maps)
> par(mfrow=c(1, 2))
> map("county", "minnesota", col=gray(0.4))
> with(Depredations, points(longitude, latitude,
+ cex=sqrt(early), pch=20))
> title("Depredations, 1976−1991", cex.main=1.5)
> map("county", "minnesota", col=grey(0.4))
> with(Depredations, points(longitude, latitude,
+ cex=sqrt(late), pch=20))
> title("Depredations, 1992−1998", cex.main=1.5)
To draw separate maps for the early and late periods, we set up the graphics
device with the mfrow graphics parameter. The map function was used to draw
the map, in this case a map of county boundaries in the state of Minnesota. The
coordinates for the map are the usual longitude and latitude, and the points
function is employed to add points to the plot, with areas proportional to the
number of depredations at each location. We used title to add a title to each
panel, with the argument cex.main to increase the size of the title. The maps tell
us where in the state the wolf-livestock interactions occur and where the farms
with the largest number of depredations can be found. The range of depredations
has expanded to the south and east between time periods. There is also an
apparent outlier in the data—one depredation in the southeast of the state in the
early period.
Choropleth maps, which color geographic units according to the values of one
or more variables, can be drawn using lattice graphics with the mapplot function
in the latticeExtra package. These graphs are of little value without color, so we
don’t provide an example here. See the examples on the help page for mapplot .
Figure 7.18 Scatterplot of prestige by income for the Canadian occupational-prestige data, produced by
the ggplot2 function qplot .
The first command calls the pdf function to open a graphics device of type PDF
(Portable Document Format), which will create the graph as a PDF file in the
working directory. The graph is then drawn by the second command, and finally
the dev.off function is called to close the device and the file. All graphical
output is sent to the PDF device until we invoke the dev.off command. The
completed graph, in mygraph.pdf , can be used like any other PDF file. The
command ?Devices gives a list of available graphics devices, and, for example,
?pdf explains the arguments that can be used to set up the PDF device.
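The commands described in the preceding paragraph are not reproduced in this excerpt; a minimal sketch of that sequence (the plotting command is an arbitrary example) is:
> pdf("mygraph.pdf", width=7, height=7)   # open a PDF device; the file is created in the working directory
> plot(prestige ~ income, data=Prestige)  # any graphics commands could appear here
> dev.off()                               # close the device and complete the file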
Three useful mechanisms are available for viewing multiple graphs. One
approach is the procedure that we employed in most of this Companion: drawing
several graphs in the same window, using par(mfrow=c( rows , columns )) , for
example, to divide a graphics device window into panels.
A second approach makes use of the graphics history mechanism. Under
Windows, we activate the graphics history by selecting History → Recording
from the graphics device menus; subsequently, the Page Up and Page Down keys
can be used to scroll through the graphs. Under Mac OS X, we can scroll through
graphs with the command-← and command-→ key combinations when a
graphics device window has the focus.
The final method is to open more than one graphics window, and for this we
must open additional windows directly, not rely on a call to plot or a similar
high-level graphics function to open the windows. A new graphics window can
be created directly in the Windows version of R by the windows function and
under Mac OS X by the quartz function. If multiple devices are open, then only
one is active and all others are inactive. New graphs are written to the active
graphics device. The function dev.list returns a vector of all graphics devices
in use; dev.cur returns the number of the currently active device; and dev.set
sets the active device.
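A brief illustration of this mechanism (a sketch; windows is available only under Windows, and quartz should be substituted under Mac OS X):
> windows()     # open a first additional device
> windows()     # open a second; it becomes the active device
> dev.list()    # the numbers of all open devices
> dev.set(2)    # make device 2 active; new graphs now go there
> dev.cur()     # confirm which device is active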
1See Sections 1.4 and 8.7 for an explanation of how generic functions and their methods work in R.
2In general, in this chapter we will not discuss all the arguments of the graphics functions that we describe.
Details are available in the documentation for the various graphics functions. With a little patience and trial
and error, we believe that most readers of this book will be able to master the subtleties of R documentation.
3An expression can be used to produce mathematical notation in labels, such as superscripts, subscripts, and
Greek letters. The ability of R to typeset mathematics in graphs is both useful and unusual. For details, see
?plotmath and Murrell and Ihaka (2000); also see Figures 7.11 (p. 342) and 7.13 (p. 345) for examples.
4When logarithmic axes are used, abline can also draw the curved image of a straight line on the original
scale.
5Just as decimal digits run from 0 through 9, hexadecimal digits run from 0 through 9, A, B, C, D, E, F,
representing the decimal numbers 0 through 15. The first hex digit in each pair is the 16s place and the
second is the ones place of the number. Thus, e.g., the hex number #39 corresponds to the decimal number 3
× 16 + 9 × 1 = 57.
6At one time, the eighth color in the standard R palette was "white" . Why was that a bad idea?
7More correctly, all plots that refer to colors by number are black and white. We could still get other colors
by referring to them by name or by their RGB values. Moreover, some graphics functions select colors
independently of the color palette.
8We previously encountered the scatterplot for these two variables in Figure 3.15 (p. 129).
9For values of x0 in the middle of the range of x, typically about half the nearest neighbors are smaller than
x0 and about half are larger than x0, but for x0 near the minimum (or maximum) of x, almost all the nearest
neighbors will be larger (or smaller) than x0, and this edge effect will introduce boundary bias into the
estimated regression function. By fitting a local-regression line rather than a local average, the lowess
function reduces bias near the boundaries. Modifying Figure 7.13 to illustrate nearest-neighbor local-linear
regression rather than kernel regression is not hard: Simply fit a WLS regression, and compute a fitted value
at each focal x. We leave this modification as an exercise for the reader.
10The ifelse command is described in Section 8.3.1.
11for loops are described in Section 8.3.2.
12While the lattice package is part of the standard R distribution, the latticeExtra package must be
obtained from CRAN.
13The maps package is not part of the standard R distribution, so you must obtain it from CRAN.
8 Writing Programs
Data sets are frequently encountered that must be modified before they
can be used by lm or glm . For example, you may need to change values
of −99 to the missing value indicator NA , or you may regularly want to
recode a variable with values 0 and 1 to the more informative labels Male
and Female . Writing functions for these kinds of data management
operations, and others that are more complicated, can automate the
process of preparing similar data sets.
Output from some existing R functions must be rearranged for
publication. Writing a function to format the output automatically in the
required form can save having to retype it.
A graph is to be drawn that requires first performing several
computations to produce the right data to plot and then using a sequence
of graphics commands to complete the plot. A function written once can
automate these tasks.
A simulation is required to investigate robustness or estimator error, or
otherwise to explore the data. Writing a function to perform the
simulation can make it easier to vary factors such as parameter values,
estimators, and sample sizes.
A nonstandard model that doesn’t fit into the usual frameworks can be fit
by writing a special-purpose function that uses one of the available
function optimizers in R.
This list is hardly exhaustive, but it does illustrate that writing functions can
be useful in a variety of situations.
The goal of this chapter is to provide you with the basics of writing R
functions to meet immediate needs in routine and not so routine data analysis.
Although this brief introduction is probably not enough for you to write
polished programs for general use, it is nevertheless worth cultivating good
programming habits, and the line between programs written for one’s own use
and those written for others is often blurred. Recommendations for further
reading on R programming are given at the end of the chapter.
When the function is called, its formal arguments are set equal to real
arguments, either supplied explicitly or given by default.2 For example:
> inflPlot(lm(prestige ~ income + education, data=Duncan))
> m1 <- lm(prestige ~ income + education, data=Duncan)
> inflPlot(m1)
> inflPlot(model=m1)
> inflPlot(m1, scale=10, col=c(1, 2), identify=TRUE,
+ labels=names(rstud))
All these commands produce the same plot (not shown). In the first command,
the first formal argument is replaced by the linear-model object returned by the
command lm(prestige~income + education, data=Duncan) , and all the
remaining arguments are replaced by their default values. Before the second
inflPlot command, an object named m1 is created that is the result of fitting the
same linear model as in the first command. This object is supplied as the model
argument in the second inflPlot command. The third inflPlot command
explicitly assigns the argument model=m1 , while the first two commands relied
on matching the model argument by its position in the sequence of formal
arguments. It is common in R commands to use positional matching for the first
two or three arguments, and matching by name for the remaining arguments.
Matching by name is always required when the arguments are supplied in a
different order from the order given in the function definition. Named arguments
may be abbreviated as long as the abbreviation is unique. For example, because
no other argument to inflPlot begins with the letter s, the argument scale may
be abbreviated to scal , sca , sc , or s . The final inflPlot command explicitly
sets the remaining arguments to their default values.
The formal ellipses argument … is special in that it may be matched by any
number of real arguments when the function is called, and it is commonly used
to soak up extra arguments to be passed to other functions. In inflPlot , the
inclusion of … allows adding arguments that are passed to the plot function
without specifying those arguments in the definition of inflPlot . When
inflPlot is called without additional arguments, then none are passed to plot .3
Along with function arguments, variables defined within the body of a
function are local to the function: They exist only while the function executes
and do not interfere with variables of the same name in the global environment;
for example:
> squareit <- function(x) {
+ y <- x^2
+ y
+ }
> y <- 3
> squareit(x=y)
[1] 9
> y
[1] 3
The value of y is unchanged in the global environment even though the local
variable y in the squareit function took on a different value.
Although formal arguments are associated with real arguments when a
function is called, an argument is not evaluated until its first use. At that point,
the argument is evaluated in the environment from which the function was
called. In contrast, default values for arguments, if they exist, are evaluated in
the environment of the function itself. This process of lazy evaluation frequently
proves efficient and convenient. The default value of one argument can depend
on another argument or even on a value computed in the body of the function. In
inflPlot , for example, the default value of labels references the local variable
rstud . Lazy evaluation can trip up the unwary programmer, however.4
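A small illustration of how a default can depend on a value computed in the body of the function (the function f here is hypothetical, not from the text):
> f <- function(x, n=length(x)) {
+     x <- x[!is.na(x)]  # x is modified before n is first used ...
+     sum(x)/n           # ... so the default n = length(x) is evaluated for the reduced x
+ }
> f(c(1, 2, NA, 3))      # returns 2, the mean of the three nonmissing values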
Reviewing the inflPlot function in Figure 8.1 line by line may be helpful,
even though most of the R commands used in the function are self-explanatory:
The first few lines of the function, which begin with the character # , are
comments, used in the program code to explain what the function does and
what the arguments mean. This is generally a good practice if you plan to
save the function.
The next three lines use the R functions hatvalues , rstudent , and
cooks.distance to extract the hat-values, Studentized residuals, and
Cook’s distances from the model that was passed as an argument to
inflPlot .
The value of scale is used to determine the sizes of the points to be drawn,
and it is computed to be the default size given as an argument divided by
the square root of the maximum value of Cook’s distance.
The argument na.rm=TRUE is required in the max function or the scale factor
would be NA if missing values are present.
As explained in Chapter 7, the graph is drawn by first calling plot and then
other functions to add features to the plot. The axis labels will be Hat-
Values and Studentized Residuals , the size of the plotted points is given
by scale*cook , and the color of the points depends on whether or not cook
> cutoff .
The abline function is used in inflPlot to add vertical and horizontal
reference lines to the plot—the former at twice and three times the average
hat-value and the latter at Studentized residuals of −2, 0, and 2 and at the
Bonferroni cutoffs for a two-sided outlier test at the .05 level. The argument
lty="dashed" specifies broken lines.
If the argument identify is TRUE , then points with noteworthy values of
Cook’s D, Studentized residuals, or hat-values are labeled with their
observation names, taken by default from names(rstud) , the names of
the Studentized residuals, which are the row names of the data frame given
as an argument to lm or glm . The label is positioned to the left of the point
if its hat-value is greater than the midrange of the hat-values (i.e., the
average of the minimum and maximum values), and to the right otherwise.
Figure 8.2 A graph produced by the inflPlot function.
The resulting graph appears in Figure 8.2. We set col=gray(c(0.5, 0)) , which
corresponds to medium gray and black, because the default colors in R—black
and red—would not reproduce properly in the book; specifying ylim=c(−3, 4)
expands the range of the vertical axis, and las=1 makes all tick mark labels
horizontal. These are specified here to illustrate the use of arguments passed
down to plot via … .
Using * to multiply two matrices of the same order forms their element-wise
product. The standard matrix product is computed with the inner-product
operator, %*% , which requires that the matrices be conformable for
multiplication:
[1] 1 1 1
[1] 1 5 3
> C %*% a
[,1]
[1,] 0
[2,] 1
[3,] 4
> a %*% C
> a %*% b
[,1]
[1,] 9
The last of these examples illustrates that the inner product of two vectors of the
same length, a %*% b , is a scalar—actually a 1 × 1 matrix in R.
The outer product of two vectors may be obtained via the outer function:
The outer function may also be used with operations other than multiplication;
an optional third argument, which defaults to "*" , specifies the function to be
applied to pairs of elements from the first two arguments.
The function t returns the transpose of a matrix:
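For instance:
> (B <- matrix(1:6, 2, 3))
> t(B)   # the 3-by-2 transpose of B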
The fractions function in the MASS package may be used to display numbers
as rational fractions, which is often convenient when working with simple matrix
examples:
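A small illustration (not the text's own example):
> library(MASS)
> fractions(solve(matrix(c(2, 1, 1, 2), 2, 2)))  # the inverse displayed as rational fractions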
The solve function may also be used more generally to solve systems of
linear simultaneous equations. If C is a known square and nonsingular matrix, b
is a known vector or matrix, and x is a vector or matrix of unknowns, then the
solution of the system of linear simultaneous equations Cx = b is x = C⁻¹b,
which is computed in R by
> solve(C, b)
[1] 2.875 2.375 4.500
[,1]
[1,] 2.875
[2,] 2.375
[3,] 4.500
> head(y)
Selecting the single column for the response prestige from the data frame
Prestige produces a vector rather than a one-column matrix because R by
default drops dimensions with extent one. We can circumvent this behavior by
specifying drop=FALSE (see Section 2.3.4 on indexing), which not only produces
a matrix (actually, here, a data frame) with one column but also retains the row
labels:
> head(Prestige[ , "prestige", drop=FALSE])
prestige
gov.administrators 68.8
general.managers 69.1
accountants 63.4
purchasing.officers 56.8
chemists 73.5
physicists 77.6
Although some R functions are fussy about the distinction between a vector and
a single-column matrix, in the current example either will do.
The usual formula for the least-squares coefficients is b = (X′X)⁻¹X′y. It is
simple to write this formula directly as an R expression:
> solve(t(X) %*% X) %*% t(X) %*% y
[,1]
−6.794334
education 4.186637
income 0.001314
women −0.008905
Call:
lm(formula = prestige ~ education + income + women, data = Prestige)
[1] 0.4365
Depending on its argument, the diag function may be used to extract or to set
the main diagonal of a matrix, to create a diagonal matrix from a vector, or to
create an identity matrix of specified order:
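For example (a brief illustration of the uses just listed):
> (A <- matrix(1:9, 3, 3))
> diag(A)           # extract the main diagonal: 1 5 9
> diag(A) <- 0      # set the main diagonal
> diag(c(1, 2, 3))  # create a diagonal matrix from a vector
> diag(3)           # create an identity matrix of order 3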
The MASS package includes a function, ginv , for computing generalized
inverses of square and rectangular matrices. Advanced facilities for matrix
computation, including for efficiently storing and manipulating large, sparse
matrices, are provided by the Matrix package.
8.3.1 CONDITIONALS
The basic construct for conditional evaluation in R is the if statement, which
takes one of the following two general forms:
1. if ( logical.condition ) command
2. if ( logical.condition ) command else alternative.command
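The simple example function discussed below, abs1 , is not listed in this excerpt; from the warning message reproduced further down, its definition is evidently equivalent to:
> abs1 <- function(x) if (x < 0) -x else x
> abs1(-5)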
[1] 5
> abs1(5)
[1] 5
Of course, in a real application we would use the abs function in R for this
purpose.
When abs1 is applied to a vector, it does not produce the result that we
probably intended, because only the first element in the condition x < 0 is used.
In the illustration below, the first element is less than 0, and so the condition
evaluates to TRUE ; the value returned by the function is therefore the negative of
the input vector. A warning message is also printed by R:
> abs1(-3:3) # wrong! the first element, -3, controls the result
[1] 3 2 1 0 −1 −2 −3
Warning message:
In if (x < 0) −x else x :
the condition has length > 1
and only the first element will be used
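A vectorized version written with the ifelse function (a reconstruction standing in for the listing missing from this excerpt) behaves correctly:
> abs2 <- function(x) ifelse(x < 0, -x, x)
> abs2(-3:3)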
[1] 3 2 1 0 1 2 3
[1] 0.254
[1] 0.254
[1] 0.9144
[1] 91.44
[1] 8047
Error in match.arg(units) :
’arg’ should be one of "inches", "feet", "yards", "miles"
n! = n × (n − 1) × … × 2 × 1 for n ≥ 1
0! = 1
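The loop-based function fact1 discussed below is not listed in this excerpt; a reconstruction consistent with the description (and assuming an integer argument of at least 1) is:
> fact1 <- function(x){
+     f <- 1                     # initialize the accumulator
+     for (i in 1:x) f <- f*i    # accumulate the factorial product in the loop
+     f                          # implicitly return the result
+ }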
This, too, is an artificial problem: We can calculate the factorial of n very easily
in R, as factorial(n) ; or as gamma(n + 1) ; or, for n > 0, less efficiently as
prod(1:n) . In fact1 , we initialize the local variable f to 1 , then accumulate the
factorial product in the loop, and finally implicitly return the accumulated
product as the result of the function. It is also possible to return a result explicitly
—for example, return(1) in the first line of the fact1 function—which causes
execution of the function to cease at that point.
The function fact1 does not verify that its argument is a nonnegative integer,
and so, for example,
> fact1(5.2)
[1] 120
returns 5! = 120 rather than reporting an error. This incorrect result occurs here
because 1:5.2 expands to the values 1, 2, 3, 4, 5 , effectively replacing the
argument x=5.2 by truncating to the largest integer less than or equal to x . To
return an incorrect answer would be unacceptable in a general-purpose program.
For a quick-and-dirty program written for our own use, however, this might be
acceptable behavior: We would only need to ensure that the argument to the
function is always a nonnegative integer.
Here is another version of the program, adding an error check:
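The error-checked version itself is missing from this excerpt; a sketch consistent with the discussion of stop and || that follows is:
> fact2 <- function(x){
+     if (!is.numeric(x) || length(x) != 1 || x < 0 || x != floor(x))
+         stop("x must be a single nonnegative integer")
+     f <- 1
+     for (i in 1:x) f <- f*i
+     f
+ }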
The function stop ends execution of a function immediately and prints its
argument as an error message. The double-or operator || differs from | in two
respects:
Analogous comments apply to the double-and operator && : The right argument
of && is evaluated only if its left argument is TRUE .
The general format of the for statement is for ( loop.variable in values ) command , where command is executed once for each element of values , with loop.variable set successively to each element.
The indentation of the lines in the function reflects its logical structure, such as
the scope of the while loop. R does not enforce rules for laying out functions in
this manner, but programs are much more readable if the program code is
indented to reveal their structure.
The general format of a while loop is while ( logical.condition ) command , and the general format of a repeat loop is simply
repeat command
8.3.3 RECURSION
Recursive functions call themselves. Recursion is permissible in R and can
provide an elegant alternative to looping. We ignore error checking for clarity:
> fact5 <- function(x){
+     if (x <= 1) 1          # termination condition
+     else x * fact5(x - 1)  # recursive call
+ }
> fact5(5)
[1] 120
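The tracing commands themselves do not appear in this excerpt; the output below is evidently produced by commands of the form:
> trace(fact5)
> fact5(5)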
trace: fact5(5)
trace: fact5(x − 1)
trace: fact5(x − 1)
trace: fact5(x − 1)
trace: fact5(x − 1)
[1] 120
> untrace(fact5)
A potential pitfall of having a function call itself recursively is that the name
of the function can change by assignment (here to fact6 ):
> fact6 <- fact5
> remove(fact5)
> fact6(5) # tries to call the removed fact5
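This call now fails with an error of the form could not find function "fact5" , because the body of fact6 still refers to fact5 by name. The remedy discussed below is the Recall function, which always calls the currently executing function, whatever it happens to be named. The listing for the Recall-based version is not preserved in this excerpt; a reconstruction (the name fact7 is ours) is:
> fact7 <- function(x){
+     if (x <= 1) 1
+     else x*Recall(x - 1)   # Recall refers to the current function, regardless of its name
+ }
The two values of 120 shown below are evidently produced by calling such a Recall-based function, before and after copying it to a new name and removing the original.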
[1] 120
[1] 120
In contexts in which we know that a function will not be renamed, we prefer not
to use Recall .
We can calculate the scale score for each subject by applying the sum function
over the rows (the first coordinate) of the data frame:
> DavisThin$thin.drive <- apply(DavisThin, 1, sum)
> head(DavisThin$thin.drive, 10)
[1] 0 0 0 0 0 1 8 19 3 15
We have chosen to add a variable called thin.drive to the data frame rather
than to define the scale in the working data; consequently, the new eighth
column of the data frame is named thin.drive and has values that are the row
sums of the preceding seven columns.
Similarly, if we are interested in the column means of the data frame, they
may be simply calculated as follows, by averaging over the second (column)
coordinate:
In these two simple cases, we can more efficiently use the functions rowSums
and colMeans ; for example:
If we simply apply sum over the rows of the data frame, then the result will be
missing for observations with any missing items, as we can readily verify:
A simple alternative is to average over the items that are present, multiplying
the resulting mean by 7 (to restore 0–21 as the range of the scale); this procedure
can be implemented by defining an anonymous function in the call to apply :
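The command is not preserved in this excerpt; a sketch consistent with the description that follows, indexing the seven item columns explicitly, is:
> DavisThin$thin.drive <- apply(DavisThin[ , 1:7], 1,
+     function(x) 7*mean(x, na.rm=TRUE))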
The anonymous function has a single argument that will correspond to a row of
the data frame; thus, in this case we compute the mean of the nonmissing values
in a row and then multiply by 7, the number of items in the scale. The
anonymous function disappears after apply is executed.
Suppose that we are willing to work with the average score if more than half
of the 7 items are valid but want the scale to be NA if there are 4 or more missing
items:
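Again a sketch (not the original command):
> DavisThin$thin.drive <- apply(DavisThin[ , 1:7], 1, function(x){
+     if (sum(is.na(x)) >= 4) NA else 7*mean(x, na.rm=TRUE)
+ })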
The lapply and sapply functions are similar to apply but reference the
successive elements of a list. To illustrate, we convert the data frame DavisThin
to a list:
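The conversion commands are not shown in this excerpt; they would be of the form
> thin.list <- as.list(DavisThin[ , 1:7])
> str(thin.list)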
The list elements are the variables from the data frame. We used str (see Section
2.6) to avoid printing the entire contents of thin.list . To calculate the mean of
each list element eliminating missing data:
> lapply(thin.list, mean, na.rm=TRUE)
$DT1
[1] 0.466
$DT2
[1] 1.026
$DT3
[1] 0.9577
$DT4
[1] 0.3439
$DT5
[1] 1.116
$DT6
[1] 0.932
$DT7
[1] 0.5654
In this example, and for use with apply as well, the argument na.rm=TRUE is
passed to the mean function, so an equivalent statement would be
> lapply(thin.list, function(x) mean(x, na.rm=TRUE))
The lapply function returns a list as its result; sapply works similarly, but
tries to simplify the result, in this case returning a vector with named elements:
The mapply function is similar to sapply but is multivariate in the sense that it
processes several vector arguments simultaneously. Consider the integrate
function, which approximates definite integrals numerically and evaluates an
individual integral.5 The function dnorm computes the density function of a
standard-normal random variable. To integrate this function between –1.96 and
1.96:
> (result <- integrate(dnorm, lower=-1.96, upper=1.96))
The printed representation of the result of this command is shown above. The
class of the returned value result is "integrate" , and it is a list with five
elements:
The element result$value contains the value of the integral.
To compute areas under the standard-normal density for a number of intervals
—such as the adjacent, nonoverlapping intervals ( −∞, −3), ( −3, −2), ( −2, −1), (
−1, 0), ( 0, 1), ( 1, 2), ( 2, 3), and (3, ∞)—we can vectorize the computation with
mapply :
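A sketch of such a computation (the original command is not preserved here):
> low <- c(-Inf, -3, -2, -1, 0, 1, 2, 3)
> high <- c(-3, -2, -1, 0, 1, 2, 3, Inf)
> mapply(function(lo, hi) integrate(dnorm, lo, hi)$value, lo=low, hi=high)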
The factor partner.status has levels "low" and "high" ; the factor fcategory
has levels "low" , "medium" , and "high" ; and the response, conformity , is a
numeric variable. We may, for example, use tapply to calculate the mean
conformity within each combination of levels of partner.status and
fcategory . We first redefine the factor fcategory so that its levels are in their
natural order rather than in the default alphabetical order:
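The commands are not preserved in this excerpt; a sketch, assuming the data reside in the Moore data frame from the car package (the list-element names are ours), is:
> Moore$fcategory <- factor(Moore$fcategory,
+     levels=c("low", "medium", "high"))
> with(Moore, tapply(conformity,
+     list(Status=partner.status, Authoritarianism=fcategory), mean))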
The factors by which to cross-classify the data are given in a list as the second
argument to tapply ; names are supplied optionally for the list elements to label
the output. The third argument, the mean function, is applied to the values of
conformity within each combination of levels of partner.status and
fcategory .
Not only does the second solution fail for a problem of this magnitude on the
system on which we tried it, a 32-bit Windows system, but it is also slower on
smaller problems. We invite the reader to redo this problem with ten thousand 10
× 10 matrices, for example.
A final note on the problem:
> # opaque & wastes memory!
> S <- rowSums(array(unlist(matrices),
+     dim = c(10, 10, 10000)), dims = 2)
is approximately as fast as the loop for the smaller version of the problem but
fails on the larger one.
The lesson: Avoid loops when doing so produces clearer and possibly more
efficient code, not simply to avoid loops.
All the derivatives in Equation 8.1 are evaluated at bt. The Taylor series
approximates g( β) using the quadratic polynomial in β on the right side of
Equation 8.1. The remainder represents the error in approximation. We do this
because finding the maximum of a quadratic polynomial when we ignore the
remainder is easy: Simply differentiate the right side of Equation 8.1 with
respect to β, set the result to 0, and solve to get the update bt+1 = bt − g′( bt)/g″( bt).
Provided that the second derivative is negative, bt+1 is the new guess at the
maximizer of g; we repeat this procedure, stopping when the value of bt+1
doesn’t change much. The Newton-Raphson algorithm works remarkably well in
many statistical problems, especially problems with a well-behaved likelihood
like most GLMs.
The generalization of Newton-Raphson to many parameters starts with a
vector version of Equation 8.1,
again with the derivatives evaluated at bt. Here, β and bt are ( k + 1) ×1 vectors.8
The vector of first derivatives in Equation 8.3 is ( k + 1) ×1 and is called the
gradient or score vector. The second derivatives multiplied by −1 compose the (
k + 1) ×( k + 1) Hessian matrix. The Newton-Raphson update is
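The displayed equations do not survive in this excerpt. Written out in standard notation (a reconstruction consistent with the definitions given next), the update and, for binary logistic regression, the gradient and negative Hessian are
$$
\mathbf{b}_{t+1} = \mathbf{b}_t + \left[ -\frac{\partial^2 \log_e L}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'} \right]^{-1} \frac{\partial \log_e L}{\partial \boldsymbol{\beta}},
\qquad
\frac{\partial \log_e L}{\partial \boldsymbol{\beta}} = \mathbf{X}'(\mathbf{y} - \mathbf{p}_t),
\qquad
-\frac{\partial^2 \log_e L}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'} = \mathbf{X}' \mathbf{V}_t \mathbf{X},
$$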
where X is the model matrix, with xi as its ith row; y is the response vector,
containing 0s and 1s, with ith element yi; pt is the vector of fitted response
probabilities from the last iteration, the ith entry of which is pit = 1/[1 + exp(−xi′bt)];
and Vt is a diagonal matrix, with diagonal entries pit(1 − pit). Equation 8.4
becomes
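the update (Equation 8.7 in the original; reconstructed here)
$$
\mathbf{b}_{t+1} = \mathbf{b}_t + (\mathbf{X}' \mathbf{V}_t \mathbf{X})^{-1} \mathbf{X}' (\mathbf{y} - \mathbf{p}_t).
$$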
Equation 8.7 is fit repeatedly until bt is close enough to bt−1. At convergence, the
estimated asymptotic covariance matrix of the coefficients is given by the
inverse of the Hessian matrix ( X′VX)−1, which is, conveniently, a by-product of
the procedure.
An implementation of Newton-Raphson for the binary logistic-regression
problem is shown in Figure 8.3. Input to the function consists of a matrix X of
the regressors and a vector y of the 0−1 responses, and so this function works for
binary logistic regression, not the more general binomial logistic regression. The
next two arguments control the computation, specifying the maximum number of
iterations and a convergence tolerance. Newton-Raphson is remarkably stable
for GLMs, and leaving max.iter at its default value of 10 is usually adequate.
The final argument is a flag for printing information about the values of the
estimated parameters at each iteration, with the default being not to print this
information.
Figure 8.3 The lreg1 function implementing the Newton-Raphson algorithm for binary logistic regression.
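The listing in Figure 8.3 is not reproduced in this excerpt; the following reconstruction is consistent with the description in the text (the argument names and the exact form of the convergence test are assumptions):
lreg1 <- function(X, y, max.iter=10, tol=1E-6, verbose=FALSE){
    # X: matrix of regressors (without the constant); y: vector of 0/1 responses
    X <- cbind(1, X)                 # append a column of 1s for the intercept
    b <- b.last <- rep(0, ncol(X))   # starting values
    it <- 1                          # iteration counter
    while (it <= max.iter){
        if (verbose) cat("iteration =", it, "\n")
        p <- as.vector(1/(1 + exp(-X %*% b)))
        V <- diag(p*(1 - p))                 # n-by-n diagonal weight matrix (inefficient)
        var.b <- solve(t(X) %*% V %*% X)     # inverse of the Hessian
        b <- b + var.b %*% t(X) %*% (y - p)  # Newton-Raphson update
        if (max(abs(b - b.last)/(abs(b.last) + 0.01*tol)) < tol) break
        b.last <- b
        it <- it + 1
    }
    if (it > max.iter) warning("maximum number of iterations exceeded")
    list(coefficients=as.vector(b), var=var.b, it=it)
}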
The function only fits models with an intercept, and to accomplish this, a
vector of 1s is appended to the left of X in the first noncomment line in the
function. The next line initializes the current and previous values of b to vectors
of 0s. Unlike most problems to which Newton-Raphson is applied, setting the
starting values b0 = 0 is good enough to get convergence for most GLMs. The
Newton-Raphson algorithm proper is inside the while loop. We continue as long
as neither convergence nor max.iter is reached. If verbose=TRUE , the cat
function is used to print information; \n is the new-line character and is required
because cat does not automatically start each printed output on a new line. The
remainder of the loop does the computations. The if statement is the test for
convergence by computing the maximum absolute proportional change in the
coefficients and comparing it with the value of tol ; if convergence is reached,
then break is executed and the loop is exited.9 Finally, the returned value is a list
with the components coefficients , var for the inverse of the Hessian, and it
for the number of iterations required by the computation.
This function uses both memory and time inefficiently by forming the n×n
diagonal matrix V, even though only its n diagonal elements are nonzero. More
efficient versions of the function will be presented later in this chapter
(particularly in Section 8.6.2).
To illustrate the application of lreg1 , we return to Mroz’s labor force
participation data, employed as an example of logistic regression in Section 5.3:
The response variable, lfp , and two of the predictors, wc and hc , are factors.
Unlike glm , the lreg1 function will not process factors, and so these variables
must be converted to numeric dummy regressors (but see Section 8.8, where we
explain how to write functions that use R model formulas). This task is easily
accomplished with ifelse commands:
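A sketch of the conversions (the Mroz factors have levels "no" and "yes" ):
> Mroz$lfp <- ifelse(Mroz$lfp == "yes", 1, 0)
> Mroz$wc <- ifelse(Mroz$wc == "yes", 1, 0)
> Mroz$hc <- ifelse(Mroz$hc == "yes", 1, 0)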
The negative log-likelihood and the negative gradient are defined as local
functions, negLogL and grad , respectively. Like local variables, local
functions exist only within the function in which they are defined.
Even though X and y are local variables in lreg2 , they are passed as
arguments to negLogL and grad , along with the parameter vector b . This is
not strictly necessary (see the discussion of lexical scoping in R in Section
8.9), but doing so allows us to show how to pass additional arguments
through the optimizer.
The optim function in R provides several general optimizers. We have had
good luck with the BFGS method for this kind of problem, and consequently,
we have made this the default, but by providing a method argument to
lreg2 and passing this argument down to optim , we also have made it easy
for the user of lreg2 to substitute another method. See ?optim for details.
– The first argument to optim gives starting values for the parameter
estimates, in this case a vector of 0s.
– The second argument gives the objective function to be minimized, the
local function negLogL , and the third argument gives the gradient,
gr=grad . The first argument of the objective function and gradient must
be the parameter vector—b in this example. If the gradient is not given
as an argument, optim will compute it numerically.
– Specifying hessian=TRUE asks optim to return the Hessian, the inverse
of which provides the estimated covariance matrix of the coefficients.
The Hessian is computed numerically: optim does not allow us to
supply an expression for the Hessian. Because we know the Hessian
from Equation 8.6 (p. 387), we could have computed it outside of lreg2
, but in many optimization problems the Hessian will be much more
complicated, and letting the optimizer approximate it numerically is
usually a reasonable approach.
– As explained, the method argument specifies the optimization method to
be employed.
– The two remaining arguments, the model matrix X and the response
vector y , are passed by optim to negLogL and grad .
optim returns a list with several components. We pick out and return the
parameter estimates, the Hessian, the value of the objective function at the
minimum (which is used to compute the deviance), and a code indicating
whether convergence has been achieved.
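Figure 8.4 itself is not reproduced in this excerpt; the following reconstruction is consistent with the description above (names and defaults are assumptions):
lreg2 <- function(X, y, method="BFGS"){
    # X: matrix of regressors; y: vector of 0/1 responses
    X <- cbind(1, X)                  # add the constant regressor
    negLogL <- function(b, X, y){     # local function: negative log-likelihood
        p <- as.vector(1/(1 + exp(-X %*% b)))
        -sum(y*log(p) + (1 - y)*log(1 - p))
    }
    grad <- function(b, X, y){        # local function: negative gradient
        p <- as.vector(1/(1 + exp(-X %*% b)))
        -colSums((y - p)*X)
    }
    result <- optim(rep(0, ncol(X)), negLogL, gr=grad, hessian=TRUE,
                    method=method, X=X, y=y)   # X and y pass through ... to negLogL and grad
    list(coefficients=result$par, var=solve(result$hessian),
        deviance=2*result$value, converged=result$convergence == 0)
}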
One interesting feature of the lreg2 function in Figure 8.4 is that no explicit
error checking is performed. Even so, if errors are committed, error messages
will be printed. For example, suppose that we replace the response variable, lfp
, by a character vector with "yes" and "no" elements:
> Mroz$LFP <- ifelse(Mroz$lfp==0, "no", "yes")
The functions called by lreg2 do check for input errors, and in this case, the
error occurred when the negLogL function was first executed by optim . The
error message was informative, but unless we build in our own error checking,
we can sometimes produce very obscure error messages.
ESTIMATION BY ITERATED WEIGHTED LEAST SQUARES
A third approach to the problem, which we will leave as an exercise for the
reader, is to use IWLS to compute the logistic-regression coefficients, as glm
does. The relevant formulas for binomial logistic regression are given in Section
5.12. For GLMs with canonical link functions, which includes logistic
regression, IWLS provides identical updates to Newton-Raphson (McCullagh
and Nelder, 1989, p. 43).
Thus, for example, the number 210,363,258 would be rendered “two hundred ten
million, three hundred sixty-three thousand, two hundred fifty-eight.” There
really is no point in going beyond trillions, because the double-precision
floating-point numbers used by R can represent integers exactly only to about 15
decimal digits, or hundreds of trillions (see Section 2.6.2). Of course, we could
allow numbers to be specified optionally by arbitrarily long character strings of
numerals (e.g., "210363258347237492310" ). We leave that approach as an
exercise for the reader: It would not be difficult to extend our program in this
manner by allowing the user to specify the additional necessary suffixes
(quadrillions, quintillions, etc.).
One approach to converting numbers into words would be to manipulate the
numbers as integers, but it seems simpler to convert numbers into character
strings of numerals, which can then be split into individual characters: (1) larger
integers can be represented exactly as double-precision floating-point numbers
rather than as integers in R; (2) it is easier to manipulate the individual numerals
than to perform repeated integer arithmetic to extract digits; and (3) having the
numerals in character form allows us to take advantage of R’s ability to index
vectors by element names, as we will describe shortly.
We therefore define the following function to convert a number into a vector
of characters containing the numerals composing the number. The function first
converts the number into a character string with as.character and then uses
strsplit to divide the result into separate characters, one for each digit.
> makeDigits <- function(x) strsplit(as.character(x), "")[[1]]
> makeDigits(-123456)
> makeDigits(1000000000)
The second and third examples reveal that makeDigits has problems with
negative numbers and with large numbers that R renders in scientific notation.
By setting the scipen (scientific-notation penalty) option to a large number,
we can avoid the second problem:
> options(scipen=100)
> makeDigits(1000000000)
[1] "1" "0" "0" "0" "0" "0" "0" "0" "0" "0"
[1] 12345
> ones["5"]
5
"five"
> teens["3"]
3
"thirteen"
> tens["7"]
7
"seventy"
Figure 8.5 lists a function for converting a single integer into words; we have
added line numbers to make it easier to describe how the function works. First,
here are some examples of its use:
> number2words(123456789)
> number2words(-123456789)
Figure 8.5 The number2words function.
[1] "minus one hundred twenty-three million,
four hundred fifty-six,
thousand seven hundred eighty-nine"
> number2words(−123456000)
The first five lines of the function are essentially self-explanatory. The rest of
the function probably requires some explanation, however:
[6] If the number is composed of a single digit, then we can find the answer
by simply indexing into the vector ones ; the function as.vector is used
to remove the name (i.e., the numeral used as a label) of the selected
element.
[7-9] If the number is composed of two digits and is less than or equal to 19,
then we can get the answer by indexing into teens with the last digit (i.e.,
the second element of the digits vector). If the number is 20 or larger, then
we need to attach the tens digit to the ones digit, with a hyphen in between.
If, however, the ones digit is 0 , ones["0"] is "zero" , and thus we have an
embarrassing result, such as "twenty-zero" . More generally, the program
can produce spurious hyphens, commas, spaces, and the strings ", zero"
and "-zero" . Our solution is to write a function trim to remove the
unwanted characters. The trim function on lines [25−27] makes use of R’s
ability to manipulate text by processing regular expressions (see Section
2.4).
[10-14] If the number consists of three digits, then the first digit is used for
hundreds, and the remaining two digits can be processed as an ordinary
two-digit number. This is done by a recursive call to number2words , unless
the last two digits are both 0 , in which case, we don’t need to convert them
into words. The hundreds digit is then pasted onto the representation of the
last two digits, and the result is trimmed. The makeNumber function is used
to put the last two digits back into a number (assigned to the variable tail
). We do not bother to use the Recall mechanism for a recursive function
call because eventually number2words will become a local function, which
is therefore not in danger of being renamed.
[15-22] If the number contains more than three digits, then we are in the
realm of thousands, millions, and so on. The computation on line [16]
determines with which power of 1,000 we are dealing. Then, if the number
is not too large, the appropriate digits are stripped off from the left of the
number and attached to the proper suffix; the remaining digits to the right
are recomposed into a number and processed with a recursive call, to be
attached at the right.
[23] Finally, if the original number was negative, then the word "minus" is
pasted onto the front before the result is returned.
Figure 8.6 displays a function, called numbers2words , that adds some bells
and whistles. The various vectors of names are defined locally in the function;
the utility functions trim , makeNumber , and makeDigits are similarly defined as
local functions; and the function number2words , renamed helper , is also made
local. Using a helper function rather than a direct recursive call permits efficient
vectorization, via sapply , at the end of numbers2words . Were numbers2words to
call itself recursively, the local definitions of objects (such as the vector ones and
the function trim ) would be recomputed at each call, rather than only once.
Because of R’s lexical scoping (see Section 8.9), objects defined in the
environment of numbers2words are visible to helper .
The function numbers2words includes a couple of additional features. First,
according to the Oxford English Dictionary, the definition of “billion” differs in
the United States and (traditionally) in Britain: “1. orig. and still commonly in
Great Britain: A million millions. (= U.S. trillion.) … 2. In U.S., and
increasingly in Britain: A thousand millions.” Thus, if the argument billion is
set to "UK" , a different vector of suffixes is used. Moreover, provision is made to
avoid awkward translations that repeat the word “million,” such as “five
thousand million, one hundred million, …,” which is instead, and more properly,
rendered as “five thousand, one hundred million, … .”
The traceback function supplies a little more information about the context of
the error:
Figure 8.8 The bugged lreg3 function, with a call to browser inserted.
The problem is apparently in the command var.b <- solve(t(X) %*% V %*%
X) in lreg3 , but what exactly is wrong here was not immediately clear to us.
Using an editor or the fix function, we insert a call to browser immediately
before the error, as shown in Figure 8.8. We have “outdented” the line containing
the call to browser to help us remember to remove it after the function is
debugged. Executing the function now causes it to pause before the offending
line:
> mod.mroz.1b <- with(Mroz,
+     lreg3(cbind(k5, k618, age, wc, hc, lwg, inc), lfp))
Called from: lreg3(cbind(k5, k618, age, wc, hc, lwg, inc), lfp)
Browse[1]>
The Browse[1]> prompt indicates that the interpreter is waiting for input in
browser mode. We can type the names of local variables to examine their
contents or, indeed, evaluate any R expression in the environment of the lreg3
function; for example, entering the objects() command would list the
function’s local objects. We can also type any of several special browser
commands followed by the Enter key:
c , cont , or just Enter: Continue execution. In our case, this would simply
result in the original error.
n : Enter a step-by-step debugger, in which the function continues execution
one line at a time, as if a call to browser were inserted before each
subsequent line. In the step-by-step debugger, the meaning of the c and
cont browser commands changes: Rather than execution continuing to the
end of the function, it instead continues to the end of the current context—
essentially the next right brace } —such as the end of a loop. Moreover, n
and the Enter key simply execute the next command.
where : Indicates where execution has stopped in the stack of pending
function calls—similar to the output produced by traceback .
Q : Quit execution of the function, and return to the R command prompt.
Browse[1]> str(diag(p * (1 - p)))
num 0.25
Browse[1]> str(p * (1 - p))
num [1:753, 1] 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 …
The source of the problem is now clear: p is a 753×1 matrix, not a vector, and
thus p*(1 − p) is also a 753 × 1 matrix. In some instances, R treats vectors and
single-row or single-column matrices differently, and this is one of those cases.
The diag function applied to a vector of length n returns an n×n diagonal matrix,
while the diag function applied to an n×1 or 1×n matrix returns the “diagonal”
of the matrix, which is just the [1, 1] element. The solution is to coerce the
product to a vector, p <- as.vector(1/(1 + exp(-X %*% b))) , returning us to
the correct definition of lreg1 (in Figure 8.3, p. 388).
As an alternative to inserting calls to browser into the source code of a
function, we can instead invoke the debug function—for example, debug
(lreg3) . When we next execute lreg3 , we can use the step-by-step debugger
commands to move through it, examining the values of local variables, and so
on, as we proceed.
Another alternative to inserting calls to browser is to set the error option to
dump.frames . We can subsequently use the postmortem debugger function to
examine the local state of a program at the point of an error. In the current
illustration, a dialog with debugger might begin as follows:
Very similar results are obtained by setting options(error=recover) ; see
?recover for details.
The three functions produce the same answer within rounding error, and the
estimated regression coefficients are all close to 1, as anticipated, but lreg1 is
extremely inefficient relative to the other two functions.
The Rprof function profiles R code; that is, it provides an accounting of where
a function spends its time and where it allocates memory. Profiling is not
infallible, but it generally helps locate bottlenecks in execution time and
problematic memory use. Profiling lreg1 produces these results:
Profiling is started by a call to the Rprof function. Rprof interrupts the execution
of lreg1 at regular intervals, with a default of 20 milliseconds; writes
information about the current state of the computation to a file; and optionally
tracks changes in memory allocation. A second call to Rprof turns off profiling.
The summaryRprof function summarizes the time and memory audit created by
Rprof .
In our example, we use the tempfile command to create a temporary file to
hold the results; this file can grow very large for a lengthy computation and is
later deleted by the unlink function. The argument memory="both" to
summaryRprof asks for a summary of all memory allocated in each function call.
The summaryRprof command reports several kinds of information:
Figure 8.9 Another modification of lreg1 , without computing the diagonal matrix V .
total.time : The time in seconds spent in each function, including the time
spent in functions called by that function.
total.pct : The percentage of total time spent in each function, again
including the time spent in functions called by the function.
self.time : The time spent in each function exclusive of the time spent in
functions called by the function.
self.pct : The percentage of total time spent exclusively in each function;
these percentages should sum to 100.
mem.total : The total amount of memory, in megabytes, allocated in each
function.
The list returned by summaryRprof contains three elements: sorting the results
by self.time and by total.time , and indicating the total execution time
(sampling.time ). The results are incomplete in the sense that not every function
called in the course of the computation is represented; for example, cbind is
missing. If we decreased the sampling interval, then we would obtain a more
complete accounting, but if the interval is made too short, the results can become
inaccurate. As we would expect, 100% of the time is spent in the lreg1 function
and in the functions that it calls. Beyond that observation, it is clear that most of
the time, and almost all the memory, is consumed by the computation of the
diagonal matrix V . Because all the off-diagonal entries of this n × n matrix are 0,
repeatedly forming the matrix and using it in matrix multiplication are wasteful
in the extreme.
Figure 8.9 displays a version of the logistic-regression function that avoids
forming V and that also substitutes the slightly more efficient function crossprod
for some of the matrix multiplications:12
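Figure 8.9 itself is not reproduced in this excerpt; the key change, sketched in the notation of the lreg1 reconstruction above, replaces the lines that form and use V with something like:
p <- as.vector(1/(1 + exp(-X %*% b)))
var.b <- solve(crossprod(X, p*(1 - p)*X))  # computes X'VX without forming the n-by-n matrix V
b <- b + var.b %*% crossprod(X, y - p)     # X'(y - p) via crossprod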
Indeed, lreg4 is so fast for our relatively small problem that we had to reduce
the sampling interval (to 2 milliseconds) to get the profiling to work!
> summary
function (object, ...)
UseMethod("summary")
<environment: namespace:base>
> print
function (x, ...)
UseMethod("print")
<environment: namespace:base>
Our print.lreg method prints a brief report, while the output produced by
summary.lreg is more extensive. The functions cat , print , and printCoefmat
are used to produce the printed output. We are already familiar with the generic
print function. The cat function may also be used for output to the R console.
Each new-line character (\n ) in the argument to cat causes output to resume at
the start of the next line. The printCoefmat function prints coefficient matrices
in a pleasing form. As we explained, it is conventional for the arguments of a
method to be the same as the arguments of the corresponding generic function,
and thus we have arguments x and … for print.lreg , and object and … for
summary.lreg .
It is also conventional for print methods to return their first argument as an
invisible result and for summary methods to create and return objects to be
printed by a corresponding print method. According to this scheme,
summary.lreg returns an object of class "summary.lreg" , to be printed by the
print method print.summary.lreg . This approach produces summary objects
that can be used in further computations. For example,
summary(mod.mroz.3)$coef[ , 3] returns the column of z values from the
coefficient table.
The first argument to setClass is the name of the class being defined, here
"lreg5" . The second argument calls the representation function to define the
slots that compose objects of class "lreg5" ; each argument to representation
is a slot name and identifies the kind of data that the slot is to contain—for
example a numeric vector, a matrix, or a character vector.
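The call itself is not shown in this excerpt; a sketch (the slot names, apart from coefficients , are assumptions) is:
> setClass("lreg5",
+     representation(coefficients="numeric", var="matrix",
+         deviance="numeric", predictors="character", iterations="numeric"))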
Our S4 object-oriented logistic-regression program, named lreg5 and
displayed in Figure 8.12, is similar to the S3 function lreg.default (Figure
8.10, p. 408). The lreg5 function creates the class "lreg5" object result by
calling the general object-constructor function new and supplying the contents of
each slot, which are automatically checked for appropriateness against the class
representation. The lreg5 function terminates by returning the object result .
Let us try out lreg5 on Mroz’s data:
Figure 8.12 An S4 version of our logistic-regression program.
In S3, typing the name of an object or entering any command that is not an
assignment causes the generic print function to be invoked. Typing the name of
an object in S4 similarly invokes the show function. Because we have not yet
defined a "show" method for objects of class "lreg5" , the default method—
which in S4 is the function simply named show —would be invoked and would
just display the values of all the slots of mod.mroz.4 . The show S4 generic
function has the following definition:
> show
function (object)
standardGeneric("show")
<environment: 0x12da7c0>
Methods may be defined for arguments: object
Use showMethods("show") for currently available ones.
(This generic function excludes non-simple inheritance; see ?setIs)
We proceed to define a "show" method for objects of class "lreg5" using the
setMethod function:
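A minimal method of the kind described, assuming the illustrative coefficients slot used in the sketch above, is:

setMethod("show", signature(object="lreg5"),
    definition=function(object) print(object@coefficients))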
The first argument to setMethod , "show" , gives the name of the method that
we wish to create. The second argument indicates the signature of the method—
the kind of objects to which it applies. In our example, the function applies only
to "lreg5" objects, but S4 permits more complex signatures for functions with
more than one argument. The final argument to setMethod defines the method
function. This may be a preexisting function or, as here, an anonymous function
defined on the fly. Methods in S4 must use the same arguments as the generic
function, for example, the single argument object for a "show" method. The @
(at sign) operator is used to extract the contents of a slot, much as $ (dollar sign)
is used to extract a list element.
Let us verify that the new method works properly:
The "show" method for objects of class "lreg5" reports only the regression
coefficients. We next define a "summary" method that outputs more information
about the logistic regression:
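In outline (with slot names that are merely illustrative, not taken from the text), such a method might look like this:

setMethod("summary", signature(object="lreg5"),
    definition=function(object, ...) {
        b <- object@coefficients
        se <- sqrt(diag(object@var))          # standard errors from the covariance matrix
        z <- b/se
        table <- cbind(Estimate=b, "Std. Error"=se, "z value"=z,
            "Pr(>|z|)"=2*pnorm(abs(z), lower.tail=FALSE))
        printCoefmat(table)                   # print the coefficient table
        invisible(object)
    })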
Because the generic summary function has two arguments, object and … , the method must have the same two arguments, even though … is never used in the body of the method. The
argument … can be used to soak up additional arguments for different methods in
both S3 and S4 generic functions. Applying summary to the model produces the
desired result:
Finally, a word about inheritance in S4. Recall that in S3, an object can have
more than one class. The first class is the object’s primary class, but if a method
for a particular generic function does not exist for the primary class, then
methods for the second, third, and so on, classes are searched for successively. In
S4, in contrast, each object has one and only one class. Inheritance is a
relationship between classes and not a property of objects. If one class extends
another class, then the first class inherits the methods of the second. Inheritance
is established by the setIs function: setIs("classA", "classB") asserts that
"classA" extends, and therefore can inherit methods from, "classB" ; put
another way, objects of class "classA" also belong to class "classB" .
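In practice, inheritance is most often declared when a class is created, by the contains argument to setClass, which establishes the same relationship; a toy sketch:

setClass("classB", representation(b="numeric"))
setClass("classA", contains="classB",      # "classA" extends "classB" and
    representation(a="numeric"))           #   inherits its slots and methods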
The object-oriented programming system in S4 is more complex than in S3—
indeed, we have only scratched the surface here, showing how to do in S4 what
we previously learned to do in S3. For example, unlike in S3, S4 method
dispatch can depend on the class of more than one argument, and instances of S4
objects are checked automatically for consistency with the class definition. The
S4 object system has been used to develop some important R software, such as
the lme4 package for fitting linear and generalized linear mixed-effects
models.14 Nevertheless, most current object-oriented software in R still uses the
older S3 approach.
The logistic-regression programs that we have presented thus far put the burden
on the user to form the response vector y of 0s and 1s; to construct the model
matrix X—including generating dummy variables from factors, creating
interaction regressors, and handling polynomials in a numerically stable way; to
select cases to use in a particular analysis; and so on. In addition, our functions
make no provision for missing data, an annoying oversight that will plague the
data analyst in most real problems. Most, but not all, of the modeling functions
we use in R do these things for us, and in a more or less consistent way.
Figure 8.13 A formula method for lreg , which uses the standard arguments for R modeling functions.
The first argument, x , is a numeric matrix of predictors, and the second, y , is the
response variable; the remaining arguments control the details of model fitting.
The function Glmnet in Figure 8.14 acts as a front-end to glmnet , facilitating
the specification of the model and data. The ellipses argument … in Glmnet is
used to pass arguments to glmnet . The response, on the left side of the model
formula, should be a numeric variable for family="gaussian" ; a two-column matrix, a factor with two levels, or a vector of 0s and 1s for family="binomial" ; or a factor with more than two levels for family="multinomial" .
For example, applying Glmnet to the Prestige data in the car package:
> g1 <- Glmnet(prestige ~ income + education + women + type,
+ data=Prestige)
> class(g1)
The Glmnet command uses the model formula to create the arguments x and y
for glmnet from the Prestige data frame. For example, the factor type in the
data frame is converted into dummy variables. In constructing the matrix of
predictors, the column for the intercept is suppressed, because the glmnet
function includes an intercept in the model: Having a vector of 1s in the
predictor matrix x would be redundant. The last line of Glmnet is a call to glmnet
, and so the Glmnet function returns the result produced by glmnet ; the object g1
, therefore, contains all the output from glmnet , which we do not describe here
but which can be plotted or otherwise examined in the usual manner.
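The general recipe for building such a front end is to construct a model frame from the formula, extract the response and the model matrix, drop the intercept column, and hand the pieces to glmnet. A simplified variant (here called Glmnet0, not the code of Figure 8.14; it ignores arguments such as subset and na.action and assumes that the glmnet package is attached):

Glmnet0 <- function(formula, data, family="gaussian", ...) {
    mf <- model.frame(formula, data)            # evaluate the formula in the data
    y <- model.response(mf)                     # the response
    X <- model.matrix(attr(mf, "terms"), mf)    # dummy regressors, interactions, etc.
    X <- X[, -1, drop=FALSE]                    # remove the intercept column
    glmnet(X, y, family=family, ...)            # ... passes extra arguments to glmnet
}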
The material in Section 2.2 that describes how objects are located along the R
search path should suffice for the everyday use of R in data analysis. Sometimes,
however, the behavior of R seems mysterious because of the scoping rules that it
uses. In this section, we provide further discussion of scoping rules, in particular
with respect to the manner in which the values of variables are determined when
functions are executed. This material, while relatively difficult, can be important
in writing R programs and in understanding their behavior.
> a <- 10
> x <- 5
> f(2)
[1] 12
[1] 12
[1] 7
The local function g is defined within the function f , and so the environment of
g comprises the local frame of g , followed by the local frame of f , followed by
the global environment. Because a is a free variable in g , the interpreter next
looks for a value for a in the local frame of f ; it finds the value a ≡ 5 , which
shadows the global binding a ≡ 10 .
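The definitions of f and g are not repeated here, but the shadowing behavior can be reproduced with a stripped-down pair of functions (a reconstruction for illustration, not the exact example above):

a <- 10
f <- function(x) {
    a <- 5                    # local a, shadowing the global a
    g <- function(y) y + a    # a is a free variable in g
    g(x)
}
f(2)                          # 7: g finds a = 5 in f's frame, not the global a = 10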
Lexical scoping can be quite powerful and even mind-bending. Consider the
following function:
> makePower <- function(power) {
+ function(x) x^power
+ }
The makePower function returns a closure as its result (not just a function, but a
function plus an environment):
> square <- makePower(2)
> square
function(x) x^power
<environment: 0x3d548b8>
> square(4)
[1] 16
> cuberoot <- makePower(1/3)
> cuberoot
function(x) x^power
<environment: 0x3c8b878>
> cuberoot(64)
[1] 4
When makePower is called with the argument 2 (or 1/3 ), this value is bound to
the local variable power . The function that is returned is defined in the local
frame of the call to makePower and therefore inherits the environment of this call,
including the binding of power . Thus, even though the functions square and
cuberoot look the same, they have different environments: The free variable
power in square takes on the value 2 , while the free variable power in cuberoot
takes on the value 1/3 .
We conclude this section with one additional example. Tukey’s test for non-
additivity is computed by the residualPlots function in the car package (as
described in Section 6.2). The outline of the test is as follows: (1) fit a linear
model; (2) compute the squared fitted values, ŷ², corresponding to the original data points; (3) update the model by adding ŷ² to it as a new regressor; and (4) compute Tukey’s test as the t value for the additional regressor, with a p
value from the standard-normal distribution. For example, using the Canadian
occupational-prestige data from the car package:
> m1 <- lm(prestige ~ income + education + women + type,
+ data=na.omit(Prestige))
> yhat2 <- predict(m1)^2
> m2 <- update(m1, . ~ . + yhat2)
> test <- summary(m2)$coef["yhat2", 3]
> pval <- 2*pnorm(abs(test), lower.tail=FALSE)
> c(TukeyTest=test, Pvalue=pval)
TukeyTest Pvalue
-2.901494 0.003714
The small p value suggests that the linear predictor does not provide an adequate
description of the expected value of prestige given the predictors. We used
na.omit(Prestige) to replace the original data frame with a new one, deleting all
rows with missing values; otherwise, because of missing data in type , the fitted
values yhat2 would have been of the wrong length (see Sections 2.2.3 and
4.8.5).
Tukey’s test is sufficiently useful that we might want to write a function for its
routine application. Copying the code for the example almost exactly, we have
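(The figure containing the function is not reproduced here; reconstructed from the description in the text, it would look more or less as follows.)

tukeyTest <- function(model) {
    yhat2 <- predict(model)^2               # squared fitted values, a local variable
    m2 <- update(model, . ~ . + yhat2)      # refit with yhat2 as an additional regressor
    test <- summary(m2)$coef["yhat2", 3]
    pval <- 2*pnorm(abs(test), lower.tail=FALSE)
    c(TukeyTest=test, Pvalue=pval)
}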
If we are careful to delete the variable named yhat2 from the global frame
before trying to run this program, we get
> remove(yhat2)
> tukeyTest(m1)
The problem is that yhat2 is defined in the local frame of the function, whereas update refits model , whose formula was created in the global environment; the refit therefore cannot see variables in the function’s local frame, and the result is an error message reporting that yhat2 cannot be found.
A quick-and-dirty solution to this problem is to define yhat2 in the global
frame and then delete it before leaving the function:
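One way to carry out that idea, sketched here only to illustrate the point (writing into the global environment is poor practice, which is why we call the solution quick and dirty):

tukeyTest <- function(model) {
    assign("yhat2", predict(model)^2, envir=globalenv())   # make yhat2 visible to update
    on.exit(remove(yhat2, envir=globalenv()))              # clean up when the function exits
    m2 <- update(model, . ~ . + yhat2)
    test <- summary(m2)$coef["yhat2", 3]
    pval <- 2*pnorm(abs(test), lower.tail=FALSE)
    c(TukeyTest=test, Pvalue=pval)
}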
The function getAnywhere can find objects that would otherwise be hidden in
the namespace of a package. The tukeyNonaddTest function isn’t exported from
the car package because it isn’t meant to be called directly by the user.19
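Should you nevertheless wish to examine it, getAnywhere retrieves the function by name and prints its definition along with the namespace in which it was found:

> getAnywhere("tukeyNonaddTest")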
1All R functions return values, even if the value returned is NULL or invisible (or, as here, both NULL and
invisible). The most common side effects in R—that is, effects produced by functions other than returned
values—are the creation of printed output and graphs.
2It is possible to call a function without supplying values for all its arguments, even arguments that do not
have defaults. First, arguments are evaluated only when they are first used in a function (a process called
lazy evaluation, described later), and therefore if an argument is never used, it need not have a value.
Second, the programmer can use the missing function to test whether an argument has been specified and
proceed accordingly, avoiding the evaluation of an argument not given in the function call.
3A relatively subtle point is that … can be used only for arguments that aren’t explicitly specified in the plot
command in inflPlot . For example, were we to attempt to change the axis labels by the command
inflPlot(m1, xlab="Leverages", ylab="Stud. Res.") , relying on … to absorb the xlab and ylab
arguments, an error would be produced, because plot would then be called with duplicate xlab and ylab
arguments. Duplicate arguments aren’t permitted in a function call. If we really thought it worthwhile to
allow the user to change the axis labels, we could explicitly provide xlab and ylab arguments to inflPlot ,
with the current labels as defaults, and pass these arguments explicitly to plot .
4Environments in R are discussed in Section 8.9.
5This example is adapted from Ligges and Fox (2008). For readers unfamiliar with calculus, integration
finds areas under curves—in our example, areas under the standard-normal density curve, which are
probabilities.
6The material in this section is adapted from Ligges and Fox (2008).
7The system.time function reports three numbers, all in seconds: user time is the time spent executing
program instructions, system time is the time spent using operating-system services (e.g., file input/output),
and elapsed time is clock time. Elapsed time is usually approximately equal to, or slightly greater than, the
sum of the other two, but because of small measurement errors, this may not be the case.
8We anticipate applications in which β is the parameter vector in a regression model with a linear predictor
that includes k regressors and a constant.
9The convergence test could be incorporated into the termination condition for the while loop, but we
wanted to illustrate breaking out of a loop. We invite the reader to reprogram lreg1 in this manner. Be
careful, however, that the loop does not terminate the first time through, because b and b.last both start at
0.
10In addition to the optim function, the standard R distribution includes nlm and nlminb for general
optimization. The functions constrOptim , for constrained optimization, and mle (in the stats4 package),
for maximum-likelihood estimation, provide front-ends to the optim function. The optimize function does
one-dimensional optimization. Additional optimizers are available in several contributed packages. See the
CRAN Optimization Task View at http://cran.r-project.org/web/views/Optimization.html .
11Lest we be accused of cynicism, let us explain that we extrapolate here from our own experience.
12crossprod multiplies its second argument on the left by the transpose of its first argument.
13See Section 1.4 for an explanation of how S3 object dispatch works.
14Mixed-effects models are discussed in the online appendix to the text.
15More details are available in the R Project Developer Page at http://developer.r-project.org/model-fitting-functions.txt .
16Even though the first argument of the S3 generic function lreg is X , it is permissible to use the name
formula for the first argument of the formula method lreg.formula . The other argument of the generic, … ,
also appears in lreg.formula , as it should, and is passed through in the call to lreg.default .
17Another common rule is dynamic scoping, according to which the environment of a function comprises
the local frame of the function followed by the environment from which the function was called, not, as in
lexical scoping, the environment in which it was defined. Dynamic scoping, which is not used in R, is less
powerful than lexical scoping for some purposes, but it is arguably more intuitive.
18If R were dynamically scoped, as described in Footnote 17, then when g is called from f , the interpreter
would look first for a free variable in the frame of f .
19Package namespaces are discussed in the Writing R Extensions manual (R Development Core Team,
2009b).
20See Graham (1994, 1996) for an eloquent discussion of these points in relation to another functional
programming language—Lisp.
21This dictum, and a great deal of other good advice on programming, originates in Kernighan and Plauger
(1974); also see Kernighan and Pike (1999).
22The process of creating an R package is described in the Writing R Extensions manual (R Development
Core Team, 2009b).
References
Abelson, H., Sussman, G. J., and Sussman, J. (1985). Structure and Interpretation of Computer Programs.
MIT Press, Cambridge, MA.
Adler, D. and Murdoch, D. (2010). rgl: 3D visualization device system (OpenGL). R package version 0.89.
Agresti, A. (2002). Categorical Data Analysis. Wiley, Hoboken, NJ, second edition.
Agresti, A. (2007). An Introduction to Categorical Data Analysis. Wiley, Hoboken, NJ, second edition.
Agresti, A. (2010). Analysis of Ordinal Categorical Data. Wiley, Hoboken, NJ, second edition.
Andrews, F. (2010). playwith: A GUI for interactive plots using GTK+. R package version 0.9-45.
Atkinson, A. C. (1985). Plots, Transformations and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Clarendon Press, Oxford.
Becker, R. A., Chambers, J. M., and Wilks, A. R. (1988). The New S Language: A Programming Environment for Data Analysis and Graphics. Wadsworth, Pacific Grove, CA.
Becker, R. A. and Cleveland, W. S. (1996). S-Plus Trellis Graphics User’s Manual. Seattle.
Becker, R. A., Wilks, A. R., Brownrigg, R., and Minka, T. P. (2010). maps: Draw Geographical Maps. R
package version 2.1-4.
Berk, R. A. (2008). Statistical Learning from a Regression Perspective. Springer, New York.
Berndt, E. R. (1991). The Practice of Econometrics: Classic and Contemporary. Addison-Wesley,
Reading, MA.
Birch, M. W. (1963). Maximum likelihood in three-way contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), 25(1):220–233.
Bowman, A. W. and Azzalini, A. (1997). Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press, Oxford.
Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical
Society. Series B (Methodological), 26(2):211–252.
Box, G. E. P. and Tidwell, P. W. (1962). Transformation of the independent variables. Technometrics,
4(4):531–550.
Braun, W. J. and Murdoch, D. J. (2007). A First Course in Statistical Programming with R. Cambridge
University Press, Cambridge, UK.
Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient
variation. Econometrica, 47(5):1287–1294.
Campbell, A., Converse, P. E., Miller, P. E., and Stokes, D. E. (1960). The American Voter. Wiley, New
York.
Chambers, J. M. (1992). Linear models. In Chambers, J. M. and Hastie, T. J., editors, Statistical Models in S, pages 95–144. Wadsworth, Pacific Grove, CA.
Chambers, J. M. (1998). Programming with Data: A Guide to the S Language. Springer, New York.
Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer, New York.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data
Analysis. Wadsworth, Belmont, CA.
Chambers, J. M. and Hastie, T. J., editors (1992). Statistical Models in S. Wadsworth, Pacific Grove, CA.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, NJ.
Cleveland, W. S. (1994). The Elements of Graphing Data, Revised Edition. Hobart Press, Summit, NJ.
Clogg, C. C. and Shihadeh, E. S. (1994). Statistical Models for Ordinal Variables. Sage, Thousand Oaks,
CA.
Cook, D. and Swayne, D. F. (2009). Interactive Dynamic Graphics for Data Analysis: With R and GGobi. Springer, New York.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1):15–18.
Cook, R. D. (1993). Exploring partial residual plots. Technometrics, 35(4):351–362.
Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley, New York.
Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, New
York.
Cook, R. D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika,
70(1):1–10.
Cook, R. D. and Weisberg, S. (1991). Added variable plots in linear regression. In Stahel, W. and
Weisberg, S., editors, Directions in Robust Statistics and Diagnostics, Part I, pages 47–60, New York.
Springer.
Cook, R. D. and Weisberg, S. (1994). Transforming a response variable for linearity. Biometrika,
81(4):731–737.
Cook, R. D. and Weisberg, S. (1997). Graphics for assessing the adequacy of regression models. Journal of the American Statistical Association, 92(438):490–499.
Cook, R. D. and Weisberg, S. (1999). Applied Regression Including Computing and Graphics. Wiley, New York.
Cunningham, R. and Heathcote, C. (1989). Estimating a non-Gaussian regression model with multicollinearity. Australian Journal of Statistics, 31:12–17.
Davis, C. (1990). Body image and weight preoccupation: A comparison between exercising and non-
exercising women. Appetite, 15:13–21.
Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge
University Press, Cambridge.
Efron, B. (2003). The statistical century. In Panaretos, J., editor, Stochastic Musings: Perspectives from the
Pioneers of the Late 20th Century, pages 29–44. Lawrence Erlbaum Associates, New Jersey.
Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall, New York.
Ericksen, E. P., Kadane, J. B., and Tukey, J. W. (1989). Adjusting the 1990 Census of Population and
Housing. Journal of the American Statistical Association, 84: 927–944.
Fienberg, S. E. (1980). The Analysis of Cross-Classified Categorical Data, Second Edition. MIT Press,
Cambridge, MA.
Fox, J. (1987). Effect displays for generalized linear models. In Clogg, C. C., editor, Sociological
Methodology 1987 (Volume 17), pages 347–361. American Sociological Association, Washington,
D.C.
Fox, J. (1991). Regression Diagnostics: An Introduction. Sage, Newbury Park, CA.
Fox, J. (2000a). A Mathematical Primer for Social Statistics. Sage, Thousand Oaks, CA.
Fox, J. (2000b). Nonparametric Simple Regression: Smoothing Scatterplots. Quantitative Applications in
the Social Sciences. Sage, Thousand Oaks, CA.
Fox, J. (2003). Effect displays in R for generalised linear models. Journal of Statistical Software, 8(15):1–
27.
Fox, J. (2005a). Programmer’s Niche: How do you spell that number? R News, 5(1):51–55.
Fox, J. (2005b). The R commander: A basic-statistics graphical user interface to R. Journal of Statistical
Software, 14(9):1–42.
Fox, J. (2008). Applied Regression Analysis, Linear Models, and Related Methods. Sage, Thousand Oaks,
CA, second edition.
Fox, J. (2009). Aspects of the social organization and trajectory of the R Project. The R Journal, 1(2).
Fox, J. and Andersen, R. (2006). Effect displays for multinomial and proportional-odds logit models.
Sociological Methodology, 36:225–255.
Fox, J. and Guyer, M. (1978). “Public” choice and cooperation in n-person prisoner’s dilemma. The
Journal of Conflict Resolution, 22(3):469–481.
Fox, J. and Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical
Association, 87(417):178–183.
Freedman, D. and Diaconis, P. (1981). On the histogram as a density estimator. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57:453–476.
Freedman, J. L. (1975). Crowding and Behavior. Viking, New York.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Regularization paths for generalized linear models via
coordinate descent. unpublished.
Gentleman, R. (2009). R Programming for Bioinformatics. Chapman and Hall, Boca Raton.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley,
Hoboken, NJ.
Graham, P. (1994). On Lisp: Advanced Techniques for Common Lisp. Prentice Hall, Englewood Cliffs, NJ.
Graham, P. (1996). ANSI Common Lisp. Prentice Hall, Englewood Cliffs, NJ.
Harper, E. K., Paul, W. J., Mech, L. D., and Weisberg, S. (2008). Effectiveness of lethal, directed wolf-depredation control in Minnesota. Journal of Wildlife Management, 72(3):778–784.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, second edition.
Huber, P. and Ronchetti, E. M. (2009). Robust Statistics. Wiley, Hoboken, NJ, second edition.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5:299–314.
Jones, O., Maillardet, R., and Robinson, A. (2009). Introduction to Scientific Programming and Simulation
Using R. Chapman and Hall, Boca Raton.
Kernighan, B. W. and Pike, R. (1999). The Practice of Programming. Addison-Wesley, Reading, MA.
Kernighan, B. W. and Plauger, P. J. (1974). The Elements of Programming Style. McGraw-Hill, New York.
Landwehr, J. M., Pregibon, D., and Shoemaker, A. C. (1980). Some graphical procedures for studying a logistic regression fit. In Proceedings of the Business and Economics Statistics Section, American Statistical Association, pages 15–20.
Leisch, F. (2002). Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31.
Leisch, F. (2003). Sweave, part II: Package vignettes. R News, 3(2):21–24.
Lemon, J. (2006). Plotrix: a package in the red light district of R. R News, 6(4):8–12.
Ligges, U. and Fox, J. (2008). R help desk: How can I avoid this loop or make it faster? R News, 8(1):46–50.
Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Wiley, Hoboken NJ,
second edition.
Loader, C. (1999). Local Regression and Likelihood. Springer, New York.
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Sage, Thousand
Oaks, CA.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3):217–224.
Mallows, C. L. (1986). Augmented partial residuals. Technometrics, 28(4):313–319.
McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, Second Edition. Chapman & Hall, London.
Moore, D. S. and McCabe, G. P. (1993). Introduction to the Practice of Statistics, Second Edition.
Freeman, New York.
Moore, J. C., Jr. and Krupat, E. Relationship between source status, authoritarianism, and conformity in a
social setting. Sociometry, 34: 122–134.
Mosteller, F. and Tukey, J. W. (1977). Data Analysis and Regression: A Second Course in Statistics.
Addison-Wesley, Reading, MA.
Mroz, T. A. (1987). The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions. Econometrica, 55(4):765–799.
Murrell, P. (2006). R Graphics. Chapman and Hall, Boca Raton.
Murrell, P. and Ihaka, R. (2000). An approach to providing mathematical annotation in plots. Journal of Computational and Graphical Statistics, 9(3):582–599.
Nelder, J. A. (1977). A reformulation of linear models. Journal of the Royal Statistical Society. Series A (General), 140(1):48–77.
Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal
Statistical Society. Series A (General), 135(3):370–384.
Ornstein, M. D. (1976). The boards and executives of the largest Canadian corporations: Size,
composition, and interlocks. Canadian Journal of Sociology,1: 411–437.
Phipps, P., Maxwell, J. W., and Rose, C. (2009). 2009 annual survey of the mathematical sciences. Notices of the American Mathematical Society, 57:250–259.
Powers, D. A. and Xie, Y. (2000). Statistical Methods for Categorical Data Analysis. Academic Press, San Diego.
Pregibon, D. (1981). Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724.
R Development Core Team (2009a). R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051−07−0.
R Development Core Team (2009b). Writing R Extensions. R Foundation for Statistical Computing,
Vienna, Austria.
Ripley, B. D. (2001). Using databases with R. R News, 1(1):18–20.
Rizzo, M. L. (2008). Statistical Computing with R. Chapman and Hall, Boca Raton.
Sall, J. (1990). Leverage plots for general linear hypotheses. The American Statistician, 44(4):308–315.
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. Springer, New York.
Sarkar, D. and Andrews, F. (2010). latticeExtra: Extra Graphical Utilities Based on Lattice. R package
version 0.6–11.
Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. Chapman and Hall, New York.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Simonoff, J. S. (2003). Analyzing Categorical Data. Springer, New York.
Spector, P. (2008). Data Manipulation with R. Springer, New York.
Stine, R. and Fox, J., editors (1996). Statistical Computing Environments for Social Research. Sage,
Thousand Oaks, CA.
Swayne, D. F., Cook, D., and Buja, A. (1998). XGobi: Interactive dynamic data visualization in the X
Window system. Journal of Computational and Graphical Statistics, 7(1):113–130.
Tierney, L. (1990). Lisp-Stat: An Object-Oriented Environment for Statistical Computing and Dynamic
Graphics. Wiley, Hoboken, NJ.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
Tukey, J. W. (1949). One degree of freedom for non-additivity. Biometrics, 5(3): 232–242.
Tukey, J. W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
Urbanek, S. and Wichtrey, T. (2010). iplots: iPlots - interactive graphics for R. R package version 1.1-3.
Velilla, S. (1993). A note on the multivariate Box-Cox transformation to normality. Statistics and Probability Letters, 17:259–263.
Venables, W. N. and Ripley, B. D. (2000). S Programming. Springer, New York.
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, New York, fourth edition.
Wang, P. C. (1985). Adding a variable in generalized linear models. Technometrics, 27:273–276.
Wang, P. C. (1987). Residual plots for detecting nonlinearity in generalized linear models. Technometrics,
29(4):435–438.
Weisberg, S. (2004). Lost opportunities: Why we need a variety of statistical languages. Journal of
Statistical Software, 13(1):1–12.
Weisberg, S. (2005). Applied Linear Regression. John Wiley & Sons, Hoboken, NJ, third edition.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica, 48(4):817–838.
Wickham, H. (2009). ggplot2: Using the Grammar of Graphics with R. Springer, New York.
Wilkinson, G. N. and Rogers, C. E. (1973). Symbolic description of factorial models for analysis of
variance. Journal of the Royal Statistical Society. Series C (Applied Statistics), 22(3):392–399.
Wilkinson, L. (2005). The Grammar of Graphics. Springer, New York, second edition.
Williams, D. A. (1987). Generalized linear model diagnostics using the deviance and single case deletions.
Applied Statistics, 36:181–191.
Yeo, I.-K. and Johnson, R. A. (2000). A new family of power transformations to improve normality or
symmetry. Biometrika, 87(4):954–959.
Zeileis, A., Hornik, K., and Murrell, P. (2009). Escaping RGBland: Selecting colors for statistical graphics.
Computational Statistics & Data Analysis, 53:3259–3270.
Subject Index
B-splines, 178
backward elimination, 209
bandwidth, 111, 134
Bayesian information criterion (BIC), 210
binary response, 233
and binomial response, 244–246
bindings, 417, 419
binomial distribution, 112
binomial regression, 240–246
overdispersed, 277
binomial response, 233, 240
and binary response, 244–246
Bioconductor Project, xiii, 92
bits, 92
Bonferroni adjustment, 295, 318
bootstrap, xviii, 187–188, 295
bottom-up programming, 422
boundary bias, 346
Box-Cox (BC) transformations, 131, 303, 309
Box-Tidwell method, 312–314
boxplots, 114–115, 161, 353, 354
parallel, 121
braces, curly, 111, 360, 374
browser commands, 400–401
bubble-plot, 299
bugs, 18
See also debugging
bulging rule, 138, 310
bytes, 92
F distribution, 112
factors, 24, 47, 67, 121, 157–171
polynomial contrasts for, 216
coding, 213–220
ordered, 216
releveling, 162, 167, 203, 214
user defined contrasts for, 219
family-generator function, 233
floating point numbers, 92, 102, 103, 393
formal arguments, to functions, 16, 360
formulas, 151
for graphs, 123
for scatterplots, 116, 118, 125
in lattice graphics, 352
one-sided, 125
See also model formulas
formula methods, 414
Fortran, xii, 7, 377, 424
forward selection, 212
frame, 417
free variables, 418
function arguments, 16, 360–362
functional programming language, 7
functions, 4–7, 14–16, 360–364
anonymous, 26, 378, 412
generic, 6, 10, 15, 24, 31, 37–41, 99, 110, 123, 192, 205, 287, 330, 360–364, 407, 411, 414
graphing, 341–342
method, 39, 99, 110, 407, 411–414
Java, 357
jittering points, 119
joint confidence regions, 189
justification of text, in graphs, 334, 339
lack of fit, testing for, 246, 252, 254, 289, 319, 320
ladder of powers, 132
lasso, 416
lattice graphics, 352–354
lazy evaluation, 361, 363
lazy data, 55
leakage, in component-plus-residual plots, 311
legends, in graphs, 171, 204, 268, 341–342, 353
length of objects, 96
levels, of factors, 121, 157–158, 162
leverage, 33, 116, 155, 286, 296–298, 360
in generalized linear models, 318, 320
leverage plots, 294
lexical scoping, 396, 418–422
library, vs. package, xvii
likelihood ratio tests, 66, 141–142, 191–192
for generalized linear models, 238–239, 243, 253, 267, 271, 272
for selecting transformations, 305–306
line segments, 339
line types, 121, 332, 334, 336
line width, 111, 332, 334, 336
linear algebra, 364–370
linear models, 149–228
linear predictor, 150
linear regression
computing, 367–369
multiple, 155–157, 174
simple, 151–155, 174
linear simultaneous equations, 367
linear-model object, 152
lines, 335, 337
link function, 230–231, 233–234
Linux, xiv, 92
Lisp, 423
Lisp-Stat, xii
local functions, 391, 396, 419, 422
local linear regression, 117
See also lowess
local variables, 16, 362, 391, 396
log rule, 129
log-odds, 133, 234
logarithms, 5–6, 127–131
logical data, 11–14, 44, 48, 67, 78, 80, 226
logical operators, 13, 374
logistic distribution, 270
logistic regression, 229, 233–246
and loglinear models, 256–258
binomial, 284
multinomial, 259–264
nested dichotomies, 264–269
programming, 386–392
proportional-odds, 269–272
logit, 133, 137, 234
empirical, 134
logit link, 234
logit model. See logistic regression
loglinear models, 246, 250–259
and logistic regression, 256–258
preparing data files for, 258
response variables in, 256
sampling plans for, 255–256
for three-way tables, 253
for two-way tables, 250
long integers, 102
loops, 373–375, 382–385
avoiding, 377, 382, 423
lowess, 26–27, 64–66, 117, 121, 126, 148, 345–346, 352, 356
Mac OS X, xiv, xvi–xviii, 1, 17, 20, 22, 29, 50, 54, 64, 92, 146, 189, 339, 357, 358
machine learning, 208
Mahalanobis distance, 124, 147
main diagonal, 370
main effect, 194
maps, 354–356
marginal-model plots, 290–292
marginality principle, 194, 210, 213, 253, 256
margins of plots, 334
text in, 339
masked objects, 59, 418
matching operator, 74
mathematical notation in graphs, 332, 349
matrix algebra, 364–370
matrix inverse, 366
matrix product, 365
matrix transpose, 366
maximum likelihood, 231
mean function, 231
mean-shift outlier model, 294
megabyte (MB), 93
memory use, 92–94, 96, 383, 404–406
meta-characters, 88
method functions, 39, 99, 110, 407, 411–414
missing data, 48, 62–67, 226–227
mixed-effects models, xviii, 413
mode, of objects, 96
model, statistical, 149
model formulas, 30, 151–152, 156, 159, 162, 164, 166, 223–225, 233, 236, 240, 327, 414–415
model selection, 207–213, 228
multinomial distribution, 255
multinomial-logit model, 258–264, 270, 272
multiple imputation, 149
multiple regression, 155
multiple-document interface (MDI), xv
multiplicative errors, 130
multivariate data, plotting, 124–126
multivariate linear models, xviii
multivariate-normal distribution, 140
R Commander, xix, 41
R console, 1–2, 16–17
R profile, xvi
R.app, xvi
R64.app, xvi
ragged array, 381
random number function, 112
rational numbers, 103
real arguments, 361
recoding data, 68–74
recursion, 79, 375–376, 392, 396
recycling, of elements, 8–9
regression diagnostics, 31–36, 285–328
regression function, 116
regression model, assumptions of, 140
regression splines, 177–181
regression tree, 330
regressor variable, 149, 230
regular expressions, 86–91, 105, 396
removing cases, 36
repeated-measures analysis of variance, xviii
reserved symbols, 11
residual deviance, 232
residuals, 286–287
deviance, 318, 319
for generalized linear models, 317–319
ordinary, 286
Pearson, 287, 314, 318, 319
plots of, 287–290
response, 317
standardized, 286
standardized deviance, 318
standardized Pearson, 318
Studentized, 31, 287, 295, 303, 318, 360
retrospective sampling, 256
RGB (red, green, blue) colors, 343
risk factors, 237
robust regression, xviii, 295
rotating plots, 124
Rseek search engine, 21
rug-plots, 111, 174, 250
S, xii–xiii
S-PLUS, xiii
S3, xxi, 38, 39, 77, 406–411, 413, 414, 424
S4, xxi, 38, 39, 406, 410–413, 424
sandwich estimator, 184, 317
SAS, xi, xxi, 43, 48, 53, 56, 69, 157, 215
saturated model, 232, 243, 246
scale functions, in lattice graphics, 354
scale parameter, in generalized linear models, 232, 277
scaled deviance, 232
scatterplots, 111, 115–121, 354, 356
coded, 118
enhanced, 116
jittered, 119
marked by group, 118
three-dimensional, 124
scatterplot matrix, 25, 125
scatterplot smoother, 116
See also kernel regression; lowess
scientific notation, 30
scoping rules, 396, 417–422
score test, 253, 306, 312, 316
score vector, 387
search path, 57, 418
sed, 86
selector variable, 205
shadowed objects, 59, 418
side effects, of functions, 5, 38, 360, 421
sigma constraints, 215
signature, of S4
method, 412
simple linear regression, 151
simulation, 202–207
single-document interface (SDI), xv
singular-value decomposition, 327, 369
singularity, 221–223, 227
slots, of S4
objects, 410, 412
span, 66, 117, 126, 346
spread, 135
spread-level plot, 135, 315
spreadsheet, importing data from, 53
SPSS, xi, 43, 48, 53, 56, 69, 157
stacked-area plots, 272
standard errors, of regression coefficients, 181–183
standardized regression coefficients, 183–184
start, for power transformation, 132
Stata, xii
statistical computing environment, xii
statistical model, 149
statistical packages, xi
statistical-modeling functions, writing, 413–417
stem-and-leaf displays, 110
stepwise regression, 208–213, 228
storage mode, of objects, 96
strip functions, 354
structural-equation models, xviii
Studentized residuals. See residuals, Studentized
subset model, 208
summary graph, 116
supernormality, 303
survival analysis, xviii
Sweave, xxii
system time, 383
t distribution, 112
task views, 20
tests
Type I, 192, 196, 238
Type II, 193–197, 238, 253, 272
Type III, 195–197, 239, 244, 272
See also likelihood ratio tests; Wald tests
text, plotting, 338
See also character data
Text Wrangler, xvi
three-dimensional plots, 124, 354
time-series regression, xviii
timing commands, 383, 403
tracing functions, 376
See also profiling R code
transformations, 126–145, 302
arcsine square root, 133
Box-Cox, 131, 137, 139, 304, 140
Box-Tidwell, 312
inverse, 131
inverse plots, 138–140
for linearity, 138–140, 308–314
log-odds, 133
logarithms, 127, 131
logit, 133, 137
for normality, 140
of percentages, 133
power, 131
predictor, 309–313
of proportions, 133
reciprocal, 131
response, 303–309
of restricted-range variables, 133
simple power, 131
spread, to equalize, 135
variance-stabilizing, 133, 141
Yeo-Johnson, 132, 137, 140, 304
transparency, of colors, 343
transpose, matrix, 366
trellis graphics, 352
See also lattice graphics
tricube weights, 346–349
Tukey’s test for nonadditivity, 289, 420
tuning parameter, 346
typographical conventions, xx
uniform distribution, 112
Unix, xiv, 86
unmodeled heterogeneity, 277
user coordinates, 349
user time, 383
Wald tests, 153, 162, 172, 186, 190–192, 236–237, 239, 262, 272, 277
web searches, 20–21
website for R Companion, xvii
websites
cran.r-project.org , xiv, 20, 105, 352, 390
en.wikipedia.org , 105, 148
socserv.socsci.mcmaster.ca/-jfox/Books/Companion/ , xvii
www.bioconductor.org , 92
www.opengl.org , 148, 356
www.r-project.org , xviii, 21, 414
www.rseek.org , 20, 352
www.sagepub.com , xvii
weighted least squares, 150, 281, 287, 316
Windows, xiv–xviii, xxi, 1, 2, 16, 17, 19, 20, 22, 29, 50, 54, 64, 92, 93, 146, 189, 339, 357, 358
WLS, 150, 281, 287, 316
word-processors, to save output, 17
working directory, 50
working residuals, 283
working response, 283
working weights, 282, 283
workspace, 21, 22, 418
Command Index
* , 3, 4
+ , 3, 4, 7
− , 3, 4
-Inf , 128
… , 361, 362, 364, 414–416
/ , 3, 4
: , 8
< , 13, 14
<- , 9
<= , 13
= , 5, 9, 13
== , 13, 66, 73
> , 13
>= , 13
? , 5, 19, 37
?? , 20
[[ ]] , 82
%*% , 365
%in% , 74
& , 13, 14
&& , 14, 374
{} , 111, 360, 374
basicPowerAxis , 137
bcPower , 132, 304
bcPowerAxis , 137
binom.test , 38
binomial , 236
bootCase , 187
box , 109, 332, 335
boxCox , 304, 305
boxcox , 304
boxCoxVariable , 306
Boxplot , 114, 137
boxplot , 38, 114
boxTidwell , 312, 313
break , 375, 388
browser , 399, 400, 402
bs , 178
bwplot , 168, 204, 353, 354
c , 7, 44, 45, 71
carWeb , xvii, xviii, 20, 49
cat , 388, 409
cbind , 26, 141, 155, 172, 389, 405
ceresPlots , 311, 317
character , 383
chisq.test , 38, 251
chol , 369
choose.dir , 50
class , 39, 40, 157, 407
cloud , 125
coef , 143, 155, 172, 187, 223, 306, 415
coeftest , 186
colMeans , 378
colors , 343
colSums , 378
compareCoefs , 199, 299
complete.cases , 65
complex , 383
confidenceEllipse , 189
confint , 153, 172, 188, 237
constrOptim , 390
contr.helmert , 196, 197, 215
contr.poly , 196, 216–218
contr.SAS , 215
contr.Sum , 216
contr.sum , 196, 215–217, 244
contr.Treatment , 216
contr.treatment , 159, 160, 196, 197, 214–216, 196
contrasts , 159, 214, 227
control , 282
cooks.distance , 298, 317, 320, 363
cor , 38, 218, 369
count.fields , 52
cov , 38
cov2cor , 182
crossprod , 368, 405
crPlots , 34, 310, 317
curve , 341, 341, 342, 349
cut , 70, 71
D , 201
data , 55, 225
data.frame , 67, 184, 259
dataEllipse , 189
dbinom , 38, 113
dchisq , 38, 112, 113
debug , 399, 402, 415
debugger , 399, 402
deltaMethod , 186, 200–202, 317
density , 111
det , 369
detach , 59
dev.cur , 358
dev.list , 358
dev.off , 357
dev.set , 358
deviance , 243
df , 38, 113
df.residual , 243
dfbeta , 301, 317
dfbetas , 301, 317
diag , 370, 401
dim , 76
dnorm , 38, 112, 113, 380
dt , 38, 113
dump.frames , 402
dunif , 113
edit , 47
effect , 164, 173, 174, 176, 181
eigen , 327, 369
else , 370, 372
eval.parent , 415
example , 6
example functions
abs1 , 371
convert2meters , 372
cuberoot , 420
fact1 , 373
fact2 , 374
fact3 , 374
fact4 , 375
fact5 , 375
fact6 , 376
fact7 , 376
fact8 , 376
Glmnet , 416, 417
grad , 391
inflPlot , 360–364, 386
lreg , 407, 408, 414, 415
lreg.default , 407, 408, 410, 414, 415
lreg.formula , 414, 415
lreg1 , 387–389, 399, 402–406
lreg2 , 390–392, 403
lreg3 , 399, 400, 402
lreg4 , 405, 406
lreg5 , 410
makeDigits , 393
makeNumber , 394
makePower , 419, 420
makeScale , 379
myMean , 15, 16, 21
myVar , 19, 21
negLogL , 391, 392
number2words , 394–396
numbers2words , 396, 397
print.lreg , 409, 415
print.summary.lreg , 409
sign1 , 372
sign2 , 372
squareit , 362
summary.lreg , 409
tc , 349
time1 , 383
time2 , 383
time3 , 383
time4 , 384
time5 , 384
tricube , 349
trim , 395, 396
tukeyTest , 421
exp , 128
expand.grid , 267
expression , 140, 349
f , 373
factor , 158, 159
factorial , 373
file.choose , 23, 50
fitted , 239
fix , 47, 400
for , 244, 336, 350, 373, 374
formula , 223
fractions , 367
friedman.test , 38
ftable , 203, 245, 259
function , 15, 26, 360
map , 355
mapplot , 355, 356
mapply , 377, 380, 381
marginalModelPlots , 291, 292, 317
match.arg , 372
match.call , 415
matrix , 75, 364, 383
max , 363
mean , xx, 14, 15, 37, 38, 59–61, 63, 380, 382
median , 38, 63
melt , 257
memory.limit , 93
memory.size , 93
menus
Change dir, 50
Change Working Directory, 50
Edit, xvi, 16, 17
Exit, xx
File, xx, 17, 22, 50, 357
GUI preferences, xvi
Help, xviii, 16, 20
History, 358
Misc, 19, 50
New Document, 17
Open Document, 17
Packages, 1, 54
Packages & Data, 1
R, xvi
R console, 17, 25
Recording, 358
Run line or selection, 17
Save as, 17
Stop, 19, 29, 146, 189, 339
Stop locator, 29
methods , 40
missing , 361
mle , 390
model.matrix , 159, 164
ModelPlots , 292
mtext , 339, 351
multinom , 261
p.arrows , 340
pairs , 25–27, 125
palette , 343, 344
par , 123, 128, 137, 241, 268, 332–334, 339, 347, 349–351
paste , 11, 87
pbinom , 38, 113
pchisq , 38, 113, 244
pdf , 357
persp , 125
pf , 38, 113
pi , 341
pie , 343, 344
plot , 27, 28, 33, 37, 38, 115, 116, 119, 123, 125, 129, 173, 174, 180, 247, 250, 290, 329–335, 337, 338,
347, 351, 356–358, 362–364
plot.default , 330, 331
plot.formula , 331
plot.lm , 330
plotCI , 122, 356
pnorm , 38, 113, 381
points , 27, 171, 247, 329, 334, 335, 342, 348, 355
poisson , 246
polr , 271
poly , 131, 181, 217, 218, 311, 312
polygon , 340
powerTransform , 140–144, 304–306, 309
prcomp , 327, 369
predict , 205, 239, 267
princomp , 327, 369
print , 37, 401, 407, 409, 411
printCoefmat , 409
probabilityAxis , 137, 242
prod , 373
prop.table , 203
prop.test , 38
pt , 38, 113, 191
punif , 113
q , 22
qbinom , 38, 113
qchisq , 38, 112, 113
qf , 38, 113
qnorm , 38, 112, 113
qplot , 356
qqnorm , 112
qqPlot , 31, 32, 112, 295
qr , 369
qt , 38, 113
quantile , 38, 63, 70
quartz , 358
quasi , 233, 276
quasibinomial , 233, 276
quasipoisson , 233, 276, 278
quit , 22
qunif , 113
R window
Quartz, 17
R Commander, 1
R Console, xvi, xx
R.app, 1, 17
RGui, 1, 2, 17
rainbow , 343, 344
range , 38
rbinom , 38, 113, 403
rchisq , 38, 113
read.csv , 48
read.table , 94
read.fwf , 53
read.spss , 53
read.table , 23, 24, 42, 47–50, 53, 54, 56, 67, 94–96, 157–159
readLines , 86
Recall , 376, 396
recode , 70, 71, 73, 264
recover , 402
regsubsets , 213
relevel , 162, 202, 214
remove , 21, 376
rep , 46, 47, 241
repeat , 373, 375
representation , 410
residualPlots , 288–290, 292, 311, 314, 317, 319, 329, 420
residuals , 287, 415
return , 360, 373
rf , 38, 113
rnorm , 10, 38, 113
round , 218
rowMeans , 378
rownames , 389
rowSums , 378
Rprof , 404
RSiteSearch , 20
rstandard , 287
rstudent , 31, 287, 317, 318, 363
rt , 38, 113
rug , 111
runif , 113
t , 366
t.test , 38
table , 23, 38, 48, 67, 90
tail , 396
tapply , 122, 161, 167, 377, 381, 382
tempfile , 404
testTransform , 143, 144
text , 148, 334, 338, 339, 348, 349
title , 332, 351, 355
tolower , 89
trace , 376
traceback , 18, 19, 399, 401
transform , 69, 143
treatment , 196
ts.plot , 38
tukeyNonaddTest , 421, 422
unique , 90
unlink , 404
untrace , 376
update , 36, 154, 155, 163, 421
UseMethod , 39
useOuterStrips , 353
Data Set Index
AMSsurvey , PhDs in the mathematical sciences, 250, 251, 253, 255, 256
Campbell , voter turnout in the 1956 U. S. presidential election, 240, 244, 245, 256
Davis , measured and reported weight and height, 151, 190, 199, 200, 226
DavisThin , drive for thinness scale, 377, 379
Depredations , wolf depredations in Minnesota, 354
Duncan , prestige of U.S. occupations, xx, 23, 24, 27, 29, 30, 37, 56–59, 124, 189, 200, 224, 226, 295, 297,
298, 300, 301, 361, 364
Moore , conformity, partner’s status, and authoritarianism, 166, 175, 176, 196, 216, 217, 222, 227, 381, 382
Mroz , U. S. married women’s labor-force participation, 40, 235, 324, 389, 407, 410
Ornstein , interlocking directorates among Canadian firms, 121, 128, 135, 137, 141, 246, 277, 278, 303,
304, 309, 314, 316, 322
Prestige , prestige of Canadian occupations, 50, 51, 54, 55, 58, 59, 108, 111, 112, 114, 118, 125, 134, 138,
139, 142, 144, 146, 155, 157, 163, 174, 177, 181, 191, 288, 289, 291, 293, 312, 356, 367–369, 416, 417,
420
UN , infant-mortality rates and GDP per capita of nations, 129, 132, 138, 141, 346, 347
Womenlf , Canadian married women’s labor-force participation, 71, 72, 74, 259, 264, 319
Package Index
Note: References to the car package associated with this Companion are suppressed.
base, 58–61
biglm, 92
boot, 187
bootstrap, 187
foreign, 53, 56
iplots, 357
lattice, xx, 125, 168, 177, 203, 204, 330, 352–355, 358
latticeExtra, 353–355
leaps, 213
lme4, 413
lmtest, 186
locfit, 111
maps, 354
MASS, xxi, 228, 271, 279–281, 304, 367, 370
Matrix, 370
mi, 62
mice, 62
nnet, 261
norm, 62
playwith, 357
plotrix, 122, 356
Rcmdr, xix, 41, 42
reshape, 257
rggobi, 125, 357
rgl, 124, 356, 357
RGtk2, 357
rJava, 357
RODBC, 54–56, 96
rpart, 330
sfsmisc, 340
sm, 111
splines, 178
stats, 40, 58, 228
stats4, 390
survey, 92, 256