Introducing R
Germán Rodríguez (grodri@princeton.edu)
Princeton University
Spring 2001, last updated Spring 2011
Online version at http://data.princeton.edu/R
Table of Contents

1 Introduction
  1.1 The R Language and Environment
  1.2 Bibliographic Remarks
2 Getting Started
  2.1 The R Console
  2.2 Expressions and Assignments
  2.3 Vectors and Matrices
  2.4 Simple Graphs
3 Reading Data
  3.1 Lists and Data Frames
  3.2 Free-Format Input
  3.3 Fixed-Format Input
  3.4 Printing Data and Summaries
  3.5 Plotting Data
4 Linear Models
  4.1 Fitting a Model
  4.2 Examining a Fit
  4.3 Extracting Results
  4.4 Factors and Covariates
  4.5 Regression Splines
  4.6 Other Options
5 Generalized Linear Models
  5.1 Variance and Link Families
  5.2 Logistic Regression
  5.3 Updating Models
  5.4 Model Selection
6 Conclusion
References
1 Introduction
R is a powerful environment for statistical computing which runs on several platforms. These notes are written specially for users running the Windows version, but most of the material applies to the Mac and Linux versions as well.
These notes are organized in several sections, as shown in the table of contents above. I have tried to introduce key features of R as they are needed by students in my statistics classes. As a result, I often postpone (or altogether omit) discussion of some of the more powerful features of R as a programming language. Notes of local interest, such as where to find R at Princeton University, appear in framed boxes and are labeled as such. Permission is hereby given to reproduce these pages freely and host them on your own server if you wish. You may add, edit or delete material in the local notes as long as the rest of the text is left unchanged and due credit is given. Obviously I welcome corrections and suggestions for enhancement.
The title function expects a character string with the title as the first argument. We also specified the optional argument col.main="cornflowerblue" to set the color of the title. There are 657 named colors to choose from; type colors() to see their names.

The next example is based on a demo included in the R distribution and is simply meant to show off R's use of colors. We use the pie function to create a chart with 16 slices. The slices are all the same width, but we fill them with different colors obtained using the rainbow function.

> pie(rep(1,16),col=rainbow(16))

Note the use of the rep function to replicate the number one 16 times. To see how one can specify colors and labels for the slices, try calling pie with arguments 1:4, c("r","g","b","w") and col=c("red","green","blue","white").

To save a graph make sure the focus is on the graph window and choose File | Save as from the menu. You get several choices of format, including postscript, which is good for printing, and windows metafile, which is ideal for embedding your graph in another Windows document. Most remarkably, you also get the png format, which makes it easy to include R graphs in web pages such as this, particularly now that this format is supported by all major browsers. R also supports jpeg, but I think png is better than jpeg for statistical plots. All graphs on these pages are in png format.

Alternatively, you can copy the graph to the clipboard by choosing File | Copy to clipboard. You get a choice of two formats. I recommend that you use the metafile format because it's more flexible. You can then paste the graph into a word processing or spreadsheet document. You can also print the graph using File | Print.
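The menu route described above also has a scripted counterpart: open a file device, draw, and close the device. The following is a minimal sketch under stated assumptions. The file name is a hypothetical temporary file, the title text is made up, and the pdf device is used only because it is available in every R installation; png() and postscript() are opened and closed in the same way.

```r
# Draw the labeled pie chart suggested above into a file instead of a window.
out_file <- tempfile(fileext = ".pdf")   # hypothetical temporary file name

pdf(out_file)                            # open a PDF graphics device
pie(1:4,
    labels = c("r", "g", "b", "w"),
    col    = c("red", "green", "blue", "white"))
title("Four slices", col.main = "cornflowerblue")
dev.off()                                # close the device, which writes the file

n_colors <- length(colors())             # number of named colors available
file.exists(out_file)
```

Closing the device with dev.off() is what actually flushes the graph to disk, so a forgotten dev.off() is the usual reason for an empty output file.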
Exercise: Simulate 20 observations from the regression model Y = α + βx + ε using the x vector generated above. Set α = 1 and β = 2. Use standard normal errors generated as rnorm(20), where 20 is the number of observations.
> person = list(name="Jane", age=24)

Typing the name of the list prints all elements. You can extract a component of a list using the extract operator $. For example we can list just the name or age of this person:

> person$name
[1] "Jane"
> person$age
[1] 24

Individual elements of a list can also be accessed using their indices or their names as subscripts. For example we can get the name using person[1] or person["name"]. (You can use single or double square brackets depending on whether you want a list with the name, which is what we did, or just the name, which would require double brackets as in person[[1]] or person[["name"]]. The distinction is not important at this point.)

A data frame is essentially a rectangular array containing the values of one or more variables for a set of units. The frame also contains the names of the variables, the names of the observations, and information about the nature of the variables, including whether they are numerical or categorical.

Internally, a data frame is a special kind of list, where each element is a vector of observations on a variable. Data frames look like matrices, but can have columns of different types. This makes them ideally suited for representing datasets, where some variables can be numeric and others can be categorical.

Data frames (like matrices) can also accommodate missing values, which are coded using the special symbol NA. Most statistical procedures, however, omit all missing values.

Data frames can be created from vectors, matrices or lists using the function data.frame, but more often than not one will read data from an external file, as shown in the next two sections.
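The difference between single and double brackets, and the way a data frame mixes column types and missing values, can be seen in a short sketch. The person list is the one from the text; the people data frame and its values are made up for illustration.

```r
person <- list(name = "Jane", age = 24)

sub_list  <- person["name"]      # single brackets: a list of length one
name_only <- person[["name"]]    # double brackets: the element itself
class(sub_list)                  # "list"
class(name_only)                 # "character"

# A small data frame with one numeric and one categorical column.
people <- data.frame(
  age = c(24, 31, NA),           # NA marks a missing value
  sex = factor(c("F", "M", "F"))
)
str(people)

mean(people$age)                 # NA, because one value is missing
mean(people$age, na.rm = TRUE)   # 27.5 once the missing value is dropped
```

Simple summaries such as mean propagate NA unless na.rm=TRUE is given, whereas model-fitting functions drop incomplete cases by default.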
setting effort change
     87     23     21
     83      4      9
     68      0      7
     84     19     22
     74      3      6
     73      0      2
     84     15     29
     91      7     11
This small dataset includes an index of social setting, an index of family planning effort, and the percent decline in the crude birth rate between 1965 and 1975. The data are available on the web at http://data.princeton.edu/wws509/datasets/ in a file called effort.dat which includes a header with the variable names. R can read the data directly from the web:

> fpe <- read.table("http://data.princeton.edu/wws509/datasets/effort.dat")

The function used to read data frames is called read.table. The argument is a character string giving the name of the file containing the data, but here we have given it a fully qualified URL (uniform resource locator), and that's all it takes.

Alternatively, you could download the data and save them in a local file, or just cut and paste the data from the browser to an editor such as Notepad, and then save them. Make sure the file ends up in R's working directory, which you can find out by typing getwd(). If that is not the case you can use a fully qualified path name or change R's working directory by calling setwd with a string argument. Remember to double up your backward slashes (or use forward slashes instead) when specifying paths.

The special symbol <- is R's assignment operator, which we have encountered already. Here we assigned the data to an object named fpe. To print the data simply type the name of the object.

> fpe
          setting effort change
Bolivia        46      0      1
Brazil         74      0     10
... output edited ...
Venezuela      91      7     11
In this example R detected correctly that the first line in our file was a header with the variable names. It also inferred correctly that the first column had the observation names. (Well, it did so with a little help; I made sure the row names did not have embedded spaces, hence CostaRica. Alternatively, I could have used "Costa Rica" in quotes as a row name.) You can always tell R explicitly whether or not you have a header by specifying the optional argument header=TRUE or header=FALSE to the read.table function. This is important if you have a header but lack row names, because R's guess is based on the fact that the header line has one less entry than the next row, as it did in our example. If your file does not have a header line, R will use the default variable names V1, V2, ..., etc. To override this default use read.table's optional argument col.names to assign variable names. This argument takes a vector of names. So, if our file did not have a header we could have used the command
> fpe = read.table("noheader.dat",
+   col.names=c("setting","effort","change"))

Incidentally this is the first time that our command did not fit in a line. R code can be continued automatically in a new line simply by making it obvious that we are not done, for example ending the line with a comma, or having an unclosed left parenthesis. R responds by prompting for more with the continuation symbol + instead of the usual prompt >.

If your file does not have observation names, R will simply number the observations from 1 to n. You can specify row names using read.table's optional argument row.names, which works just like col.names; type ?data.frame for more information.

There are two closely related functions that can be used to get or set variable and observation names at a later time. These are called names (for the variable names), and row.names (for the observation names). Thus, if our file did not have a header we could have read the data and then changed the default variable names using the names function:

> fpe = read.table("noheader.dat")
> names(fpe) = c("setting","effort","change")

Technical Note: If you have a background in other programming languages you may be surprised to see a function call on the left hand side of an assignment. These are special 'replacement' functions in R. They extract an element of an object and then replace its value.

In our example all three variables were numeric. R will handle string variables with no problem. If one of our variables was sex, coded M for males and F for females, R would have created a factor, which is basically a categorical variable that takes one of a finite set of values called levels. (This is the default behavior in versions of R before 4.0; in later versions add stringsAsFactors=TRUE to the call to read.table.) In Section 5 we will use a data frame with categorical variables to illustrate logistic regression.

Another way to generate factors is by grouping a numeric covariate. An example appears in Section 4 below.
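The two ways of naming variables described above can be tried without creating a file by passing the data as a string through read.table's text argument. This is a minimal sketch: the three records are rows from the fpe listing typed in by hand, and the object names are hypothetical.

```r
# Three rows of data with no header line
raw <- "46 0 1
74 0 10
91 7 11"

# Option 1: supply the variable names while reading
fpe1 <- read.table(text = raw, col.names = c("setting", "effort", "change"))

# Option 2: accept the default names V1, V2, V3 and rename afterwards
fpe2 <- read.table(text = raw)
default_names <- names(fpe2)
names(fpe2) <- c("setting", "effort", "change")

# Observation names can be replaced in the same way
row.names(fpe2) <- c("Bolivia", "Brazil", "Venezuela")
fpe2
```

Both names(fpe2) and row.names(fpe2) on the left of the assignment are uses of the replacement functions mentioned in the technical note.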
Exercise: Use a text editor to create a small file with the following three lines:

a b c
1 2 3
4 5 6

Read this file into R so the variable names are a, b and c. Now delete the first row and read the file again so the variable names are still a, b and c.
Here I assume that the file in question is called fixedformat.dat. I assign column names just as before, using the col.names parameter. The novelty lies in the next argument, called sep, which is used to indicate how the variables are separated. The default is white space, which is appropriate when the variables are separated by one or more blanks or tabs. If the data are separated by commas, a common format with spreadsheets, you can specify sep=",". Here we created a vector with the numbers 1, 3 and 5 to specify the character position (or column) where each variable starts. Type ?read.table for more details.
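R also provides the function read.fwf, which describes a fixed-format file by the width of each field rather than by its starting position. The sketch below is an illustration under stated assumptions, not the call used in these notes: it assumes three fields of two characters each (consistent with start positions 1, 3 and 5), writes three records for Bolivia, Brazil and Chile to a hypothetical temporary file, and reads them back.

```r
# Write three fixed-format records: two characters per variable.
fixed_file <- tempfile(fileext = ".dat")   # hypothetical temporary file
writeLines(c("46 0 1",
             "74 010",
             "891629"), fixed_file)

# Read them back by field width
fpe_fixed <- read.fwf(fixed_file,
                      widths    = c(2, 2, 2),
                      col.names = c("setting", "effort", "change"))
fpe_fixed
```

Note that the third record has no blanks at all between values, which is exactly the situation where free-format input fails and field widths are needed.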
     effort          change
 Min.   : 0.00   Min.   : 0.00
 1st Qu.: 3.00   1st Qu.: 5.50
 Median : 8.00   Median :10.50
 Mean   : 9.55   Mean   :14.30
 3rd Qu.:15.25   3rd Qu.:22.75
 Max.   :23.00   Max.   :40.00
As you can see, you get the min and max, 1st and 3rd quartiles, median and mean. For categorical variables you get a table of counts. Alternatively, you may ask for a summary of a specific variable. Or use the functions mean and var for the mean and variance of a variable, or cor for the correlation between two variables, as shown below:

> mean(effort)
[1] 9.55
> cor(effort,change)
[1] 0.80083

Elements of data frames can be addressed using the subscript notation introduced in Section 2.3 for vectors and matrices. For example to list the countries that had a family planning effort score of zero we can use

> fpe[effort == 0,]
          setting effort change
Bolivia        46      0      1
Brazil         74      0     10
Nicaragua      68      0      7
Peru           73      0      2
This works because the expression effort == 0 selects the rows (countries) where the effort score is zero, while leaving the column subscript blank selects all columns (variables). The fact that the rows are named allows yet another way to select elements: by name. Here's how to print the data for Chile:

> fpe["Chile",]
      setting effort change
Chile      89     16     29

Exercise: Can you list the countries where social setting is high (say above 80) but effort is low (say below 10)? Hint: recall the element-by-element logical operator &.
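The subscripting rules above can be tried on a small stand-alone data frame. This sketch uses six of the rows printed in these notes, typed in by hand; the object names are hypothetical, and the combined condition is deliberately different from the one in the exercise.

```r
fpe6 <- data.frame(
  setting = c(46, 74, 89, 68, 73, 91),
  effort  = c(0, 0, 16, 0, 0, 7),
  change  = c(1, 10, 29, 7, 2, 11),
  row.names = c("Bolivia", "Brazil", "Chile", "Nicaragua", "Peru", "Venezuela")
)

zero_effort <- fpe6[fpe6$effort == 0, ]    # logical row subscript, all columns
chile       <- fpe6["Chile", ]             # row selected by name

# Two conditions combined element by element with &
two_conds <- fpe6[fpe6$setting > 70 & fpe6$change < 5, ]

rownames(zero_effort)
rownames(two_conds)
```

Because fpe6 is not attached here, the columns are referred to as fpe6$effort and so on; with an attached data frame the shorter form effort == 0 used in the text works as well.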
> plot(effort, change, pch=21, bg="gold")
> title("Scatterplot of Change by Effort", col.main="#3366CC")

I used two optional arguments that work well together: pch=21 selects a special plotting symbol, in this case a circle, that can be colored and filled; and bg="gold" selects the fill color for the symbol. I left the perimeter black, but you can change this color with the col argument.

To identify points in a scatterplot use the identify function. Try the following (assuming the scatterplot is still the active graph):

> identify(effort, change, row.names(fpe), ps=9)

The first three parameters to this function are the x and y coordinates of the points and the character strings to be used in labeling them. The ps optional argument specifies the size of the text in points; here I picked 9-point labels.

Now click within a quarter of an inch of a point and the name of the country should appear in the graph. Which country had the most effort but only moderate change? Which one had the most change? To quit identifying points right click on the graph and select Stop from the pop-up menu. The function returns the indices of the units selected. (Click on the R Console to make it the focused window before you type more commands.)

Another interesting plot to try is pairs, which draws a scatterplot matrix. In our example try

> pairs(fpe)

and you will see a 3 by 3 matrix of scatterplots with the variable names down the diagonal and a plot of each variable against every other one.

Before you quit this session consider saving the fpe data frame. To do this use the save function; the companion function load reads the object back in a later session:

> save(fpe, file="fpe.Rdata")
> load("fpe.Rdata")

The first argument to save specifies the object to be saved, and the file argument provides the name of a file, which will be in the working directory unless a full path is given. (Remember to double up your backslashes, or use forward slashes instead.) By default R saves objects using a compact binary format which is portable across all R platforms.
There is an optional argument ascii that can be set to TRUE to save the object as ASCII text. This option was handy to transfer R objects across platforms but is no longer needed. The menu item File | Save Image and its companion File | Load Image can be used to save and load an image of the entire workspace, including all objects that have been created (and not removed) in the session.
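A quick way to convince yourself that save and load round-trip an object is to save it, remove it from the workspace, and load it back. This is a minimal sketch; the two-row data frame and the temporary file name are stand-ins chosen for illustration.

```r
fpe_small <- data.frame(setting = c(46, 74), effort = c(0, 0), change = c(1, 10),
                        row.names = c("Bolivia", "Brazil"))

rdata_file <- tempfile(fileext = ".Rdata")   # hypothetical file name
save(fpe_small, file = rdata_file)

original <- fpe_small
rm(fpe_small)                      # the object is gone from the workspace

restored <- load(rdata_file)       # load() returns the names of the objects it restored
restored
identical(fpe_small, original)     # the restored copy matches the original
```

Note that load restores objects under the names they had when saved, so it can silently overwrite an existing object with the same name.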
Exercise: Use R to create a scatterplot of change by setting, cut and paste the graph into a document in your favorite word processor, and try resizing and printing it. I recommend that you use the windows metafile format for the cut and paste operation.
4 Linear Models
Let us try some linear models, starting with multiple regression and analysis of covariance models, and then moving on to models using regression splines. In this section I will use the data read in Section 3, so make sure the fpe data frame is attached to your current session.
The output includes the model formula and the coefficients. You can get a bit more detail by requesting a summary:

> summary(lmfit)

Call:
lm(formula = change ~ setting + effort)

Residuals:
    Min      1Q  Median      3Q     Max
    ...     ...  0.6384  3.2250 15.8530

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.4511     7.0938  -2.037 0.057516 .
setting       0.2706     0.1079   2.507 0.022629 *
effort        0.9677     0.2250   4.301 0.000484 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.389 on 17 degrees of freedom
Multiple R-Squared: 0.7381, Adjusted R-squared: 0.7073
F-statistic: 23.96 on 2 and 17 DF, p-value: 1.132e-05

The output includes a more conventional table with parameter estimates and standard errors, as well as the residual standard error and multiple R-squared. (By default S-Plus includes the matrix of correlations among parameter estimates, which is often bulky, while R sensibly omits it. If you really need it, add the option correlation=TRUE to the call to summary.)

To get a hierarchical analysis of variance table corresponding to introducing each of the terms in the model one at a time, in the same order as in the model formula, try the anova function:

> anova(lmfit)
Analysis of Variance Table

Response: change
          Df  Sum Sq Mean Sq F value    Pr(>F)
setting    1 1201.08 1201.08  29.421 4.557e-05 ***
effort     1  755.12  755.12  18.497 0.0004841 ***
Residuals 17  694.01   40.82
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Alternatively, you can plot the results using

> plot(lmfit)

This will produce a set of four plots: residuals versus fitted values, a Q-Q plot of standardized residuals, a scale-location plot (square roots of standardized residuals versus fitted values), and a plot of residuals versus leverage that adds bands corresponding to Cook's distances of 0.5 and 1.

R will prompt you to click on the graph window or press Enter before showing each plot, but we can do better. Type par(mfrow=c(2,2)) to set your graphics window to show four plots at once, in a layout with 2 rows and 2 columns. Then redo the graph using plot(lmfit). To go back to a single graph per window use par(mfrow=c(1,1)). There are many other ways to customize your graphs by setting high-level parameters; type ?par to learn more.
Technical Note: You may have noticed that we have used the function plot with all kinds of arguments: one or two variables, a data frame, and now a linear model fit. In R jargon plot is a generic function. It checks for the kind of object that you are plotting and then calls the appropriate (more specialized) function to do the work. There are actually many plot functions in R, including plot.data.frame and plot.lm. For most purposes the generic function will do the right thing and you don't need to be concerned about its inner workings.
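The dispatch mechanism behind a generic function can be illustrated with a toy generic written in the same style. This is only a sketch of the idea: describe and its three methods are hypothetical names invented here, and cars is a small dataset that ships with R.

```r
# A toy generic: UseMethod picks the method that matches the class of x.
describe <- function(x, ...) UseMethod("describe")

describe.default    <- function(x, ...) "some other object"
describe.data.frame <- function(x, ...) paste("data frame with", ncol(x), "variables")
describe.lm         <- function(x, ...) paste("linear model with", length(coef(x)), "coefficients")

fit <- lm(dist ~ speed, data = cars)

describe(cars)    # dispatches to describe.data.frame
describe(fit)     # dispatches to describe.lm
describe(1:3)     # no specific method, so describe.default is used
```

The plot function works the same way, with plot.data.frame and plot.lm playing the role of the specialized methods; typing methods(plot) lists the ones available in your session.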
extracts the fitted values. In this case it will also print them, because we did not assign them to anything. (The longer form fitted.values is an alias.)

To extract the coefficients use the coef function (or the longer form coefficients)

> coef(lmfit)
(Intercept)     setting      effort
-14.4510978   0.2705885   0.9677137

To get the residuals, use the residuals function (or the abbreviation resid):

> residuals(lmfit)
         1          2          3          4          5          6
 3.0040262  4.4275478  3.8853007  3.1323628  0.3996747 15.8530144
... output edited ...
If you are curious to see exactly what a linear model fit produces, try the function

> names(lmfit)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
which lists the named components of a linear fit. All of these objects may be extracted using the $ operator. However, whenever there is a special extractor function you are encouraged to use it.
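The relationship between the extractor functions and the named components can be checked directly. The sketch below uses mtcars, a dataset that ships with R, as a stand-in for the fpe data, so the model and numbers are not those of these notes.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model, not the fpe fit

b  <- coef(fit)         # same values as fit$coefficients
yh <- fitted(fit)       # same values as fit$fitted.values
e  <- residuals(fit)    # same values as fit$residuals

identical(b, fit$coefficients)
all.equal(unname(yh + e), mtcars$mpg)     # fitted value + residual = observed response
names(fit)
```

Using the extractor functions rather than $ keeps your code working for other kinds of fits, where the components may be stored or computed differently.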
As you can see, family planning effort has been treated automatically as a factor, and R has generated the necessary dummy variables for moderate and strong programs treating weak as the reference cell.

Choice of Contrasts: R codes unordered factors using the reference cell or "treatment contrast" method. The reference cell is always the first category which, depending on how the factor was created, is usually the first in alphabetical order. If you don't like this choice, R provides a special function to re-order levels; check out help(relevel).

S codes unordered factors using the Helmert contrasts by default, a choice that is useful in designed experiments because it produces orthogonal comparisons, but has baffled many a new user. Both R and S-Plus code ordered factors using polynomials. To change to the reference cell method for unordered factors use the following call

> options(contrasts=c("contr.treatment","contr.poly"))

Back on to our analysis of covariance fit. You can obtain a hierarchical anova table for the analysis of covariance model using the anova function:

> anova(covfit)
Analysis of Variance Table

Response: change
          Df  Sum Sq Mean Sq F value    Pr(>F)
setting    1 1201.08 1201.08  36.556 1.698e-05 ***
effortg    2  923.43  461.71  14.053 0.0002999 ***
Residuals 16  525.69   32.86
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Type ?anova to learn more about this function.
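The effect of relevel on the reference cell can be seen on a small factor. This is a sketch under stated assumptions: the level labels "Weak", "Moderate" and "Strong" and the five values are made up for illustration and are not taken from the fpe data.

```r
effortg <- factor(c("Weak", "Moderate", "Strong", "Weak", "Strong"))
levels(effortg)                      # alphabetical order, so "Moderate" comes first

effortg2 <- relevel(effortg, ref = "Weak")
levels(effortg2)                     # "Weak" is now the first level

contrasts(effortg2)                  # dummy variables; the reference level is the row of zeros
getOption("contrasts")               # current coding for unordered and ordered factors
```

Releveling changes only which category is absorbed into the constant; the fitted values of a model using the factor are the same either way.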
will generate cubic B-splines with interior knots placed at 66, 74 and 84. This basis will use seven degrees of freedom, four corresponding to the constant, linear, quadratic and cubic terms, plus one for each interior knot. Alternatively, you may specify the number of degrees of freedom you are willing to spend on the fit using the parameter df. For cubic splines R will choose df-4 interior knots placed at suitable quantiles. You can also control the degree of the spline using the parameter degree, the default being cubic.

If you like natural cubic splines, you can obtain a well-conditioned basis using the function ns, which has exactly the same arguments as bs except for degree, which is always three. To fit a natural spline with five degrees of freedom, use the call

> setting.ns <- ns(setting, df=5)

Natural cubic splines are better behaved than ordinary splines at the extremes of the range. The restrictions mean that you save four degrees of freedom. You will probably want to use two of them to place additional knots at the extremes, but you can still save the other two.

To fit an additive model to fertility change using natural cubic splines on setting and effort with only one interior knot each, placed exactly at the median of each variable, try the following call:

> splinefit = lm( change ~ ns(setting, knots=median(setting)) +
+   ns(effort, knots=median(effort)) )

Here we used the parameter knots to specify where we wanted the knot placed, and the function median to calculate the median of setting and effort.

Do you think the linear model was a good fit? Natural cubic splines with exactly one interior knot require the same number of parameters as an ordinary cubic polynomial, but are much better behaved at the extremes.
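The number of columns each basis contributes to the model matrix can be inspected directly. This is a minimal sketch: x is a hypothetical grid of values standing in for setting, and the columns counted here exclude the constant, which lm adds separately.

```r
library(splines)                          # bs() and ns() live in this package, which ships with R

x <- seq(40, 90, by = 2)                  # hypothetical values standing in for setting

b_cubic <- bs(x, knots = c(66, 74, 84))   # cubic B-splines with three interior knots
n_df5   <- ns(x, df = 5)                  # natural cubic spline with five degrees of freedom
n_one   <- ns(x, knots = median(x))       # natural cubic spline with one interior knot

ncol(b_cubic)   # 6 columns; the constant supplies the seventh degree of freedom
ncol(n_df5)     # 5 columns
ncol(n_one)     # 2 columns
```

Each basis is an ordinary numeric matrix, which is why the calls to bs and ns can be placed directly inside a model formula.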
data     to specify a dataset, in case it is not attached
subset   to restrict the analysis to a subset of the data
> X <- cbind(1,effort,setting)
> solve( t(X) %*% X ) %*% t(X) %*% change
            [,1]
[1,] -14.4510978
[2,]   0.9677137
[3,]   0.2705885

Compare these results with coef(lmfit).
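The same comparison can be run end to end on a dataset that ships with R. The sketch below uses mtcars as a stand-in for the fpe data, and also shows a variant that solves the normal equations with crossprod instead of forming the inverse explicitly, which is a common numerical preference rather than anything these notes prescribe.

```r
X <- cbind(1, mtcars$wt, mtcars$hp)       # model matrix with a column of ones
y <- mtcars$mpg

b_direct <- solve(t(X) %*% X) %*% t(X) %*% y       # the formula used in the text
b_solve  <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) b = X'y directly
b_lm     <- coef(lm(mpg ~ wt + hp, data = mtcars))

all.equal(as.vector(b_direct), unname(b_lm))
all.equal(as.vector(b_solve),  unname(b_lm))
```

All three agree up to rounding error; lm itself uses a QR decomposition, which is more stable than either version when predictors are highly correlated.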
Family             Variance           Link
gaussian           gaussian           identity
binomial           binomial           logit, probit or cloglog
poisson            poisson            log, identity or sqrt
Gamma              Gamma              inverse, identity or log
inverse.gaussian   inverse.gaussian   1/mu^2
quasi              user-defined       user-defined

As can be seen, each of the first five choices has an associated variance function (for binomial the binomial variance μ(1-μ)), and one or more choices of link functions (for binomial the logit, probit or complementary log-log).

As long as you want the default link, all you have to specify is the family name. If you want an alternative link, you must add a link argument. For example to do probits you use

> glm( formula, family=binomial(link=probit))

The last family on the list, quasi, is there to allow fitting user-defined models by maximum quasi-likelihood.
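A family is an ordinary R object, so its variance and link functions can be inspected and evaluated without fitting anything. This is a small sketch; the object names and the three values of mu are chosen for illustration.

```r
fam_logit  <- binomial()                   # default link for the binomial family
fam_probit <- binomial(link = "probit")    # alternative link

fam_logit$link                 # "logit"
fam_probit$link                # "probit"

mu <- c(0.1, 0.5, 0.9)
fam_logit$variance(mu)         # binomial variance function, mu * (1 - mu)
fam_logit$linkfun(mu)          # logits, log(mu / (1 - mu))
fam_probit$linkfun(mu)         # probits, the same as qnorm(mu)

poisson()$link                 # "log", the default for the Poisson family
```

Printing a family object, for example by typing binomial(link = "probit") at the prompt, shows the family and link in use, which is a handy check before a long fit.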
age     education  wantsMore  notUsing  using
<25     low        yes              53      6
<25     low        no               10      4
<25     high       yes             212     52
<25     high       no               50     10
25-29   low        yes              60     14
25-29   low        no               19     10
25-29   high       yes             155     54
25-29   high       no               65     27
30-39   low        yes             112     33
30-39   low        no               77     80
30-39   high       yes             118     46
30-39   high       no               68     78
40-49   low        yes              35      6
40-49   low        no               46     48
40-49   high       yes               8      8
40-49   high       no               12     31
The data are available from the datasets section of the website for my generalized linear models course. Visit http://data.princeton.edu/wws509/datasets to read a short description and follow the link to cuse.dat.

Of course the data can be downloaded directly from R:

> cuse <- read.table("http://data.princeton.edu/wws509/datasets/cuse.dat",
+   header=TRUE)
> cuse
  age education wantsMore notUsing using
1 ...

I specified the header parameter as TRUE, because otherwise it would not have been obvious that the first line in the file has the variable names. There are no row names specified, so the rows will be numbered from 1 to 16. Print cuse to make sure you got the data in all right. Then make it your default dataset:

> attach(cuse)

Let us first try a simple additive model where contraceptive use depends on age, education and wantsMore:

> lrfit <- glm( cbind(using, notUsing) ~
+   age + education + wantsMore , family = binomial)

There are a few things to explain here. First, the function is called glm and I have assigned its value to an object called lrfit (for logistic regression fit). The first argument of the function is a model formula, which defines the response and linear predictor.

With binomial data the response can be either a vector or a matrix with two columns.
If the response is a vector, it is treated as a binary factor with the first level representing "failure" and all others representing "success". In this case R generates a vector of ones to represent the binomial denominators.
Alternatively, the response can be a matrix where the first column shows the number of "successes" and the second column shows the number of "failures". In this case R adds the two columns together to produce the correct binomial denominator.
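Both forms of the response can be compared on a tiny made-up example with three successes and one failure. This is only a sketch: the outcomes are invented, and an intercept-only model is used so that the fitted probability can be checked by hand against 3/4.

```r
# Vector form: a factor whose first level ("no") is treated as failure
y <- factor(c("no", "yes", "yes", "yes"))
levels(y)

fit_vec <- glm(y ~ 1, family = binomial)
p_vec   <- unname(plogis(coef(fit_vec)))      # estimated probability of "yes"

# Matrix form: columns are (successes, failures) for each of two groups
counts  <- cbind(c(1, 2), c(1, 0))
fit_mat <- glm(counts ~ 1, family = binomial)
p_mat   <- unname(plogis(coef(fit_mat)))      # pooled probability of success

c(p_vec, p_mat)
```

Both fits estimate the same probability; the matrix form simply carries the binomial denominators along with the counts, which is what grouped data such as the contraceptive use table require.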
Because the latter approach is clearly the right one for us I used the function cbind to create a matrix by binding the column vectors containing the numbers using and not using contraception.

Following the special symbol ~ that separates the response from the predictors, we have a standard Wilkinson-Rogers model formula. In this case we are specifying main effects of age, education and wantsMore. Because all three predictors are categorical variables, they are treated automatically as factors, as you can see by inspecting the results:

> lrfit

Call: glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
    family = binomial)

Coefficients:
    age25-29      age30-39      age40-49  educationlow
      0.3894        0.9086        1.1892       -0.3250

Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:     165.8
Residual Deviance: 29.92    AIC: 113.4

Recall that R sorts the levels of a factor in alphabetical order. Because <25 comes before the other age groups, it has been picked as the reference cell for age. Similarly, high is the reference cell for education because high comes before low! Finally, R picked no as the base for wantsMore.

If you are unhappy about these choices you can (1) use relevel to change the base category, or (2) define your own indicator variables. I will use the latter approach by defining indicators for women with high education and women who want no more children:

> noMore <- wantsMore == "no"
> hiEduc <- education == "high"

Now try the model again:

> glm( cbind(using,notUsing) ~ age + hiEduc + noMore, family=binomial)

Call: glm(formula = cbind(using, notUsing) ~ age + hiEduc + noMore,
    family = binomial)

Coefficients:
(Intercept)     age25-29     age30-39     age40-49       hiEduc       noMore
    -1.9662       0.3894       0.9086       1.1892       0.3250       0.8330

Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:     165.8
Residual Deviance: 29.92    AIC: 113.4

The residual deviance of 29.92 on 10 d.f. is highly significant:

> 1-pchisq(29.92,10)
[1] 0.0008828339
so we need a better model. One of my favorites introduces an interaction between age and desire for no more children:

> lrfit <- glm( cbind(using,notUsing) ~ age * noMore + hiEduc , family=binomial)
> lrfit

Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc,
    family = binomial)

Coefficients:
    (Intercept)         age25-29         age30-39         age40-49
       -1.80317          0.39460          0.54666          0.57952
         noMore           hiEduc  age25-29:noMore  age30-39:noMore
        0.06622          0.34065          0.25918          1.11266
age40-49:noMore
        1.36167

Degrees of Freedom: 15 Total (i.e. Null);  7 Residual
Null Deviance:     165.8
Residual Deviance: 12.63    AIC: 102.1

Note how R built the interaction terms automatically, and even came up with sensible labels for them. The model's deviance of 12.63 on 7 d.f. is not significant at the conventional five per cent level, so we have no evidence against this model.

To obtain more detailed information about this fit try the summary function:

> summary(lrfit)

Call:
glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc,
    family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.30027  -0.66163  -0.03286   0.81945   1.73851

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     -1.80317    0.18018 -10.008  < 2e-16 ***
age25-29         0.39460    0.20145   1.959  0.05013 .
age30-39         0.54666    0.19842   2.755  0.00587 **
age40-49         0.57952    0.34733   1.669  0.09522 .
noMore           0.06622    0.33064   0.200  0.84126
hiEduc           0.34065    0.12576   2.709  0.00676 **
age25-29:noMore  0.25918    0.40970   0.633  0.52699
age30-39:noMore  1.11266    0.37398   2.975  0.00293 **
age40-49:noMore  1.36167    0.48422   2.812  0.00492 **
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 165.772  on 15  degrees of freedom
Residual deviance:  12.630  on  7  degrees of freedom
AIC: 102.14

Number of Fisher Scoring iterations: 3

R follows the popular custom of flagging significant coefficients with one, two or three stars depending on their p-values. Try plot(lrfit). You get the same plots as in a linear model, but adapted to a generalized linear model; for example the residuals plotted are deviance
residuals (the square root of the contribution of an observation to the deviance, with the same sign as the raw residual). The functions that can be used to extract results from the fit include
  residuals or resid, for the deviance residuals
  fitted or fitted.values, for the fitted values (estimated probabilities)
  predict, for the linear predictor (estimated logits)
  coef or coefficients, for the coefficients, and
  deviance, for the deviance.
Some of these functions have optional arguments; for example, you can extract five different types of residuals, called "deviance", "pearson", "response" (response - fitted value), "working" (the working dependent variable in the IRLS algorithm - linear predictor), and "partial" (a matrix of working residuals formed by omitting each term in the model). You specify the one you want using the type argument, for example residuals(lrfit,type="pearson").
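The extractor functions listed above can be exercised on the additive model for the contraceptive use data. This is a sketch under stated assumptions: the 16 rows are typed in by hand rather than read from the web, expand.grid is used only as a compact way to build the three classification variables, and the data argument replaces the attach call used in the text.

```r
# Contraceptive use data, 16 groups, entered by hand
cuse <- expand.grid(wantsMore = c("yes", "no"),
                    education = c("low", "high"),
                    age       = c("<25", "25-29", "30-39", "40-49"))
cuse$notUsing <- c(53, 10, 212, 50, 60, 19, 155, 65, 112, 77, 118, 68, 35, 46, 8, 12)
cuse$using    <- c( 6,  4,  52, 10, 14, 10,  54, 27,  33, 80,  46, 78,  6, 48, 8, 31)

fit <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
           family = binomial, data = cuse)

dev_res  <- residuals(fit)                      # deviance residuals are the default type
pear_res <- residuals(fit, type = "pearson")

sum(dev_res^2)        # adds up to deviance(fit)
sum(pear_res^2)       # Pearson's chi-squared statistic

# Goodness-of-fit p-value based on the residual deviance
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# Fitted probabilities are the inverse logit of the linear predictor
all.equal(plogis(predict(fit)), fitted(fit))
```

Using lower.tail = FALSE gives the same p-value as 1 - pchisq(...) but is more accurate when the p-value is very small.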
Adding the interaction has reduced the deviance by 17.288 at the expense of 3 d.f. If the argument to anova is a single model, the function will show the change in deviance obtained by adding each of the terms in the order listed in the model formula, just as it did for linear models. Because this requires fitting as many models as there are terms in the formula, the function may take a while to complete its calculations.
The anova function lets you specify an optional test. The usual choices will be "F" for linear models and "Chisq" for generalized linear models. Adding the parameter test="Chisq" adds p-values next to the deviances. In our case

> anova(lrfit,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: cbind(using, notUsing)

Terms added sequentially (first to last)

           Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                          15    165.772
age         3   79.192        12     86.581 4.575e-17
noMore      1   49.693        11     36.888 1.798e-12
hiEduc      1    6.971        10     29.917     0.008
age:noMore  3   17.288         7     12.630     0.001
We can see that all terms were highly significant when they were introduced into the model.
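When anova is given two nested models instead of one, it tests the terms that distinguish them. The sketch below compares the additive model with the model that adds the age by noMore interaction. It assumes the contraceptive use data typed in by hand, builds the indicators as columns of the data frame, and uses hypothetical object names.

```r
# Contraceptive use data, 16 groups, entered by hand
cuse <- expand.grid(wantsMore = c("yes", "no"),
                    education = c("low", "high"),
                    age       = c("<25", "25-29", "30-39", "40-49"))
cuse$notUsing <- c(53, 10, 212, 50, 60, 19, 155, 65, 112, 77, 118, 68, 35, 46, 8, 12)
cuse$using    <- c( 6,  4,  52, 10, 14, 10,  54, 27,  33, 80,  46, 78,  6, 48, 8, 31)
cuse$noMore   <- cuse$wantsMore == "no"
cuse$hiEduc   <- cuse$education == "high"

additive <- glm(cbind(using, notUsing) ~ age + noMore + hiEduc,
                family = binomial, data = cuse)
with_int <- update(additive, . ~ . + age:noMore)   # add the interaction to the same fit

cmp <- anova(additive, with_int, test = "Chisq")
cmp

# The Deviance column of the second row is the difference between the two fits
deviance(additive) - deviance(with_int)
```

The update function refits the model after editing its formula, where the dot stands for whatever was already on that side of the formula.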
The basic idea of the procedure is to start from a given model (which could well be the null model) and take a series of steps by either deleting a term already in the model or adding a term from a list of candidates for inclusion, called the scope of the search and defined, of course, by a model formula.

Selection of terms for deletion or inclusion is based on Akaike's information criterion (AIC). R defines AIC as

  -2 × maximized log-likelihood + 2 × number of parameters

(S-Plus defines it as the deviance plus twice the number of parameters in the model. The two definitions differ by a constant, so differences in AIC are the same in the two environments.) The procedure stops when the AIC criterion cannot be improved.

In R all of this work is done by calling a couple of functions, add1 and drop1, that consider adding or dropping a term from a model. These functions can be very useful in model selection, and both of them accept a test argument just like anova.

Consider first drop1. For our logistic regression model,

> drop1(lrfit, test="Chisq")
Single term deletions

Model:
cbind(using, notUsing) ~ age + noMore + hiEduc + age:noMore
           Df Deviance     AIC    LRT   Pr(Chi)
<none>          12.630 102.137
hiEduc      1   20.099 107.607  7.469 0.0062755 **
age:noMore  3   29.917 113.425 17.288 0.0006167 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Obviously we can't drop any of these terms. Note that R considered dropping the main effect of education and the age by want no more interaction, but did not examine the main effects of age or want no more, because one would not drop these main effects while retaining the interaction.

The sister function add1 requires a scope to define the additional terms to be considered. In our example we will consider all possible two-factor interactions:

> add1(lrfit, ~.^2,test="Chisq")
Single term additions

Model:
cbind(using, notUsing) ~ age + noMore + hiEduc + age:noMore
              Df Deviance     AIC   LRT Pr(Chi)
<none>             12.630 102.137
age:hiEduc     3    5.798 101.306 6.831 0.07747 .
noMore:hiEduc  1   10.824 102.332 1.806 0.17905
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
We see that neither of the missing two-factor interactions is significant by itself at the conventional five percent level. (However, they happen to be jointly significant.) Note that the model with the age by education interaction has a lower AIC than our starting model.

The step function will do an automatic search. Here we let it search in a scope defined by all two-factor interactions:

> search <- step(additive, ~.^2)
... trace output suppressed ...

The step function produces detailed trace output that we have suppressed. The returned object, however, includes an anova component that summarizes the search:

> search$anova
             Step Df   Deviance Resid. Df Resid. Dev      AIC
1                 NA         NA        10  29.917222 113.4251
2    + age:noMore -3 -17.287669         7  12.629553 102.1375
3    + age:hiEduc -3  -6.831288         4   5.798265 101.3062
4 + hiEduc:noMore -1  -3.356777         3   2.441488  99.9494
As you can see, the automated procedure introduced, one by one, all three remaining two-factor interactions, to yield a final AIC of 99.9. This is an example where AIC, by requiring a deviance improvement of only 2 per parameter, may have led to overfitting the data. Some analysts prefer a higher penalty per parameter. In particular, using log(n) instead of 2 as a multiplier yields BIC, the Bayesian Information Criterion. In our example log(1607) = 7.38, so we would require a deviance reduction of 7.38 per additional parameter. The step function accepts k as an argument, with default 2. You may verify that specifying k=log(1607) leads to a much simpler model; not only are no new interactions introduced, but the main effect of education is dropped (even though it is significant).
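The penalty arithmetic can be sketched directly from the numbers reported above, without rerunning the search. This is only a rough check under the assumption that the deviance reductions from the tables are the relevant ones at each step; it is not a substitute for calling step with k=log(1607). The object names are my own.

```r
# Values copied from the search$anova table above
dev.drop <- c("age:noMore" = 17.287669, "age:hiEduc" = 6.831288,
              "hiEduc:noMore" = 3.356777)
df  <- c(3, 3, 1)
aic <- c(113.4251, 102.1375, 101.3062, 99.9494)

# With penalty k per parameter, adding a term improves the criterion by
# (deviance reduction) - k * (number of parameters added)
gain <- function(k) dev.drop - k * df

aic.gain <- gain(2)
print(rbind(from.table = -diff(aic), by.hand = aic.gain))  # rows agree

# With k = log(n) none of the interactions is worth its cost
n <- 1607
bic.gain <- gain(log(n))
print(bic.gain)

# Education reduced the deviance by 6.971 on 1 d.f. (anova table above),
# less than the log(n) penalty, so dropping it improves BIC
hiEduc.gain <- 6.971 - log(n)
print(hiEduc.gain)
```

All three AIC gains are positive, which is why each interaction was added in turn, while the corresponding BIC gains are all negative.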
6 Conclusion
These notes have hardly scratched the surface of R, which has many more statistical functions. These include functions to calculate the density, cdf, and inverse cdf of distributions such as chi-squared, t, F, lognormal, logistic and others. The survival library includes methods for the estimation of survival curves, tests of differences between survival curves, and Cox proportional hazards models. The library nlme includes code for fitting linear mixed effect models (including multilevel models) to normally distributed data. Many new statistical procedures are first made available to the research community in the form of S-Plus and R functions.

In addition, R is a full-fledged programming language, with a rich complement of mathematical functions, matrix operations and control structures. If you would like to have a function to compute logits, for example, you can write one just like this:

logit <- function(p) { log(p/(1-p)) }

This function takes as argument a vector of proportions and returns the logits. (The last quantity calculated in a function is returned by default.) Of course this is a very primitive version, because there is no argument checking. A somewhat better version is this:

logit <- function(p) {
    if (!is.numeric(p) || any(p<0 | p>1))
        stop("argument must be probabilities between 0 and 1")
    log(p/(1-p))
}

The function any called with a logical vector returns true if any element of the vector is true. Of course a value may be in the range (0,1) but so close to either extreme that calculation of the logit could fail; bullet-proofing the function would require more sophisticated code, but the version above is serviceable.

R is an interpreted language but it is reasonably fast, particularly if you take advantage of the fact that operations are vectorized and try to avoid looping. Where efficiency is crucial you can always write a function in a compiled language such as C or Fortran and then call it from R. Some of my work on multilevel generalized linear models uses this approach. To learn more about programming R read Venables and Ripley (2000), Chambers (2008), and the manual on Writing R Extensions that comes with the R distribution.
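As a small illustration of two points made in this section, the sketch below compares the logit function with qlogis, the inverse cdf (quantile function) of the logistic distribution that comes with R, and with an explicit loop. The test vector p and the name slow are my own choices.

```r
logit <- function(p) {
    if (!is.numeric(p) || any(p < 0 | p > 1))
        stop("argument must be probabilities between 0 and 1")
    log(p/(1-p))
}

p <- c(0.1, 0.25, 0.5, 0.9)

# The built-in logistic quantile function computes the same quantity,
# and the logistic cdf plogis inverts it
print(all.equal(logit(p), qlogis(p)))
print(all.equal(plogis(logit(p)), p))

# The vectorized call replaces an explicit loop like this one
slow <- numeric(length(p))
for (i in seq_along(p)) slow[i] <- log(p[i]/(1 - p[i]))
print(all.equal(slow, logit(p)))

# Arguments outside [0,1] are rejected
bad <- try(logit(2), silent = TRUE)
print(inherits(bad, "try-error"))
```

The loop and the vectorized call give the same answer; the vectorized form is shorter and avoids the overhead of interpreting the loop body once per element.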
References
Becker, Richard A. and John M. Chambers (1984). S: An Interactive Environment for Data Analysis and Graphics. Wadsworth, CA.

Becker, Richard A.; John M. Chambers and Allan R. Wilks (1988). The New S Language. Chapman & Hall, London.

Braun, W. John and Duncan J. Murdoch (2007). A First Course in Statistical Programming with R. Cambridge University Press, Cambridge.

Chambers, John M. (1998). Programming with Data. Springer, New York.

Chambers, John M. (2008). Software for Data Analysis: Programming with R. Springer, New York.

Chambers, John M. and Trevor J. Hastie, Editors (1992). Statistical Models in S. Chapman & Hall, London.

Dalgaard, Peter (2008). Introductory Statistics with R. 2nd Edition. Springer, New York.

Everitt, Brian and Torsten Hothorn (2006). A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, FL.

Fox, John (2002). An R and S-Plus Companion to Applied Regression. Sage Publications, Thousand Oaks, CA.

Murrell, Paul (2005). R Graphics. Chapman & Hall/CRC, Boca Raton, FL.

Pinheiro, Jose C. and Douglas M. Bates (2000). Mixed-Effects Models in S and S-Plus. Springer, New York.

Therneau, Terry M. and Patricia M. Grambsch (2000). Modeling Survival Data: Extending the Cox Model. Statistics for Biology and Health. Springer, New York.

Venables, William N. and Brian D. Ripley (2000). S Programming. Springer, New York.

Venables, William N. and Brian D. Ripley (2002). Modern Applied Statistics with S. Fourth Edition. Springer, New York. (Earlier editions published in 1994, 1997 and 1999.)