Introduction To R: Benny Yakir
Benny Yakir
R is a freely distributed software for data analysis. In order to introduce R let us quote the first paragraphs from the manual Introduction to R by W. N. Venables, D. M. Smith and the R Development Core Team. (The full document, as well as access to the installation of the software itself, is available online at http://cran.r-project.org/ ):

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either directly at the computer or on hardcopy, and a well developed, simple and effective programming language (called S) which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed most of the system supplied functions are themselves written in the S language.)

The term "environment" is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software.

R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
R may be obtained as source code or installed from pre-compiled binaries on the Linux, Macintosh and Windows operating systems. Programming in R for this book was carried out under Windows. You may find more detailed information regarding the installation of R on the Windows operating system at http://www.biostat.jhsph.edu/ kbroman/Rintro/Rwin.html.

After installing R under the Windows operating system an icon is added to the desktop. Double-clicking that icon opens the window of the R system, which contains the R Console sub-window. We found it convenient to have a separate working directory for each project. One way to arrange this is to copy the R icon into that directory and to set the working directory by copying its path (in double quotes) into the appropriate box ("Start in:") in the Shortcut tab of the Properties of the icon (which can be selected by right-clicking the icon).

The R language is an interactive, expression-oriented programming language. The elementary commands may consist of expressions, which are immediately evaluated, printed to the standard output and lost. Alternatively, expressions can be assigned to objects, which store the result of evaluating the expression. In the latter case the result is not printed to the screen. These objects are accessible for the duration of the session, and are lost at the end of the session unless they are actively stored. At the end of the session the user is prompted to store the entire workspace image, including all objects that were created during the session. If Yes is selected then the objects used in the current session will be available in the next. If No is selected then only objects from the last saved image will remain. Commands are separated either by a semi-colon (;) or by a newline.
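As a minimal sketch of the distinction between evaluating an expression and assigning it, the lines below can be typed into the console or run as a script. The object names y and z are hypothetical and serve only as an illustration.

```r
# An expression on its own is evaluated and printed; its value is not kept.
1 + 2

# Assigning the expression to an object stores the value and prints nothing.
y <- 1 + 2

# Two commands on one line, separated by a semi-colon:
# the first stores a value in z, the second prints it.
z <- y * 2; z

# ls() lists the names of the objects currently held in the workspace.
ls()
```

The objects y and z remain available until the end of the session, or longer if the workspace image is saved.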
Consider the following example, which we type into the R Console window:

> x <- c(1,2,3,4,5,6)
> x
[1] 1 2 3 4 5 6

Note that in the first line we created an object named x (a vector of length 6, which stores the values 1, ..., 6). In the second line we evaluated the expression x, which printed out the actual values stored in x. In the formation of the object x we applied the concatenation function c. This function takes its inputs and combines them to form a vector.

Once created, an object can be manipulated in order to create new objects. Different operations and functions can be applied to the object. The resulting objects, in turn, can be stored under a new name or under the previous name. In the latter case, the content of the object is replaced by the new content. Continuing the example:
> x*2
[1]  2  4  6  8 10 12
> x
[1] 1 2 3 4 5 6
> x <- x*2
> x
[1]  2  4  6  8 10 12

Observe that the original content of x was not changed by the multiplication by two. The change took place only when we deliberately assigned new values to the object x.

Say we want to compute the average of the vector x. The function mean can be applied to produce:

> mean(x)
[1] 7

A more complex task is to compute the average of a subset of x, say the values larger than 6. A sub-vector is selected through the vector index, which is accessed with square brackets next to the object. Indexing can be implemented in several ways, including the standard indexing of a sequence using integers. An alternative method of indexing, which is natural in many applications, is via a vector with logical TRUE/FALSE components. Consider the following example:

> x > 6
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
> x[x > 6]
[1]  8 10 12
> mean(x[x > 6])
[1] 10
Observe that the vector x > 6 is a logical vector of the same length as the vector x. Only the components of x that correspond to components with a TRUE value in the logical indexing vector are selected. In the last line of the example the resulting object is used as the input to the function mean, which produces the expected value of 10. For comparison consider a different example:

> x*(x > 6)
[1]  0  0  0  8 10 12
> mean(x*(x > 6))
[1] 5
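The two averages differ because they divide the same sum by different counts. A minimal sketch, assuming x holds the doubled values 2, 4, ..., 12 from the session above (it is re-created here so that the lines run on their own):

```r
x <- c(2, 4, 6, 8, 10, 12)

# Both computations add up the same selected values: 8 + 10 + 12 = 30.
sum(x[x > 6])
sum(x * (x > 6))

# Subsetting keeps only 3 components, so the mean is 30 / 3 = 10.
length(x[x > 6])
mean(x[x > 6])

# Multiplying keeps all 6 components (three of them zero),
# so the mean is 30 / 6 = 5.
length(x * (x > 6))
mean(x * (x > 6))
```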
In this example we multiplied a vector x with integer values by a vector of logical values (x > 6). The result was a vector of length 6 with zero components where the logical vector takes the value FALSE and the original values of x where the logical vector takes the value TRUE. Two points should be noted. First, observe that R can interpret a product of a numeric vector and a logical vector in a reasonable way. Other programming languages might produce an error message in such a circumstance. In this case, R translates the logical vector into a numeric vector with the value one for TRUE and zero for FALSE. The outcome, a product of two numeric vectors, is again a numeric vector. The second point is that multiplication of two vectors is conducted term by term. It is not the inner product between vectors. A different operator (%*%) is used in R in order to perform inner products.

Next let us try to program in R functions that simulate the processes of meiosis and mating. Imagine a given locus on a given chromosome with two possible variant forms (alleles). One is denoted by A and the other by a. We have in mind an animal, say a laboratory mouse, that produces gametes. Assume that the paternal allele of the mouse is A and the maternal allele is a.

> n <- 9
> pat <- rep("A",n)
> mat <- rep("a",n)
> pat
[1] "A" "A" "A" "A" "A" "A" "A" "A" "A"
> mat
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a"
> mode(pat)
[1] "character"

Observe that pat and mat are vectors of character strings. They are of length 9. The function rep is useful for producing repetitions of patterns. Combining it with the function seq, which forms regular sequences, is useful at times.

The gametes that are produced by the process of meiosis may be of either of the two types at the given locus. According to Mendel's first law, the law of segregation, the two types are equally probable:

> from.mat <- rbinom(n,1,0.5)
> offspring <- pat
> from.mat==1
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> offspring[from.mat==1]
[1] "A" "A" "A" "A" "A" "A" "A"
> mat[from.mat==1]
[1] "a" "a" "a" "a" "a" "a" "a"
> offspring[from.mat==1] <- mat[from.mat==1]
> offspring
[1] "a" "a" "A" "a" "a" "a" "a" "A" "a"
> 2:6
[1] 2 3 4 5 6
> offspring[2:6]
[1] "a" "A" "a" "a" "a"
> offspring[-(2:6)]
[1] "a" "a" "A" "a"
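Because rbinom draws random numbers, repeating the session above gives a different vector from.mat each time. The sketch below is a checkable variant in which from.mat is fixed by hand (an assumption made for illustration, standing in for the random draw), so that the results can be verified component by component.

```r
n <- 9
pat <- rep("A", n)
mat <- rep("a", n)

# Fixed indicator vector: an assumption that replaces the call to rbinom.
from.mat <- c(1, 1, 0, 1, 1, 1, 1, 0, 1)

offspring <- pat
# Replace the paternal allele by the maternal one wherever from.mat equals 1.
offspring[from.mat == 1] <- mat[from.mat == 1]
offspring

# sum() applied to a logical vector counts its TRUE components.
sum(from.mat == 1)

# Positive integers select components; negative integers exclude them.
offspring[2:6]
offspring[-(2:6)]
```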
In the final part of this demonstration we use integers to index components of the vector. The minus sign is used in order to identify the indices we would like to exclude.

It would be convenient to wrap the code that produces random gametes in a function. R functions are produced with the aid of the function function. The arguments of the produced function are listed in the parentheses. The function evaluates the expression that follows. Several expressions can be combined into one by placing them within curly brackets. The output of the function may be specified with the function return. Otherwise, the function returns the value of the last expression evaluated.

> meiosis <- function(GF,GM)
+ {
+ from.GM <- rbinom(length(GF),1,0.5)
+ GS <- GF
+ GS[from.GM==1] <- GM[from.GM==1]
+ return(GS)
+ }
> meiosis(pat,mat)
[1] "A" "a" "a" "a" "A" "a" "A" "a" "A"

In the process of mating, a father mouse donates its gametes (sperm) to the mother. These gametes merge with the mother's gametes (eggs) to produce the offspring.

> male <- list(pat=rep("A",n),mat=rep("a",n))
> female <- list(pat=rep("a",n),mat=rep("a",n))
We consider a father that is heterozygous at the given locus and a mother that is homozygous at that locus. An animal is represented by a list, which contains the two grand-parental alleles. Lists in R are a special type of vector. Each entry of a vector of type list can store any type of object, regardless of the type of the objects in the other entries. Here each entry stores a vector. The entries in this example are given names, which can then be used in order to refer to the entry. We demonstrate this with the function cross, which simulates the formation of offspring by crossing a mother mouse with a father mouse:

> cross <- function(fa,mo)
+ {
+ pat <- meiosis(fa$pat,fa$mat)
+ mat <- meiosis(mo$pat,mo$mat)
+ return(list(pat=pat, mat=mat))
+ }

Running the function produces:

> cross(male,female)
$pat
[1] "A" "a" "a" "A" "A" "a" "a" "A" "A"

$mat
[1] "a" "a" "a" "a" "a" "a" "a" "a" "a"
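The named entries of the returned list can be used to inspect the simulated offspring. The following is a sketch only: the definitions of meiosis and cross are repeated from the text so that the lines run on their own, while the seed, the larger value of n, and the object names offspring and genotype are assumptions introduced for illustration.

```r
# Definitions repeated from the text.
meiosis <- function(GF, GM) {
  from.GM <- rbinom(length(GF), 1, 0.5)
  GS <- GF
  GS[from.GM == 1] <- GM[from.GM == 1]
  return(GS)
}
cross <- function(fa, mo) {
  pat <- meiosis(fa$pat, fa$mat)
  mat <- meiosis(mo$pat, mo$mat)
  return(list(pat = pat, mat = mat))
}

set.seed(1)  # arbitrary seed (an assumption) so that repeated runs agree
n <- 1000    # larger than the n = 9 of the text, to stabilise proportions

male   <- list(pat = rep("A", n), mat = rep("a", n))
female <- list(pat = rep("a", n), mat = rep("a", n))
offspring <- cross(male, female)

# Entries of a list can be reached by name, with $ or with [[ ]].
names(offspring)
identical(offspring$pat, offspring[["pat"]])

# The mother is homozygous, so every maternal allele is "a".
all(offspring$mat == "a")

# Paste the two alleles together and tabulate the genotypes.
genotype <- paste(offspring$pat, offspring$mat, sep = "")
table(genotype)
```

With a heterozygous father and a homozygous mother, only the genotypes Aa and aa can occur, and each should account for roughly half of the entries.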