
R Lectures


What is R? R is an open-source environment for statistical computing and visualization.

It is the product of an active movement among statisticians for a powerful, programmable, portable, and open computing environment, applicable to the most complex and sophisticated problems as well as routine analysis, without any restrictions on access or use. For a more detailed description, visit the R Project home page.

Advantages

i. It is completely free under the GNU General Public License
ii. It is freely available over the internet, via a large network of mirror servers
iii. It runs on almost all operating systems
iv. It is the product of international collaboration between top computational statisticians and computer language designers
v. It allows statistical analysis and visualization of unlimited sophistication; you are not restricted to a small set of procedures or options, and because of the contributed packages, you are not limited to one method of accomplishing a given computation or graphical presentation
vi. It can work on objects of unlimited size and complexity with a consistent, logical expression language
vii. It stimulates critical thinking about problem-solving rather than a push-the-button mentality
viii. It is fully programmable, with its own sophisticated computer language
ix. It can exchange data in MS-Excel, text, fixed and delimited formats (e.g. CSV), so that existing datasets are easily imported, and results computed in R are easily exported

Starting R

Stopping R

To stop an R session, type q() at the command prompt, or select the File | Exit menu item in the Windows GUI.

Setting up a workspace

An important concept in R is the workspace, which contains the local data and procedures for a given statistics project. Under Windows this is usually determined by the folder from which R is started. The easiest way to set up a statistics project under Windows is:

a. Create a shortcut to RGui.exe on your desktop;
b. Modify its properties so that it starts in your working directory rather than the default.

The command prompt

You perform most actions in R by typing commands in a console window, in response to a command prompt, which usually looks like this:

>

The > is a prompt symbol displayed by R, not typed by you. This is R's way of telling you it is ready for you to type a command. Type your command and press the ENTER or RETURN key; R will execute your command. If your entry is not a complete R command, R will prompt you to complete it with the continuation prompt symbol:

+

R will accept the command once it is syntactically complete; in particular, the parentheses must balance. Once the command is complete, R presents its results in the same console window, directly under your command. If you want to abort the current command (i.e. not complete it), press the ESC key.

Saving your analysis steps

The File | Save to file menu command will save the entire console contents, i.e. both your commands and R's responses, to a text file, which you can later review and edit with any text editor.

Saving your graphs

In the Windows version of R, you can save any graphical output for insertion into documents or printing. If necessary, bring the graphics window to the front (e.g. click on its title bar), select the menu command File | Save as, and then choose one of the formats. The most useful format for insertion into MS-Word documents is Metafile.

WRITING AND RUNNING SCRIPTS

After you have worked out an analysis by typing a sequence of commands, you will probably want to re-run them on edited data, subsets etc. This is easy to do by means of scripts, which are simply lists of commands in a file, written exactly as you would type them at the console. They are run with the source function. A useful feature of scripts is that you can include comments (lines that begin with the # character) to explain to yourself or others what the script is doing and why. Here is a step-by-step description of how to create and run a simple script which draws two random samples from a normal distribution and computes their correlation:

1) Open a pure-text editor (one that does not insert any formatting); for example, under MS-Windows you can use Notepad or Wordpad;
2) Type in the following lines (R is case-sensitive, so type the names exactly as shown):

x <- rnorm(100, 180, 20)
y <- rnorm(100, 180, 20)
plot(x, y)
cor.test(x, y)

3) Save the file with the name test.R, in a convenient directory
4) Start R (if it is not already running)
5) In R, select the menu command File | Source R code
6) In the file selection dialog, locate the file test.R that you just saved and select it; R will run the script
7) Examine the output.

You can also source the file directly from the command line. Instead of steps 5 and 6 above, just type source("test.R") at the R command prompt, provided test.R is in the current working directory.
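The steps above can also be tried without leaving the console. The following is only a sketch: it writes a small script to a temporary file and sources it. The temporary path, the fixed seed and the variable name r are assumptions made for illustration, and cor is used in place of cor.test to keep the output short.

```r
# Write a short script to a temporary file (hypothetical location), then run it.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "# Draw two independent normal samples and compute their correlation",
  "set.seed(1)  # assumption: fixed seed so that repeated runs agree",
  "x <- rnorm(100, 180, 20)",
  "y <- rnorm(100, 180, 20)",
  "r <- cor(x, y)"
), script_path)

source(script_path)  # runs every line of the script in order
print(r)             # the correlation computed by the script
```

Because the two samples are drawn independently, the printed correlation should be close to zero.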

Sample datasets

R comes with many example datasets (part of the default datasets package), and most add-in packages also include example datasets. Some of the datasets are classics in a particular application field; an example is the iris dataset, used extensively by R. A. Fisher to illustrate multivariate methods. To see the list of available datasets, use the data function with an empty argument:

> data()

To see the datasets in a single add-in package, use the package= argument:

> data(package="datasets")

To load one of the datasets, use its name as the argument to the data function:

> data(iris)

The data frame representing this dataset is now in the workspace.
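Once a dataset is loaded, a few standard functions give a first impression of it. This is a minimal sketch using iris; the choice of functions is only a suggestion.

```r
data(iris)            # load the iris data frame into the workspace
dim(iris)             # number of rows and columns
names(iris)           # column names
head(iris, 3)         # first three rows
levels(iris$Species)  # the categories of the Species column
```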

R Objects

The four most frequently used types of data objects in R are vectors, matrices, data frames and lists. A vector represents a set of elements of the same mode, whether they are logical, numeric (integer or double), complex, character or lists. A matrix is a set of elements appearing in rows and columns, where the elements are of the same mode, whether they are logical, numeric (integer or double), complex or character. A data frame is similar to a matrix object, but the columns can be of different modes. A list is a generalization of a vector and represents a collection of data objects.
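As a rough illustration of the four object types, the sketch below builds one of each. All object names and values here are made up for the example.

```r
v  <- c(1.5, 2, 3.25)                     # vector: every element is numeric
m  <- matrix(1:6, nrow = 2)               # matrix: 2 rows, 3 columns, one mode
df <- data.frame(id = 1:3,                # data frame: columns of different modes
                 town = c("Bolga", "Wa", "Tamale"))
l  <- list(v, m, df)                      # list: a collection of arbitrary objects

mode(v)      # mode shared by all elements of the vector
dim(m)       # rows and columns of the matrix
str(df)      # one line per column, showing each column's type
length(l)    # number of objects held in the list
```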

Creating vectors

c Function

The simplest way to create a vector is through the concatenation function, c. This function binds elements together, whether they are of character form, numeric or logical. Some examples of the use of the concatenation function are shown in the following script.

> value.num <- c(3, 4, 2, 6, 20)
> value.char <- c("Bolga", "Wa", "Tamale")
> value.logical <- c(FALSE, FALSE, TRUE, TRUE)

rep and seq Functions

The rep function replicates elements of vectors. For example,

> value <- rep("BEN", 10)

replicates "BEN" 10 times to create a vector called value, the contents of which are displayed after typing the vector name, value, and pressing the return key. The seq function creates a regular sequence of values to form a vector. The following script shows some simple examples of creating vectors using this function.

> seq(from=2, to=10, by=2)
> seq(from=2, to=10, length=5)
> 1:20
> seq(along=value)

c, rep and seq Functions

As well as using each of these functions individually to create a vector, the functions can be used in combination. For example,

> value <- c(1, 3, 4, rep(3,4), seq(from=1, to=6, by=2))

uses the rep and seq functions inside the concatenation function to create the vector value. It is important to remember that elements of a vector are expected to be of the same mode. So an expression

> c(1:3, "a", "b", "c")

does not keep the numbers as numbers: R silently converts every element to character so that all elements share one mode. Omitting the quotes, as in c(1:3, a, b, c), produces an error message if no object named a exists.
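The combined call and the same-mode rule can be checked directly. In this sketch the name mixed is introduced only for illustration. Note that mixing numbers with quoted strings does not stop with an error; R converts the numbers to character.

```r
value <- c(1, 3, 4, rep(3, 4), seq(from = 1, to = 6, by = 2))
value          # 1 3 4 3 3 3 3 1 3 5
length(value)  # 10

mixed <- c(1:3, "a", "b", "c")  # numbers and strings in one vector
mode(mixed)                     # "character": the numbers were converted
mixed                           # "1" "2" "3" "a" "b" "c"
```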

Scan function

The scan function is used to enter data at the terminal. This is useful for small datasets but tiresome for large ones. It is also an easy way to bring in data, as you can paste in unformatted data values from other documents.

> x <- scan()
1: 2
2: 19
3: 3
4:
Read 3 items

To stop the scan, leave the next entry blank and press enter; the statement Read 3 items then appears, as in this case.
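Because scan() waits for keyboard input, it cannot be run unattended. A non-interactive sketch of the same idea uses the text argument of scan, which reads the values from a character string instead of the terminal.

```r
# Read the same three values from a string rather than from the keyboard.
x <- scan(text = "2 19 3")
x           # 2 19 3
length(x)   # 3
```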

Creating matrices

dim and matrix functions

The dim function can be used to convert a vector to a matrix

> value <- rnorm(6)
> dim(value) <- c(2,3)
> value

This piece of script fills the matrix column by column. To convert back to a vector we simply use the dim function again.

> dim(value) <- NULL

Alternatively, we can use the matrix function to convert a vector to a matrix

> matrix(value, 2, 3)

If we want to fill by rows instead then we can use the following script

> matrix(value, 2, 3, byrow=TRUE)
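The difference between filling by columns and filling by rows is easier to see with known values than with random ones. This sketch swaps rnorm(6) for 1:6 purely for readability; the names as_matrix, by_col and by_row are illustrative.

```r
value <- 1:6
dim(value) <- c(2, 3)       # reshape in place: filled column by column
as_matrix <- value          # keep a copy of the matrix form
dim(value) <- NULL          # back to a plain vector

by_col <- matrix(1:6, 2, 3)                # first row is 1 3 5
by_row <- matrix(1:6, 2, 3, byrow = TRUE)  # first row is 1 2 3
by_col
by_row
```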

rbind and cbind Functions

To bind a row onto an already existing matrix, the rbind function can be used

> value <- matrix(rnorm(6), 2, 3, byrow=TRUE)
> value2 <- rbind(value, c(1,1,2))

To bind a column onto an already existing matrix, the cbind function can be used

> value3 <- cbind(value2, c(2,2,3))
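A quick way to confirm what rbind and cbind did is to check the dimensions after each step. The sketch below uses fixed values instead of rnorm(6), and the object names are hypothetical.

```r
base     <- matrix(1:6, 2, 3, byrow = TRUE)  # 2 rows, 3 columns
with_row <- rbind(base, c(1, 1, 2))          # adds a third row
with_col <- cbind(with_row, c(2, 2, 3))      # adds a fourth column

dim(base)      # 2 3
dim(with_row)  # 3 3
dim(with_col)  # 3 4
```

The vector being bound must have as many elements as the matrix has columns (for rbind) or rows (for cbind); otherwise R recycles or truncates it and may issue a warning.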

data.frame function

The function data.frame converts a matrix or collection of vectors into a data frame.

> value3 <- data.frame(value3)

Another example joins two columns of data together.

> value4 <- data.frame(rnorm(3), runif(3))

Default row and column names are assigned to a data frame when it is created, but they may be changed using the names and row.names functions. To view the row and column names of a data frame:

> names(value3)
> row.names(value3)

Alternatively, labels can be assigned by doing the following

> names(value3) <- c("C1", "C2", "C3", "C4")
> row.names(value3) <- c("R1", "R2", "R3")
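Once rows and columns carry labels, those labels can be used to pick out individual cells. A small sketch with made-up height and weight values follows; the data frame df is hypothetical.

```r
df <- data.frame(height = c(1.6, 1.7, 1.8), weight = c(60, 72, 80))
names(df)       # column names supplied at creation
row.names(df)   # default row names "1" "2" "3"

names(df) <- c("C1", "C2")
row.names(df) <- c("R1", "R2", "R3")
df["R2", "C2"]  # cell addressed by row and column label: 72
```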

Factors

Some variables are categorical: they can take only a defined set of values. These variables are called factors and they are of two types: unordered (nominal) and ordered (ordinal). Factors are defined with the factor and ordered functions. They may be converted from existing character or numeric vectors with the as.factor and as.ordered functions; these are often used after data import if read.table or a related function could not correctly identify factors. The levels of an existing factor are extracted with the levels function. For example, suppose we have given three tests to each of three students and we want to rank the students. We might enter the data frame as follows:

> student <- rep(1:3, 3)
> score <- c(9,6.5,8,8,7.5,9.5,8,7)
> tests <- data.frame(cbind(student, score))
> str(tests)

The score vector must supply one value per row, nine in all; given fewer, cbind recycles the vector and issues a warning. We have the data, but the student is just listed by a number; the table function won't work, and if we try to predict the score from the student using the lm function, we get nonsense:

> lm(score ~ student, data=tests)

The problem is that the student is considered as a continuous variable when in fact it is a factor. We do much better if we make the appropriate conversion:

> tests$student <- as.factor(tests$student)
> str(tests)
> lm(score ~ student, data=tests)

Factor names can be any string, so to be more descriptive we could have assigned names with the labels argument to the factor function:

> tests$student <- factor(tests$student, labels=c("Ben", "Apam", "Jnr"))
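The effect of the conversion shows up in the number of coefficients that lm estimates. The sketch below uses nine made-up scores (not the values from the text) so that each student has exactly three results; the names fit_numeric and fit_factor are illustrative.

```r
student <- rep(1:3, 3)
score   <- c(7, 6, 8, 8, 7, 9, 9, 5, 7)  # illustrative scores, invented for this sketch
tests   <- data.frame(student, score)

# Student as a number: lm fits one slope, as if student 3 were "more" than student 1.
fit_numeric <- lm(score ~ student, data = tests)
coef(fit_numeric)   # intercept and a single slope

# Student as a factor: lm estimates a separate effect per student.
tests$student <- factor(tests$student, labels = c("Ben", "Apam", "Jnr"))
fit_factor <- lm(score ~ student, data = tests)
coef(fit_factor)    # intercept plus one contrast for each remaining student

table(tests$student)                      # three tests per student
tapply(tests$score, tests$student, mean)  # mean score per student
```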

Missing Data

Anyone working with empirical data sooner or later deals with a dataset that has missing values. R represents missing values with the special value NA. You should encode missing data in R as NA and convert any imported data that marks missing values in other ways to NA as well, assuming you are not using a numerical convention (such as entering 0s).

> missingdata <- c(1, 3, NA, 2, 1)

If computations are performed on data objects with NA values, the NA value is carried through to the result. If you have a computation problem with an element of a data object and are not sure whether that is a missing value, the function is.na can be used to determine whether the element in question is an NA value.

> is.na(missingdata[3])
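How NA propagates, and one common way around it, can be seen with the same vector. The na.rm argument used here is a standard option of mean and several other summary functions; it is not discussed in the text above.

```r
missingdata <- c(1, 3, NA, 2, 1)

mean(missingdata)                # NA: the missing value carries through
is.na(missingdata)               # TRUE only in the third position
which(is.na(missingdata))        # 3
mean(missingdata, na.rm = TRUE)  # 1.75, computed from the four known values
```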

Listing and deleting objects in memory

When working in R and using many data objects, you may lose track of the names of the objects you have already created. Two functions, ls() and objects(), do the same job: they list the objects in the current workspace.

> ls()
> objects()

Sometimes you will want to remove specific objects from the workspace. This is easily accomplished with the remove function, rm(object), with the object name as the argument.
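A short sketch of listing and removing objects follows. The object names alpha_tmp and beta_tmp are invented for the example.

```r
alpha_tmp <- 1:3
beta_tmp  <- "text"

"alpha_tmp" %in% ls()   # TRUE: the object is listed
rm(alpha_tmp)           # remove it from the workspace
"alpha_tmp" %in% ls()   # FALSE: it is gone
"beta_tmp" %in% ls()    # TRUE: other objects are untouched
```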

Editing data objects

R has a built-in data editor which you can use to edit existing data objects. This can be particularly helpful for correcting entries in imported files, or when you have many entries to edit rather than just one. The data editor has a spreadsheet-like interface, but has no spreadsheet functionality. To use the editor, call the data.entry function with the variable being edited as the argument.

> data.entry(missingdata)

Importing files

The first thing to do before importing any file is to tell R what directory your file is in.

> setwd(dir)

where dir is a character string giving the path to that directory.

Importing using the function read.*()

The most convenient way to import data into R is to use the read functions, notably read.table(). This function reads a flat data file in ASCII text format. In Notepad you can simply save such a file as a regular text file (extension *.txt). Many spreadsheet programs can save data in this plain-text format. Using read.table with arguments of file name and header=TRUE (to capture column headings), such a file can easily be read into R as a data frame object:

> example2 <- read.table("example1.txt", header=TRUE)

There are some additional read function variants, notably read.csv(), which reads comma-delimited files, a format most spreadsheets can save.
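To try the read functions without preparing a file by hand, the sketch below first writes a tiny table to a temporary file and then reads it back. The file contents, the temporary paths and the name example3 are assumptions for illustration only.

```r
# Create a small whitespace-separated text file with a header line.
txt_path <- tempfile(fileext = ".txt")
writeLines(c("id score", "1 9", "2 6.5", "3 8"), txt_path)

example2 <- read.table(txt_path, header = TRUE)  # header = TRUE keeps column names
str(example2)

# Round trip through a comma-delimited file.
csv_path <- tempfile(fileext = ".csv")
write.csv(example2, csv_path, row.names = FALSE)
example3 <- read.csv(csv_path)
```

Unlike read.table, read.csv assumes a header line and a comma separator by default, so no extra arguments are needed here.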

Accessing Data

There are several ways to extract data from a vector. Here is a summary using both slicing and extraction by a logical vector. Suppose x is the data vector, for example x = 1:10. The commands below use i = 2 and k = 5.

Task                                    Command
ith element                             x[2]
All but ith element                     x[-2]
First k elements                        x[1:5]
Last k elements                         x[(length(x)-4):length(x)]
Specific elements                       x[c(1,3,5)]
All greater than some value             x[x>3]
Bigger than or less than some values    x[x< -2 | x>2]
Which indices are largest               which(x == max(x))

Exploratory data analysis

This section covers ways to quickly look at and summarize a dataset using R.

Data summary functions in R

Function Name    Task Performed
sum(x)           Sums the elements in x
prod(x)          Product of the elements in x
max(x)           Maximum element in x
min(x)           Minimum element in x
range(x)         Range (min to max) of elements in x
length(x)        Number of elements in x
mean(x)          Mean (average value) of elements in x
median(x)        Median (middle value) of elements in x
var(x)           Variance of elements in x
sd(x)            Standard deviation of elements in x
cor(x,y)         Correlation between x and y
quantile(x,p)    The pth quantile of x
cov(x,y)         Covariance between x and y

The summary() function

The summary function reports several of the descriptive statistics listed above in a single call, and can be very useful when working with large datasets in data frames to present some basic descriptive statistics quickly.

> summary(iris)

This gives some quick quantitative information about the dataset without having to break up the data frame or make multiple function calls.
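The indexing commands and summary functions can be tried together on x = 1:10. This is only a sketch; the variable k and the use of tail as a cross-check are additions for illustration.

```r
x <- 1:10
k <- 5

x[2]                              # second element
x[-2]                             # everything except the second element
x[1:k]                            # first k elements
x[(length(x) - k + 1):length(x)]  # last k elements: 6 7 8 9 10
tail(x, k)                        # the same result using tail
x[x > 3]                          # elements greater than 3
which(x == max(x))                # position of the largest element: 10

sum(x)             # 55
mean(x)            # 5.5
median(x)          # 5.5
range(x)           # 1 10
var(x)             # sample variance
quantile(x, 0.25)  # lower quartile
```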

Univariate Data
