R Programming Course Notes
Xing Su
Contents
Overview and History of R
Coding Standards
Arrays
Factors
Missing Values
Sequence of Numbers
Subsetting
    Vectors
    Lists
    Matrices
    Partial Matching
Logic
Understanding Data
Split-Apply-Combine Functions
    split()
    apply()
    lapply()
    sapply()
    vapply()
    tapply()
    mapply()
    aggregate()
Simulation
    Simulation Examples
Base Graphics
Larger Tables
Control Structures
    if - else
    for
    while
Functions
Scoping
    Scoping Example
Optimization
Debugging
R Profiler
Miscellaneous
History of S
organizations involved: Bell Labs, Insightful, Lucent, Alcatel-Lucent
in 1998, S won the Association for Computing Machinery's Software System Award
History of R
1991
1993
1995
1996
1997
2000
R Features
freedom to
freedom to
freedom to
freedom to
R Drawbacks
40 year-old technology
little built-in support for dynamic/3D graphics
functionality based on consumer demand
objects generally stored in physical memory (limited by hardware)
Coding Standards
[1] at the beginning of the output = which element of the vector is being shown
character
numeric
integer
complex
logical
Numbers
numbers generally treated as numeric objects (double precision real numbers - decimals)
integer objects can be created by adding L to the end of a number (ex. 1L)
Inf = infinity, can be used in calculations
NaN = not a number/undefined
sqrt(value) = square root of value
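The points above can be checked with a short sketch; the variable names are illustrative only.

```r
# numbers are double-precision "numeric" unless suffixed with L
x_num <- 1
x_int <- 1L
class(x_num)             # "numeric"
class(x_int)             # "integer"

# Inf takes part in ordinary arithmetic
inf_val  <- 1 / 0        # Inf
zero_val <- 1 / inf_val  # 0

# NaN marks an undefined result
nan_val <- 0 / 0         # NaN

root <- sqrt(16)         # 4
```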
Variables
variable <- value = assignment of a value to a variable name
Vectors and Lists
atomic vector = contains one data type, most basic object
vector <- c(value1, value2, ...) = creates a vector with specified values
vector1*vector2 = element by element multiplication (rather than matrix multiplication)
if the vectors are of different lengths, the shorter vector is recycled until it matches the length of the longer one
computations on vectors/between vectors (+, -, ==, /, etc.) are done element by element by default
%*% = force matrix multiplication between vectors/matrices
vector("class", n) = creates empty vector of length n and specified class
vector("numeric", 3) = creates 0 0 0
c() = concatenate
T, F = shorthand for TRUE and FALSE
1+0i = complex numbers
explicit coercion
as.numeric(x), as.logical(x), as.character(x), as.complex(x) = convert object from one class to another
nonsensical coercion will result in NA (ex. as.numeric(c("a", "b")))
as.list(data.frame) = converts a data.frame object into a list object
as.character(list) = converts list into a character vector
implicit coercion
matrix/vector can only contain one data type, so when attempting to create a matrix/vector with different classes, every element is implicitly coerced to the same class
least common denominator is the approach used (everything is converted to a class that all values can take, e.g. numbers to characters) and no errors are generated
x <- c(NA, 2, "D") will create a vector of character class
list() = special vector with different classes of elements
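A minimal sketch of recycling, matrix multiplication, and coercion as described above; all object names are made up for illustration.

```r
v1 <- c(1, 2, 3, 4)
v2 <- c(10, 20)

prod_elem <- v1 * v2               # v2 is recycled: 10 40 30 80
prod_mat  <- v1 %*% v1             # inner product, returned as a 1 x 1 matrix

empty_num <- vector("numeric", 3)  # 0 0 0

mixed <- c(NA, 2, "D")             # implicit coercion to character
class(mixed)

# explicit coercion that makes no sense yields NA (with a warning)
bad <- suppressWarnings(as.numeric(c("a", "b")))

# a list keeps each element's own class
lst <- list(1, "a", TRUE)
```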
x <- c(NA, 1, "cx", NA, 2, "dsa")
x
## [1] NA    "1"   "cx"  NA    "2"   "dsa"
# convert to matrix
dim(x) <- c(3, 2)
class(x)
## [1] "matrix"
x
##      [,1] [,2]
## [1,] NA   NA
## [2,] "1"  "2"
## [3,] "cx" "dsa"
every element of the list must correspond in length to the dimensions of the array
dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))
set the names for row, column, and third dimension respectively (2 x 2 x 5 in this case)
dim() function can be used to create arrays from vectors or matrices
x <- rnorm(20); dim(x) <- c(2, 2, 5) = converts a 20 element vector to a 2x2x5 array
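As a sketch of the array notes above, the following uses 1:20 instead of rnorm(20) so the values are predictable; that substitution is an assumption made for illustration.

```r
x <- 1:20
dim(x) <- c(2, 2, 5)   # fill order is column-major

# one name vector per dimension, each matching that dimension's length
dimnames(x) <- list(c("a", "b"), c("c", "d"), c("e", "f", "g", "h", "i"))

first_slice_val <- x["a", "d", "e"]   # row a, column d, first slice
last_val        <- x["b", "d", "i"]   # last element of the array
```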
Factors
factors are used to represent categorical data (integer vector where each value has a label)
2 types: unordered vs ordered
treated specially by lm(), glm()
Factors easier to understand because they self describe (vs. 1 and 2)
factor(c("a", "b"), levels = c("b", "a")) = creates factor with "b" as the first (baseline) level
levels argument can be used to specify the baseline level vs other levels (values not listed in levels become NA)
Note: without explicit specification, R uses alphabetical order
table(factorVar) = how many of each are in the factor
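A small sketch of default versus explicit level ordering; the sizes vector is hypothetical data.

```r
sizes <- c("small", "large", "medium", "small", "large")

f_default <- factor(sizes)   # levels sorted alphabetically
f_custom  <- factor(sizes, levels = c("small", "medium", "large"))

levels(f_default)            # "large" "medium" "small"
levels(f_custom)             # "small" "medium" "large"

counts <- table(f_custom)    # how many of each level
codes  <- as.integer(f_custom)  # underlying integer codes
```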
Missing Values
NaN or NA = missing values
NaN = undefined mathematical operations
NA = any value not available or missing in the statistical sense
any operations with NA results in NA
NA can have different classes potentially (integer, character, etc)
Note: NaN is an NA value, but NA is not NaN
is.na(), is.nan() = test whether each element of the vector is NA or NaN, respectively
Note: cannot compare NA (with ==) as it is not a value but a placeholder for a quantity that is not available
sum(is.na(x)) = sum of a logical vector (TRUE = 1 and FALSE = 0) is effectively the number of TRUEs, here the number of missing values
Removing NA Values
is.na() = creates logical vector where TRUE marks an NA and FALSE marks an existing value
subsetting with the negation of the above result (x[!is.na(x)]) returns only the non NA elements
complete.cases(obj1, obj2) = creates logical vector where TRUE is where both values exist,
and FALSE is where any is NA
can be used on data frames as well
complete.cases(data.frame) = creates logical vectors indicating which observation/row is
good
data.frame[logicalVector, ] = returns all observations with complete data
Imputing Missing Values = replacing missing values with estimates (can be averages from all other
data with the similar conditions)
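The missing-value tools above can be sketched as follows; the vector and data frame are made-up examples.

```r
x <- c(1, NA, 3, NaN, 5)

is.na(x)                     # TRUE for both NA and NaN
is.nan(x)                    # TRUE only for NaN
n_missing <- sum(is.na(x))   # count of missing values
clean     <- x[!is.na(x)]    # keep only existing values

na_cmp <- NA == 1            # comparison with NA gives NA, not FALSE

df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
good        <- complete.cases(df)   # TRUE only for rows with no NA
df_complete <- df[good, ]
```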
Sequence of Numbers
1:20 = creates a sequence of numbers from first number to second number
works in descending order as well
increment = 1
?':' = opens help for an operator (enclose the operator in quotes)
seq(1, 20, by=0.5) = sequence 1 to 20 by increment of .5
length=30 argument can be used instead of by to specify the number of values generated
length(variable) = length of vector/sequence
seq_along(vector) or seq(along.with = vector) = create vector that is same length as another
vector
rep(0, times = 40) = creates a vector with 40 zeroes
rep(c(1, 2), times = 10) = repeats combination of numbers 10 times
rep(c(1, 2), each = 10) = repeats first value 10 times followed by second value 10 times
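A quick sketch of the sequence helpers above. It spells the length argument as length.out, which is the full argument name that length partially matches.

```r
a <- 1:5                             # ascending
d <- 5:1                             # descending
s <- seq(1, 3, by = 0.5)             # 1.0 1.5 2.0 2.5 3.0
s30 <- seq(1, 20, length.out = 30)   # 30 evenly spaced values

idx <- seq_along(c("x", "y", "z"))   # 1 2 3

r1 <- rep(c(1, 2), times = 3)        # 1 2 1 2 1 2
r2 <- rep(c(1, 2), each = 3)         # 1 1 1 2 2 2
```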
Subsetting
R uses one-based indexing (starts counting at 1)
x[0] returns numeric(0), not error
x[3000] returns NA (not out of bounds/error)
[] = always returns object of same class, can select more than one element of an object (ex. [1:2])
[[]] = can extract one element from list or data frame, returned object not necessarily list/dataframe
$ = can extract elements from list/dataframe that have names associated with it, not necessarily same
class
Vectors
x[1:10] = first 10 elements of vector x
x[is.na(x)] = returns all NA elements
x[!is.na(x)] = returns all non NA elements
x > 0 = returns a logical vector comparing all elements to 0 (TRUE/FALSE for all values except NA, and NA for NA elements, since NA is a placeholder)
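A minimal sketch of vector subsetting with NA present; the vector x is a made-up example.

```r
x <- c(5, NA, -2, 8)

first_two <- x[1:2]
empty     <- x[0]            # numeric(0), not an error
beyond    <- x[3000]         # NA, not an out-of-bounds error
non_na    <- x[!is.na(x)]

positive  <- x > 0           # NA stays NA in the comparison
pos_vals  <- x[!is.na(x) & x > 0]   # combine both conditions to drop NA
```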
Lists
x <- list(foo = 1:4, bar = 0.6)
x[1] or x["foo"] = returns a list containing the element foo
x[[2]] or x[["bar"]] or x$bar = returns the content of the second element of the list (in this case a vector without a name attribute)
Note: $ can't extract multiple elements
x[c(1, 3)] = extract multiple elements of a list
x[[name]] = extract using a variable that holds the element name, whereas $ must match the literal name of the element
x[[c(1, 3)]] or x[[1]][[3]] = extracts nested elements of a list: the third element of the first object extracted from the list
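The list operators above can be compared side by side. This sketch adds a hypothetical third element, baz, to the list so that multi-element and nested extraction have something to act on.

```r
x <- list(foo = 1:4, bar = 0.6, baz = list(10, 20, 30))

sub_list <- x["foo"]       # still a list (of length 1)
foo_vec  <- x[["foo"]]     # the integer vector itself

name   <- "bar"
by_var <- x[[name]]        # looks up the value of `name`
lit    <- x$name           # looks for an element literally called "name"

two    <- x[c(1, 3)]       # several elements at once
nested <- x[[c(3, 2)]]     # same as x[[3]][[2]]
```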
Matrices
x[1, 2] = extract the (row, column) element
x[,2] or x[1,] = extract the entire column/row
x[ , 11:17] = subset the x data.frame with all rows, but only 11 to 17 columns
when an element from the matrix is retrieved, a vector is returned
behavior can be turned off (force return a matrix) by adding drop = FALSE
x[1, 2, drop = F]
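A short sketch of matrix subsetting and the effect of drop = FALSE, using a small made-up matrix.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # columns are 1:2, 3:4, 5:6

elem    <- m[1, 2]                     # single element, returned as a vector
col2    <- m[, 2]                      # whole column
row1    <- m[1, ]                      # whole row

elem_m  <- m[1, 2, drop = FALSE]       # stays a 1 x 1 matrix
row1_m  <- m[1, , drop = FALSE]        # stays a 1 x 3 matrix
```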
Partial Matching
works with [[]] and $
$ automatically partial matches the name (x$a)
[[]] can partial match by adding exact = FALSE
x[["a", exact = FALSE]]
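A minimal sketch of partial matching; the element name aardvark is just an example.

```r
x <- list(aardvark = 1:5)

dollar  <- x$a                       # $ partially matches "aardvark"
strict  <- x[["a"]]                  # [[ ]] needs an exact name by default
relaxed <- x[["a", exact = FALSE]]   # partial matching switched on
```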
Logic
Understanding Data
use class(), dim(), nrow(), ncol(), names() to understand dataset
object.size(data.frame) = returns how much space the dataset is occupying in memory
head(data.frame, 10), tail(data.frame, 10) = returns first/last 10 rows of data; default = 6
summary() = provides different output for each variable, depending on class
for numerical variables, displays min, max, mean, median, etc.
for categorical (factor) variables, displays number of times each value occurs
table(data.frame$variable) = table of all values of the variable, and how many observations there
are for each
Note: mean for variables that only have values 1 and 0 = proportion of success
str(data.frame) = structure of data: provides the data class, number of observations and variables, and the name and class of each variable with a preview of its contents
compactly displays the internal structure of an R object
answers the question "what's in this object?"
well-suited to compactly display the contents of lists
View(data.frame) = opens a viewer showing the content of the data frame (note the capital V)
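The inspection functions above might be used together like this; the data frame df and its columns are invented for the example.

```r
df <- data.frame(
  id      = 1:8,
  group   = factor(rep(c("a", "b"), each = 4)),
  success = c(1, 0, 1, 1, 0, 0, 1, 0)
)

class(df); dim(df); names(df)
top    <- head(df, 3)            # first 3 rows instead of the default 6
bottom <- tail(df, 3)            # last 3 rows

summary(df)                      # per-column summary, depends on class
grp_counts <- table(df$group)    # observations per group
p_success  <- mean(df$success)   # mean of a 0/1 variable = proportion of 1s

str(df)                          # compact structure overview
```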
Split-Apply-Combine Functions
loop functions = convenient ways of implementing the Split-Apply-Combine strategy for data analysis
split()
takes a vector/object and splits it into groups determined by a factor or list of factors
split(x, f, drop = FALSE)
x = vector/list/data frame
f = factor/list of factors
drop = whether empty factor levels should be dropped
interaction(gl(2, 5), gl(5, 2)) = combines two factors into one with levels 1.1, 1.2, . . . 2.5
gl(n, m) = generate factor levels function
n = number of levels
m = number of repetitions
split() can do this directly by passing list(f1, f2) as the f argument
split(data, list(gl(2, 5), gl(5, 2))) = splits the data into 1.1, 1.2, . . . 2.5 levels
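A sketch of split() with one and two factors, using 1:10 as stand-in data. It also shows what drop = TRUE does to the empty factor combinations.

```r
x  <- 1:10
f1 <- gl(2, 5)    # 1 1 1 1 1 2 2 2 2 2
f2 <- gl(5, 2)    # 1 1 2 2 3 3 4 4 5 5

by_f1 <- split(x, f1)                 # two groups

n_combined <- nlevels(interaction(f1, f2))   # 2 * 5 combined levels

by_both      <- split(x, list(f1, f2))               # keeps empty groups
by_both_drop <- split(x, list(f1, f2), drop = TRUE)  # only observed groups
```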
apply()
apply(x, MARGIN, FUN, ...) = evaluates a function over the margins (rows or columns) of an array
x = array
MARGIN = 2 (column), 1 (row)
FUN = function
... = other arguments that need to be passed to other functions
examples
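A few illustrative calls on a small made-up matrix, including an extra argument passed to FUN through the ... slot.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # rows: 1 3 5 and 2 4 6

row_sums  <- apply(m, 1, sum)          # MARGIN = 1 works over rows
col_means <- apply(m, 2, mean)         # MARGIN = 2 works over columns

# probs is forwarded to quantile(); one column of results per matrix column
col_q <- apply(m, 2, quantile, probs = c(0.25, 0.75))
```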
lapply()
loops over a list, evaluates a function on each element, and always returns a list
Note: since input must be a list, it is possible that conversion may be needed
lapply(x, FUN, ...) = takes list/vector as input, applies a function to each element of the list, returns a list of the same length
x = list (if not a list, it will be coerced into one through as.list; if that is not possible, an error occurs)
data.frame are treated as collections of lists and can be used here
FUN = function (without parentheses)
anonymous functions are acceptable here as well (e.g. function(x) x[,1])
... = other/additional arguments to be passed to FUN (e.g. min, max for runif())
example
lapply(data.frame, class) = the data.frame is a list of vectors, the class value for each vector
is returned in a list (name of function, class, is without parentheses)
lapply(values, function(elem) elem[2]) = example of an anonymous function
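The three uses of lapply() described above, sketched with made-up inputs (values, df); stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
values <- list(a = c(1, 2, 3), b = c(10, 20, 30))

# anonymous function: second element of each list entry
second <- lapply(values, function(elem) elem[2])

# a data frame is a list of columns
df <- data.frame(n = 1:3, s = c("x", "y", "z"), stringsAsFactors = FALSE)
classes <- lapply(df, class)

# extra arguments (min, max) are forwarded to runif()
set.seed(1)
draws <- lapply(1:3, runif, min = 0, max = 10)
```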
sapply()
performs same function as lapply() except it simplifies the result
if result is of length 1 in every element, sapply returns vector
if result is vectors of the same length (>1) for each element, sapply returns matrix
if not possible to simplify, sapply returns a list (same as lapply())
vapply()
safer version of sapply in that it allows you to specify the format of the result
vapply(flags, class, character(1)) = returns the class of values in the flags variable in the
form of character of length 1 (1 value)
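A sketch contrasting sapply()'s simplification rules with vapply()'s explicit result format. The tryCatch() wrapper is an addition for the demo, used only to capture the error vapply() raises on a mismatch.

```r
x <- list(a = 1:3, b = 4:6)

means  <- sapply(x, mean)     # length-1 results -> named vector
ranges <- sapply(x, range)    # equal-length results -> matrix
uneven <- sapply(list(1:2, 1:3), function(v) v * 2)   # cannot simplify -> list

means_safe <- vapply(x, mean, numeric(1))   # result format stated up front

# range() returns 2 values, which contradicts numeric(1), so vapply stops
mismatch <- tryCatch(vapply(x, range, numeric(1)),
                     error = function(e) "error")
```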
tapply()
split data into groups, and apply the function to data within each subgroup
tapply(data, INDEX, FUN, ..., simplify = TRUE) = apply a function over subsets of a vector
data = vector
INDEX = factor/list of factors
FUN = function
... = arguments to be passed to function
simplify = whether to simplify the result
example
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10); tapply(x, f, mean) = returns the mean of each group (f level) of x data
mapply()
multivariate apply, applies a function in parallel over a set of arguments
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE)
FUN = function
... = arguments to apply over
MoreArgs = list of other arguments to FUN
SIMPLIFY = whether the result should be simplified
example
mapply(rep, 1:4, 4:1)
## [[1]]
## [1] 1 1 1 1
##
## [[2]]
## [1] 2 2 2
##
## [[3]]
## [1] 3 3
##
## [[4]]
## [1] 4
aggregate()
aggregate computes summary statistics of data subsets (similar to multiple tapply at the same time)
aggregate(list(name = dataToCompute), list(name = factorVar1,name = factorVar2),
function, na.rm = TRUE)
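A minimal sketch of the aggregate() call pattern above; the data frame d and its column names are hypothetical.

```r
d <- data.frame(
  score = c(10, 20, 30, 40, NA, 60),
  sex   = c("f", "f", "m", "m", "f", "m"),
  grp   = c("x", "y", "x", "y", "x", "y")
)

# mean score for every sex/grp combination; na.rm is forwarded to mean()
res <- aggregate(list(mean_score = d$score),
                 list(sex = d$sex, grp = d$grp),
                 mean, na.rm = TRUE)
res
```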
Simulation
sample(values, n, replace = FALSE) = generate random samples
values = values to sample from
n = number of values generated
replace = with or without replacement
sample(1:6, 4, replace = TRUE, prob=c(.2, .2...)) = choose four values from the range specified with replacement (same numbers can show up more than once), with probabilities specified
sample(vector) = can be used to permute/rearrange elements of a vector
sample(c(y, z), 100) = select 100 random elements from combination of values y and z
sample(10) = random permutation of the integers 1 to 10 (no repeats)
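Each probability distribution usually has 4 functions associated with it:
r*** function (for "random") = random number generation
d*** function (for "density") = evaluates the probability density
p*** function (for "probability") = evaluates the cumulative distribution function
q*** function (for "quantile") = evaluates the quantile function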
If $\Phi$ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(p)$.
set.seed() = sets seed for random number generator to ensure that the same data/analysis can be reproduced
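A short sketch of reproducible sampling and of the p/q/d functions for the normal distribution; the seed value 42 is arbitrary.

```r
# same seed, same draws
set.seed(42)
a <- sample(1:6, 4, replace = TRUE)
set.seed(42)
b <- sample(1:6, 4, replace = TRUE)
same_draws <- identical(a, b)

perm <- sample(10)             # permutation of 1..10

half     <- pnorm(0)           # CDF at the mean
upper    <- qnorm(0.975)       # roughly 1.96
round_tr <- qnorm(pnorm(1.3))  # quantile function inverts the CDF
peak     <- dnorm(0)           # density at the mean, 1 / sqrt(2 * pi)
```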
Simulation Examples
rbinom(1, size = 100, prob = 0.7) = returns a binomial random variable that represents the number of successes in a given number of independent trials
1 = number of observations
size = 100 = number of independent trials behind each observation
prob = 0.7 = probability of success
rnorm(n, mean = m, sd = s) = generates n random samples from the normal distribution with mean m and standard deviation s (mean = 0, sd = 1, i.e. the standard normal, by default)
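As a rough check of the two generators above, the sample means should land near the theoretical values; the seed and sample sizes are arbitrary choices.

```r
set.seed(10)

one  <- rbinom(1, size = 100, prob = 0.7)      # a single count out of 100 trials
many <- rbinom(1000, size = 100, prob = 0.7)   # 1000 such counts
mean_many <- mean(many)                        # expected value is 100 * 0.7

z <- rnorm(1000, mean = 5, sd = 2)             # non-standard normal
mean_z <- mean(z)
```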
Base Graphics
data(set) = load data
plot(data) = R plots the data as best as it can
x = variable, x axis
y = variable
xlab, ylab = corresponding labels
main, sub = title, subtitle
col = 2 or col = "red" = color
pch = 2 = different symbols for points
xlim, ylim = c(v1, v2) = restrict range of plot
boxplot(x ~ y, data = d) = creates boxplots of x grouped by y using the data.frame provided
hist(x, breaks) = plots histogram of the data
breaks = 100 = split data into roughly 100 bins
read.table(), read.csv() = most common, read text files (rows, col) return data frame
readLines() = read lines of text, returns character vector
source(file) = read R code
dget() = read R code files (R objects that have been deparsed)
load(), unserialize() = read binary objects
writing data
write.table(), writeLines(), dump(), dput(), save(), serialize()
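A round-trip sketch of a few of the read/write pairs above, using temporary files. write.csv() is the CSV convenience wrapper around write.table(); the data frame is made up.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# tabular text: write.csv() / read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_back <- read.csv(csv_path, stringsAsFactors = FALSE)

# plain lines of text: writeLines() / readLines()
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# deparsed R object: dput() / dget()
obj_path <- tempfile(fileext = ".R")
dput(df, obj_path)
df_dget <- dget(obj_path)
```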
read.table() arguments:
Control Structures
Common structures are
Note: Control structures are primarily useful for writing programs; for command-line interactive work,
the apply functions are more useful
if - else
# basic structure
if(<condition>) {
    ## do something
} else {
    ## do something else
}
# if tree
if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}
y <- if(x > 3) {10} else {0} = if-else returns a value, so the result can be assigned directly
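A runnable version of the templates above; the function name classify and its thresholds are invented for the example.

```r
# if / else if / else chain; the value of the taken branch is returned
classify <- function(x) {
  if (x > 3) {
    "big"
  } else if (x > 0) {
    "small"
  } else {
    "non-positive"
  }
}

# if-else used as an expression on the right-hand side of an assignment
x <- 5
y <- if (x > 3) { 10 } else { 0 }
```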
for
# basic structure
for(i in 1:10) {
    # print(i)
}
# nested for loops
x <- matrix(1:6, 2, 3)
for(i in seq_len(nrow(x))) {
    for(j in seq_len(ncol(x))) {
        # print(x[i, j])
    }
}
for(letter in x) = loop through letter in character vector
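The loop forms above, sketched with accumulators instead of print() so the results can be inspected afterwards; variable names are illustrative.

```r
# nested loops over a matrix
x <- matrix(1:6, 2, 3)
total <- 0
for (i in seq_len(nrow(x))) {
  for (j in seq_len(ncol(x))) {
    total <- total + x[i, j]
  }
}

# looping directly over the elements of a character vector
seen <- character(0)
for (letter in c("a", "b", "c")) {
  seen <- c(seen, toupper(letter))
}

# seq_len(0) is empty, so the body never runs (1:0 would run twice)
count <- 0
for (i in seq_len(0)) {
  count <- count + 1
}
```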
Functions
name <- function(arg1, arg2, ...){ }
structure
f <- function(<arguments>) {
## Do something interesting
}
functions are first-class objects and can be treated like other objects (e.g. passed into other functions)
functions can be nested, so that you can define a function inside of another function
functions have named arguments (e.g. x = mydata), which can be given default values
sd(x = mydata) (matching by name)
formal arguments = arguments included in the function definition
formals() = returns all formal arguments
not every function call specifies all arguments; some can be missing and may have default values
args() = returns all arguments you can specify
arguments can be supplied in a different order when named (unnamed ones are matched by position), but shuffling the order is not recommended
argument matching order: exact, then partial, then positional
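A small sketch of defaults, argument matching, and passing functions as values. scale_by and apply_twice are hypothetical helpers written for this example.

```r
scale_by <- function(x, mult = 2, offset = 0) {
  x * mult + offset
}

by_default <- scale_by(1:3)                   # defaults used for mult, offset
by_name    <- scale_by(offset = 1, x = 1:3)   # exact names, any order
by_partial <- scale_by(1:3, off = 1)          # "off" partially matches offset
by_pos     <- scale_by(1:3, 3)                # second position is mult

arg_names <- names(formals(scale_by))

# functions are first-class: f is itself an argument
apply_twice <- function(f, x) f(f(x))
twice <- apply_twice(sqrt, 16)
```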
Scoping
scoping rules determine how a value is associated with a free variable in a function
free variables = variables not explicitly defined in the function (neither arguments nor local variables defined inside the function)
R uses lexical/static scoping
common alternative = dynamic scoping
lexical scoping = values of free vars are searched in the environment in which the function is
defined
environment = collection of symbol/value pairs (x = 3.14)
each package has its own environment
only environment without parent environment is the empty environment
closure/function closure = function + associated environment
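search order for free variable
1. environment where the function is defined
2. parent environment of that environment
3. each further parent environment in turn
4. top-level environment (global environment or the namespace of a package), followed by the rest of the search list
5. empty environment, at which point an error is produced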
when a function/variable is called, R searches through the following list and uses the first match
1. .GlobalEnv
2. package:stats
3. package:graphics
4. package:grDevices
5. package:utils
6. package:datasets
7. package:methods
8. Autoloads
9. package:base
order matters
.GlobalEnv = everything defined in the current workspace
any package that gets loaded with library() gets put in position 2 of the above search list
namespaces are separate for functions and non-functions
possible for object c and function c to coexist
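A minimal sketch of lexical scoping with a free variable, plus a look at the ends of the search list. The functions f and g are illustrative.

```r
y <- 10

g <- function(x) {
  x * y          # y is a free variable: looked up where g was defined
}

f <- function(x) {
  y <- 2         # local to f; g does not see it under lexical scoping
  y^2 + g(x)
}

result <- f(3)   # 2^2 + 3 * 10; dynamic scoping would give 2^2 + 3 * 2

pos <- search()  # the search list, global environment first, base last
```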
Scoping Example
make.power <- function(n) {
    pow <- function(x) {
        x^n
    }
    pow
}
cube <- make.power(3)     # n = 3 is stored in the closure
square <- make.power(2)   # n = 2 is stored in the closure
cube(3)                   # defines x = 3
## [1] 27
square(3)                 # defines x = 3
## [1] 9
# returns the free variables in the function
ls(environment(cube))
## [1] "n"   "pow"
Optimization
optimization routines in R (optim, nlm, optimize) require you to pass a function whose argument is a
vector of parameters
Note: these functions minimize, so use the negative constructs to maximize a normal likelihood
constructor functions = functions to be fed into the optimization routines
example
# write constructor function
make.NegLogLik <- function(data, fixed=c(FALSE,FALSE)) {
    params <- fixed
    function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        a <- -0.5*length(data)*log(2*pi*sigma^2)
        b <- -0.5*sum((data-mu)^2) / (sigma^2)
        -(a + b)
    }
}
# initialize seed and print function
set.seed(1); normals <- rnorm(100, 1, 2)
nLL <- make.NegLogLik(normals); nLL
## function(p) {
##     params[!fixed] <- p
##     mu <- params[1]
##     sigma <- params[2]
##     a <- -0.5*length(data)*log(2*pi*sigma^2)
##     b <- -0.5*sum((data-mu)^2) / (sigma^2)
##     -(a + b)
## }
## <environment: 0x7fda426462a8>
# Estimating Parameters
optim(c(mu = 0, sigma = 1), nLL)$par
##       mu    sigma
## 1.218239 1.787343
# Fixing sigma = 2
nLL <- make.NegLogLik(normals, c(FALSE, 2))
optimize(nLL, c(-1, 3))$minimum
## [1] 1.217775
# Fixing mu = 1
nLL <- make.NegLogLik(normals, c(1, FALSE))
optimize(nLL, c(1e-6, 10))$minimum
## [1] 1.800596
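For reference, the value returned by the inner function of make.NegLogLik, -(a + b), corresponds to the negative log-likelihood of independent normal observations. The symbols $x_i$ (the data points) and $n$ (the length of data) are introduced here only for notation.

```latex
-\log L(\mu, \sigma)
  = \underbrace{\frac{n}{2}\log\left(2\pi\sigma^{2}\right)}_{-a}
  + \underbrace{\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(x_i - \mu\right)^{2}}_{-b}
```

Minimizing this expression over $\mu$ and $\sigma$ is equivalent to maximizing the normal likelihood, which is why the minimizing routines can be used directly.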
Debugging
message: generic notification/diagnostic message, execution continues
message() = generate message
warning: something's wrong but not fatal, execution continues
warning() = generate warning
error: fatal problem occurred, execution stops
stop() = generate error
condition: generic concept for indicating something unexpected can occur
invisible() = suppresses auto printing
Note: random number generator must be controlled to reproduce problems (set.seed to pinpoint
problem)
traceback: prints out function call stack after error occurs
must be called right after error
debug: flags function for debug mode, allows to step through function one line at a time
debug(function) = enter debug mode
browser: suspends the execution of function wherever it's placed
embedded in code and when the code is run, the browser comes up
trace: allows inserting debugging code into a function at specific places
recover: error handler, freezes at point of error
options(error = recover) = instead of console, brings up menu (similar to browser)
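A sketch of the three condition types in one hypothetical function, check_value. tryCatch(), suppressMessages(), and conditionMessage() are base R functions not covered in the notes above; they are used here only to capture each condition for inspection.

```r
check_value <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")   # error: execution stops
  if (x < 0) warning("x is negative")             # warning: execution continues
  message("checking ", x)                         # message: execution continues
  invisible(x)                                    # returned without auto-printing
}

res_ok   <- suppressMessages(check_value(2))
res_warn <- tryCatch(check_value(-1),
                     warning = function(w) conditionMessage(w))
res_err  <- tryCatch(check_value("a"),
                     error = function(e) conditionMessage(e))
```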
R Profiler
optimizing code cannot be done without performance analysis and profiling
# system.time example
system.time({
    n <- 1000
    r <- numeric(n)
    for (i in 1:n) {
        x <- rnorm(n)
        r[i] <- mean(x)
    }
})
##    user  system elapsed
##   0.148   0.004   0.173
system.time(expression)
takes an R expression and returns the amount of time needed to execute it (useful when you already know where the bottleneck is)
time is reported in seconds; if an error occurs, the time until the error is given
multiple lines of code can be wrapped with {}
returns an object of class proc_time
user time = CPU time charged to executing the expression (the time the computer experiences)
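A minimal sketch of inspecting the object that system.time() returns; the loop size is arbitrary and the actual timings vary by machine.

```r
timing <- system.time({
  n <- 200
  r <- numeric(n)
  for (i in 1:n) {
    r[i] <- mean(rnorm(n))
  }
})

class(timing)                      # "proc_time"
elapsed <- timing[["elapsed"]]     # wall-clock seconds
user    <- timing[["user.self"]]   # CPU seconds spent in this R process
```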