Dzone R Refcard
232
R Essentials
Introduction
Installation and IDE
Starting With R
Data Structures
BY G. RYAN SPAIN
INTRODUCTION
WHAT IS R?
R is a highly extensible, open-source programming language used
mainly for statistical analysis and graphics. It is a GNU project
very similar to the S language. R's strengths include its varying
data structures, which can be more intuitive than data storage in
other languages; its built-in statistical and graphical functions;
and its large collection of useful plugins that can enhance the
language's abilities in many different ways.
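OPERATOR    DESCRIPTION
+           Addition symbol.
-           Subtraction symbol.
*           Multiplication symbol.
/           Division symbol.
^           Exponent symbol.
%%          Modulus symbol (remainder after division).
()          Parentheses, used to group operations and control their order.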
DECLARING VARIABLES
Variables in R are defined using the <- operator. You can consider
the <- as an arrow pointing from the value of the variable on the
right to the variable name on the left. So the expression x <- 15
would store the value 15 as the variable x. When you declare
a variable in R, it does not automatically print the variable or
its value; that is, the interface does not return anything when
a variable is declared, and will simply ready itself for the next
command. To view the contents of a variable in R, use the variable
name with no additional expressions or functions and execute the
command. This will display the value of the variable.
> x <- 15
> x
[1] 15
Here, the > denotes the command input, and again the output is
printed with its index position within the output.
VECTORS
A vector is the most basic object in R. An atomic vector is a linear
(flat) collection of values of one basic type. The types of atomic
vector are: logical, integer, double, complex, character, and raw.
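As a quick illustration of the six atomic types listed above, the sketch below asks R for the type of one value of each kind. The sample values are arbitrary choices for illustration, not taken from the Refcard.

```r
# One sample value per atomic type; typeof() reports the storage type.
types <- c(
  typeof(TRUE),        # "logical"
  typeof(15L),         # "integer"  (the L suffix requests an integer)
  typeof(15),          # "double"   (plain numbers are double by default)
  typeof(1 + 2i),      # "complex"
  typeof("fifteen"),   # "character"
  typeof(as.raw(15))   # "raw"
)
print(types)
```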
RSTUDIO
RStudio is a popular open-source integrated development
environment (IDE) for R. It includes a console for directly
executing R commands, as well as an editor for building longer R
scripts. It is also able to keep track of and view variable data and
access documentation and R graphics in the same environment.
Moreover, RStudio allows you to enable additional R packages
through the interface without a command. RStudio is available
for Windows, Mac, and several Linux operating systems. You can
download RStudio at rstudio.com/products/rstudio/download.
STARTING WITH R
BASIC MATHEMATICAL OPERATIONS
At its simplest, R can function like a calculator. Basic calculations
need nothing more than operators and built-in functions: the line
4 + 5 returns the result [1] 9. Since R often deals with lengthy,
possibly tabular datasets, its output includes index positions; in
this case, the [1] that printed with our result.
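A few calculator-style lines, as a minimal sketch of the arithmetic operators. The variable names are only there so the results can be printed together.

```r
sum_result   <- 4 + 5         # 9
power_result <- 2 ^ 3         # 8
mod_result   <- 10 %% 3       # 1 (remainder of 10 / 3)
grouped      <- (4 + 5) * 2   # 18; the parentheses are evaluated first

print(c(sum_result, power_result, mod_result, grouped))
```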
DZONE, INC.
DZONE.COM
ATOMIC VECTOR TYPE
Logical
Integer
Double
Complex
Character
Raw
Notice how the first two operations occurred naturally, but at the third
operation, there was no third element of the shorter vector to use.
Therefore, R started over at the beginning of the shorter vector and
continued the operation. Once the fifth element of the longer vector
was reached, R repeated the process.
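The recycling behavior described in this section can be reproduced with a two-element vector and a six-element vector. The values below are illustrative assumptions, since the Refcard's own example is not shown here.

```r
long_vec  <- c(10, 20, 30, 40, 50, 60)
short_vec <- c(1, 2)

# short_vec is reused from its start at the 3rd and 5th elements of long_vec:
# 10+1, 20+2, 30+1, 40+2, 50+1, 60+2
recycled <- long_vec + short_vec
print(recycled)
```

R stays silent here because 6 is a multiple of 2. When the longer length is not a multiple of the shorter one, R still recycles but also issues a warning.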
To gather data as a vector, use the c() function, which combines its
arguments into a vector. Here's a basic example:
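A minimal sketch of c() in use. The values 1 through 4 are an assumption chosen for illustration.

```r
x <- c(1, 2, 3, 4)   # combine four numbers into one vector
length(x)            # 4

# c() also flattens any vectors it is given into a single vector.
y <- c(x, 5, 6)
length(y)            # 6
```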
Note: To learn more about a function in R, you can use the ? operator or
the help() function. This will give you more information, including a
description, usage, and arguments. To learn more about c(), you can
enter ?c or help(c). For more on help(), enter ?help or help(help).
For a run of consecutive whole numbers, you can use the : operator
instead, which will create a series from the value of its first argument
to the value of its second argument. When using : you do not need the
c() function to combine the data, as this is done automatically.
> x <- 1:4
> x
[1] 1 2 3 4
LISTS
Lists are vectors that allow their elements to be any type of object. They
are created using the list() function.
> x <- list(1, "two", c(3, 4))
> str(x)
List of 3
 $ : num 1
 $ : chr "two"
 $ : num [1:2] 3 4
Lists can also contain other lists:
> x <- list(1, "two", list(3, 4))
> str(x)
List of 3
 $ : num 1
 $ : chr "two"
 $ :List of 2
  ..$ : num 3
  ..$ : num 4
This means that a list is a recursive object (you can test this with the
is.recursive() function). Lists can be hypothetically nested
indefinitely.
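As a small sketch of what "recursive" means in practice, the lines below build a nested list, test it with is.recursive(), and reach into the inner list. The name nested is chosen here for illustration.

```r
nested <- list(1, "two", list(3, 4))

is.recursive(nested)       # TRUE: a list can hold other lists
is.recursive(c(1, 2, 3))   # FALSE: atomic vectors cannot

# [[ ]] extracts a single element; chaining it walks into the inner list.
inner_value <- nested[[3]][[2]]   # 4
```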
FACTORS
A factor is a vector that stores categorical data: data that can be
classified by a finite number of categories. These categories are known as
the levels of a factor.
Say you define x as a collection of the strings "a", "b", and "c":
x <- c("b", "a", "b", "c", "a", "a").
Using the factor() function, you can have R convert the atomic
character vector into a factor. R will automatically attempt to determine
the levels of the factor; this will produce an error when factor() is given
an argument that is non-atomic. Let's take a look at the factor here:
> x <- factor(x)
> x
[1] b a b c a a
Levels: a b c
> str(x)
 Factor w/ 3 levels "a","b","c": 2 1 2 3 1 1
> levels(x)
[1] "a" "b" "c"
> table(x)
x
a b c
3 2 1
The table() function gives a table summarizing the factor. Using the
table() function on x returned the name of the variable, a list of the
levels of x, and then, underneath, the number of values that occur in x
corresponding with the above level. So this table shows us that, in the
factor x, there are three instances of the level "a", two instances of "b",
and one instance of "c".
If the levels of your factor need to be in a particular order, you can use
the factor() argument levels to define the order, and set the argument
ordered to TRUE:
> x <- c("b", "a", "b", "c", "a", "a")
> x <- factor(x, levels = c("c", "b", "a"), ordered = TRUE)
> x
[1] b a b c a a
Levels: c < b < a
> str(x)
 Ord.factor w/ 3 levels "c"<"b"<"a": 2 3 2 1 3 3
> levels(x)
[1] "c" "b" "a"
> table(x)
x
c b a
1 2 3
Now R returned the levels in the order specified by the vector given to
the levels argument. The < (less than) symbol in the output of x and
str(x) indicates that these levels are ordered, and the str(x) function
reports that the object is an ordered factor.
MATRIXES
A matrix is, in most cases, a two-dimensional atomic data structure
(though you can have a one-dimensional matrix, or a non-atomic
matrix made from a list). To create a matrix, you can use the
matrix() function on a vector with the nrow and/or ncol arguments.
matrix(1:20, nrow = 5) will produce a matrix with five rows
and four columns containing the numbers one through twenty.
matrix(1:20, ncol = 4) produces the same matrix.
> x <- matrix(1:20, nrow = 5)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
The matrix will fill by column unless the argument byrow is set to TRUE.
Note that the position indexes are assigned to rows and columns here.
Since a matrix is naturally two-dimensional, R provides column indexes
to more easily interact with the matrix. You can use the index vector
[] to return the value of an individual cell of the matrix. x[1,2] will
return the value of row 1, column 2: 6. You can also use the index
vector to return the values of whole rows or columns. x[1,] will return
1 6 11 16, the elements of the first row of the matrix.
You can also create a matrix by assigning dimensions to a vector using
the dim() function, as shown here:
x <- 1:20
dim(x) <- c(5, 4)
This created the same matrix you saw earlier. With the dim() function,
you can also redefine the dimensions of a matrix. dim(x) <- c(4,5)
will redraw the matrix to have four rows and five columns.
ARRAYS
What happens if the vector you passed to the dim() function had more
than two elements? If we had written dim(x) <- c(5, 2, 2) we would
have created another data structure: an array.
Technically, a matrix is specifically a two-dimensional array, but
arrays can have any number of dimensions. When x contained 20
elements (x <- 1:20), executing dim(x) <- c(5, 2, 2) would have
given x three dimensions. R would represent this as a series of matrixes:
> x
, , 1

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

, , 2

     [,1] [,2]
[1,]   11   16
[2,]   12   17
[3,]   13   18
[4,]   14   19
[5,]   15   20
In the case of an array, the row and column numbers remain in the
same order, and R will show the other dimensions above each matrix.
In this case, we received two matrixes (based on the third dimension
given) of five rows (based on the first dimension given) and two
columns (based on the second dimension given). R displays arrays in
order of each dimension given, with the earliest of the extra
dimensions changing fastest, so if we had an array of four
dimensions (say 5, 2, 2, 2), it would print matrixes , , 1, 1, then
, , 2, 1, then , , 1, 2, and lastly , , 2, 2.
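As a compact sketch of the matrix and array behavior covered in this section, the lines below build the 5 x 4 matrix, index it, fill it by row, and then turn a plain vector into a three-dimensional array. The names m, by_row, and a are chosen here for illustration.

```r
m <- matrix(1:20, nrow = 5)      # fills column by column
m[1, 2]                          # 6
m[1, ]                           # 1 6 11 16

by_row <- matrix(1:20, nrow = 5, byrow = TRUE)   # fills row by row instead
by_row[1, ]                      # 1 2 3 4

a <- 1:20
dim(a) <- c(5, 2, 2)             # three dimensions: a is now an array
a[1, 2, 2]                       # 16 (row 1, column 2 of the second matrix)
is.array(a)                      # TRUE
```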
DATA FRAMES
A data frame is a (generally) two-dimensional structure consisting of
vectors of the same length. Data frames are used often, as they are the
closest data structure in R to a spreadsheet or relational data table. You
can use the data.frame() function to create a data frame.
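The data frame discussed in this section can be built along the following lines. The exact call is an assumption: the column names y and z and the values are inferred from the surrounding text and the printed output elsewhere in this section.

```r
x <- data.frame(
  y = 1:3,
  z = c("one", "two", "three"),
  stringsAsFactors = FALSE   # keep z as a character vector, not a factor
)

dim(x)       # 3 2 (rows, then columns)
names(x)     # "y" "z"
class(x$z)   # "character"
```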
In this example, we have created a data frame with two columns and
three rows. Using y = and z = defines the names of the columns, which
will make them easier to access, manipulate, and analyze. Here, we've
used the argument stringsAsFactors = FALSE to make column z an
atomic character vector instead of a factor. In versions of R before
4.0.0, data.frame() coerced vectors of strings into factors by default;
from R 4.0.0 onward, the default is stringsAsFactors = FALSE.
You can use the names() function to change the names of your columns.
names(x) <- c("a", "b") provides a vector of new values to replace
the column names, changing the columns to a and b. To change a certain
column or columns, you can use the index vector to specify which
column(s) to rename.
> names(x)[1] <- "a"
> x
  a     z
1 1   one
2 2   two
3 3 three
You can combine data frames with the cbind() function or the
rbind() function. cbind() will add the columns of one data frame to
another, as long as the frames have the same number of rows.
> cbind(x, data.frame(b = c("I", "II", "III"), stringsAsFactors = FALSE))
  a     z   b
1 1   one   I
2 2   two  II
3 3 three III
rbind() will add the rows of one data frame to the rows of another, so
long as the frames have the same number of columns and have the
same column names.
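A minimal sketch of rbind() on data frames, which also shows the coercion that can happen when a plain vector is bound instead of a data frame. The data frame x is rebuilt here so the example is self-contained, and the names safe and coerced are chosen for illustration.

```r
x <- data.frame(a = 1:3, z = c("one", "two", "three"),
                stringsAsFactors = FALSE)

# Binding another data frame with matching column names keeps column types.
safe <- rbind(x, data.frame(a = 4L, z = "four", stringsAsFactors = FALSE))
class(safe$a)      # "integer"

# c(4, "four") is a character vector, so column a is coerced to character.
coerced <- rbind(x, c(4, "four"))
class(coerced$a)   # "character"
```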
cbind() and rbind() will also coerce vectors and matrixes of the
proper lengths into a data frame, so long as one of the arguments of the
bind function is a data frame. We could have used
rbind(x, c(4, "four")) to take the data frame x we defined earlier,
and coerce the vector c(4, "four") to fit into the existing data frame.
But coercion can affect the way your data frame stores your data. In this
case, the vector c(4, "four") would have coerced the integer 4 into the
character "4". Then the data frame would have coerced the entire first
column into a character vector. This makes it safer to use rbind() and
cbind() to bind data frames with each other.
FUNCTION
summary()
str()
dim()
levels()
length()
names()
class()
attributes()
object.size()
order()
rank()
head()
tail()
MANIPULATING DATA
FUNCTIONS
FUNCTION    EXAMPLES                 DESCRIPTION
seq()       seq(x, y, by = z)        Increments x by z until y is reached or
                                     would be passed. seq(0, 10, by = 5)
                                     returns 0 5 10.
seq()       seq(x, y, length = z)
rep()       rep(x, times = y)
rep()       rep(x, each = y)
paste()
t()         t(x)
rbind()     rbind(x, y)
cbind()     cbind(x, y)
strsplit()  strsplit(x, "regex")
nchar()     nchar(c(x, y))
substr()    substr(x, y, z)
sort()      sort(x)
MATH FUNCTIONS
FUNCTION
abs()
ceiling()
floor()
trunc()
exp()
log(), log10(), log2()
cos(), sin(), tan(), acos(), asin(), atan()
atan2(), cospi(), sinpi(), tanpi()
cummax(), cummin(), cumprod(), cumsum()
max(), min()
range()
mean()
median()
sd()
cov(), var()
cor()
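A short sketch exercising several of the data manipulation functions listed in this section. The inputs are arbitrary illustrative values, and length.out is the full name of the argument that the table abbreviates as length.

```r
by_five     <- seq(0, 10, by = 5)           # 0 5 10
five_points <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
whole_twice <- rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"
each_twice  <- rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"
joined      <- paste("R", "Essentials")     # "R Essentials"
pieces      <- strsplit("a,b,c", ",")[[1]]  # "a" "b" "c"
lengths_chr <- nchar(c("one", "three"))     # 3 5
first_three <- substr("Essentials", 1, 3)   # "Ess"
sorted      <- sort(c(3, 1, 2))             # 1 2 3
```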
STATISTICAL FUNCTIONS
FUNCTION    DESCRIPTION
lm()        Fits a linear model based on a formula given as an argument.
glm()
deviance()
coef()
confint()
vcov()
fitted()    Returns model fitted values from the argument.
predict()
resid()
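As a hedged sketch of how these modeling functions fit together, the example below fits a linear model to the cars dataset that ships with R (stopping distance against speed) and then queries the fitted object. The choice of dataset and the name fit are assumptions made for illustration.

```r
# cars is a built-in data frame with columns speed and dist.
fit <- lm(dist ~ speed, data = cars)

coefficients_fit <- coef(fit)      # intercept and slope
fitted_values    <- fitted(fit)    # model predictions for the observed rows
residual_values  <- resid(fit)     # observed minus fitted
intervals        <- confint(fit)   # confidence intervals for the coefficients

# Predict the stopping distance for a speed that may not be in the data.
predicted <- predict(fit, newdata = data.frame(speed = 10))
```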
When you create your own function, you can also set default values for
its arguments. To do so, name the argument, then type = and then the
default value. Giving an argument a default value makes that argument
optional.
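A minimal sketch of a user-defined function with a default argument. The function name area and its arguments are hypothetical, chosen only to illustrate the syntax.

```r
# area() is a hypothetical example function, not part of R.
# height has a default value of 1, so callers may leave it out.
area <- function(width, height = 1) {
  width * height
}

area(5)               # 5  (height falls back to its default)
area(5, height = 3)   # 15
```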
FUNCTION    DESCRIPTION
_norm()     Normal distribution.
_binom()    Binomial distribution.
_pois()     Poisson distribution.
_exp()      Exponential distribution.
_chisq()    Chi-Squared distribution.
_gamma()    Gamma distribution.
_unif()     Uniform distribution.
Replace the _ with d (density), p (distribution function), q (quantile
function), or r (random generation), as in dnorm() or rnorm().
There are many built-in functions in R, and many we could not even list
here. If you are unable to find a function you need, though, R allows you
to create your own function using the function() function. Functions
can be created in-console, but often more complex functions are easier to
write as .R scripts, which you can run, copy, or alter as you need.
LOGICAL OPERATORS
OPERATOR    DESCRIPTION
<           Less than.
<=          Less than or equal to.
>           Greater than.
>=          Greater than or equal to.
==          Equal to.
!=          Not equal to.
|           OR operator.
||          OR operator for single (length-one) values. Older versions of R
            evaluated only the leftmost element of a longer vector; R 4.3.0
            and later raise an error instead.
&           AND operator (evaluated before OR operators).
&&          AND operator for single (length-one) values. Older versions of R
            evaluated only the leftmost element of a longer vector; R 4.3.0
            and later raise an error instead.
!           NOT operator.
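As a brief sketch of the logical operators in this Refcard, the lines below compare a small numeric vector element by element and then combine two single-value tests with &&. The vector x is an illustrative assumption.

```r
x <- c(-2, 0, 3, 7)

positive    <- x > 0            # FALSE FALSE TRUE TRUE
in_range    <- x > 0 & x < 5    # FALSE FALSE TRUE FALSE (element by element)
outside     <- x < 0 | x > 5    # TRUE FALSE FALSE TRUE
not_positive <- !(x > 0)        # TRUE TRUE FALSE FALSE

# && works on single values, so each side is reduced to one TRUE/FALSE first.
both_hold <- length(x) == 4 && x[1] < 0   # TRUE
```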
LOGICAL FUNCTIONS
FUNCTION    EXAMPLES
isTRUE()    isTRUE(x)
xor()       xor(x, y)
which()     which(x)
any()       any(x)
all()       all(x)
FUNCTION                 DESCRIPTION
ls()
list.files() or dir()
dir.create()
setwd()
getwd()
read.csv()               Reads the contents of the specified .csv file.
write.csv()
file.exists()
file.create()
file.copy()
file.rename()            Renames a file.
file.remove()            Deletes a file.
file.info()
unlink()
GRAPHICAL FUNCTIONS
R is known for its extensive, easy-to-use graphical functions. Here are
a few basic ones to get you started. Packages gplot and ggplot2 can help
you create even more customized graphics. For the sake of length, we
can't go over all the graphs you can create here, or all the arguments
you can use to customize them, but you should get a sense of what kind
of graphs you can create in R.
FUNCTION
plot()
hist()
barplot()
boxplot()
heatmap()
lines()
dotchart()
INDEX VECTORS
There are several ways to isolate certain pieces of data from larger data
sets. Index vectors are one way you can do this. There are four types of
index vector, and each is accessed by placing square brackets [] directly
next to the name of the data structure you want to access.
TYPE                EXAMPLES
Logical             x[x > 0]
Positive Integer    x[y]
Negative Integer    x[-y]
Named               x["y"]
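As a sketch of the four index vector types named in this Refcard (logical, positive integer, negative integer, and named), the example below applies each one to a small named vector. The vector and its element names are illustrative assumptions.

```r
x <- c(a = -1, b = 4, c = 0, d = 9)

by_logical  <- x[x > 0]       # keeps elements where the test is TRUE: b, d
by_positive <- x[c(1, 3)]     # keeps positions 1 and 3: a, c
by_negative <- x[-c(1, 3)]    # drops positions 1 and 3: b, d
by_name     <- x["d"]         # keeps the element named "d"
```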
USEFUL RESOURCES
There's a lot more you can learn about and do with R than we can cover
in this Refcard. But there are a lot of resources out there to help you
learn more, and there are also a lot of R packages that can make R even
more powerful. Try looking into these resources and packages to step
up your R game. Use the install.packages() function to download a
package (just put the package name in quotation marks as the function's
argument). You'll need to load packages on new R sessions using the
library() function, or by using your IDE (in RStudio, you can select
checkboxes in the packages tab to load other installed packages).
PACKAGE
swirl
ggplot2
RColorBrewer
data.table
plyr
RESOURCE          DESCRIPTION
r-bloggers.com
datacamp.com      This site has some in-browser tutorials that can help you
                  dig deeper into R.
inside-r.org
DZONE, INC.
150 PRESTON EXECUTIVE DR.
CARY, NC 27513
888.678.0399
919.678.0300
DZone communities deliver over 6 million pages each month to more than 3.3 million software
developers, architects and decision makers. DZone offers something for everyone, including news,
tutorials, cheat sheets, research guides, feature articles, source code and more.
SPONSORSHIP OPPORTUNITIES
sales@dzone.com
Copyright 2016 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission
of the publisher.
VERSION 1.0