R Basic
Introduction to R
Facilitator: Proloy Barua, Ph.D.
Overview
About the course
R is one of the most widely used programming languages in the data industry. It is built
around vectors and extended by a large collection of contributed packages, and it is in high
demand among data scientists, analysts, and statisticians alike. This Introduction to R course
covers the basics of this open-source language, including vectors, factors, lists, and data
frames. You’ll gain useful coding skills and be ready to start your own data analysis in R.
The name “R” is derived from the first letter of the first names of its two developers, Ross
Ihaka and Robert Gentleman, who created it in 1991 in the Department of Statistics at the
University of Auckland.
Course structure
There will be six lessons. The course starts with basic operations, such as installing R
and RStudio and using the console as a calculator (lesson-1), followed by the basic data types
in R (lesson-2). We then move on to basic data structures (lesson-3) and indexing to
extract values of a variable in a dataset (lesson-4). Next, we learn how to do vector and
matrix algebra in R (lesson-5). Finally, we learn data exploration, data cleaning, and
plotting using simple R code (lesson-6).
Course learning outcomes
Upon completion of this Introduction to R course, learners will be able to use the R basics for
their own data analysis. These sought-after skills can help you progress in your career and set
you up for further self-learning.
Facilitator
Dr Proloy Barua, Assistant Scientist, BRAC James P Grant School of Public Health, BRAC
University
Getting help in R
R has built-in facilities for finding help and documentation. Any text after a # (hash) sign
is treated as a comment and is not executed, so you can use # to annotate your commands as follows
• help.search("mean") #search the installed documentation for a topic
• find("mean") #show which attached package contains an object
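The help tools and comments described above can be tried with a small sketch like the one below. The vector x and the names avg and where are illustrative assumptions, not part of the course material; the commented-out calls open the help viewer when run interactively.

```r
# Everything after a # sign is a comment and is ignored by R.
x <- c(2, 4, 6)        # hypothetical example vector
avg <- mean(x)         # arithmetic mean of x
print(avg)

# Ways to look up documentation (uncomment to run interactively):
# ?mean                # help page for the function mean
# help("mean")         # same as ?mean
# help.search("mean")  # search the installed help pages for a topic

where <- find("mean")  # which attached package provides mean
print(where)
```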
Some Basics of R using keyboard
• Ctrl+Enter #for execution of commands or arguments
• Ctrl+l #To clear console window
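• Ctrl+a #To move the cursor to the start of the line in a terminal R console (in RStudio on Windows it selects all)
• Ctrl+e #To move the cursor to the end of the line in a terminal R console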
• Ctrl+u #To clear current line
• Ctrl+c #To copy
• Ctrl+v #To paste
• rm(list=ls()) # Remove all objects from the workspace
• getwd() # Get the current working directory
• setwd(d) # Set the working directory to the path stored in d
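The workspace commands at the end of the list can be combined as in the following minimal sketch. The use of tempdir() as the target folder and the name old_dir are assumptions made only so that the example runs on any machine.

```r
old_dir <- getwd()   # remember the current working directory
d <- tempdir()       # a temporary folder that exists in every R session (illustrative choice)
setwd(d)             # make d the working directory
print(getwd())
setwd(old_dir)       # go back to where we started

rm(old_dir)          # remove a single object; rm(list = ls()) would remove all objects
```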
Free Online Resources
• Installing R (https://github.com/genomicsclass/windows#installing-r)
• R Studio (https://www.rstudio.com/products/rstudio/download/)
• R Studio Cheat Sheets (https://rstudio.cloud/learn/cheat-sheets)
• Introduction to R by Robert J. Hijmans available
at https://rspatial.org/intr/IntroductiontoR.pdf
• An Introduction to R by W. N. Venables, D. M. Smith and the R Core
Team https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
• R for Beginners by Emmanuel Paradis http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
• R tutorial by Kelly Black http://www.cyclismo.org/tutorial/R/
• A brief overview by Ross Ihaka (one of the originators of
R) https://www.stat.auckland.ac.nz/~ihaka/120/Notes/ch02.pdf
• Information Visualization course by Ross
Ihaka https://www.stat.auckland.ac.nz/~ihaka/120/notes.html
• A Beginner’s Guide to R by Zuur, Ieno and
Meesters http://www.springer.com/us/book/9780387938363
• R in a nutshell by Joseph Adler http://shop.oreilly.com/product/0636920022008.do
• The Art of R Programming by Norman Matloff http://www.nostarch.com/artofr.htm
• Introduction to R by Datacamp https://www.datacamp.com/courses/free-introduction-to-r
• Good material on rstatistics.net http://rstatistics.net/
• Watch some Google Developers videos
http://www.youtube.com/playlist?list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP
• Advanced R by Hadley Wickham http://adv-r.had.co.nz/
• StackOverflow https://stackoverflow.com
• R-Bloggers https://www.r-bloggers.com
• Rseek https://rseek.org/ (an R search engine powered by Google)
• Stackexchange https://stackexchange.com
Course outline
Lesson-1: Introduction
   1. Installing R and RStudio
   2. How it works
   3. The four panes in RStudio
      Pane 1: View files and data
      Pane 2: See workspace and history
      Pane 3: See files, plots, packages, and help
      Pane 4: Console
   4. Open/save an R script file
   5. Use the # symbol for comments
   6. The assignment operator <-
   7. Use Ctrl+Enter on the keyboard to run code
Lesson-2: Basic data types in R
   1. Numeric values
   2. Integer values
   3. Character values
   4. Logical values
   5. Factors
   6. Missing values
   7. Time
Lesson-3: Basic data structures
   1. Matrix
   2. List
   3. Data frame
Lesson-4: Indexing
   1. Vector
   2. Matrix
   3. List
   4. Data frame
   5. which, %in%, and match
Lesson-5: Algebra
   1. Vector algebra
      Multiplication works element by element (same length)
      Multiplication works element by element (different lengths); the shorter vector is “recycled”
   2. Logical comparisons
      == is used to test for equality
      & is Boolean “AND”, and | is Boolean “OR”
      “Less than or equal” is <=, and “greater than or equal” is >=
   3. Functions
      sqrt(x)
      exp(x)
      min(x)
      max(x)
      range(x)
      sum(x)
      mean(x)
      median(x)
      sd(x)
      prod(x)
   4. Random numbers
      runif(10) # 10 numbers sampled from the uniform distribution between 0 and 1
      rnorm(10, mean=10, sd=2) # normally distributed numbers
      set.seed(12) # to take exactly the same “random” sample each time
   5. Matrix algebra
      m <- matrix(1:6, ncol=3, nrow=2, byrow=TRUE)
      ncol, nrow, and byrow are arguments of the matrix function
      m * 5 multiplies all values of m by 5
      Math with a matrix and a vector
      m * m multiplies two matrices element by element (the matrix product uses %*%)
Lesson-6: Data exploration
   1. Summary and table / data cleaning
      d <- data.frame(id=1:10, name=c('Bob', 'Bobby', '???', 'Bob', 'Bab', 'Jim', 'Jim', 'jim', '', 'Jim'), score1=c(8, 10, 7, 9, 2, 5, 1, 6, 3, 4), score2=c(3,4,5,-999,5,5,-999,2,3,4), stringsAsFactors=FALSE)
      d
      summary(d)
      Which values in score2 are -999? Use i <- d$score2 == -999
      To set these to NA, use d$score2[i] <- NA
      This can be done in a single line of code: d$score2[d$score2 == -999] <- NA
      summary(d)
      For character (and integer) variables it can be useful to use the unique and table functions
      unique(d$name)
      table(d$name)
      d$name[d$name %in% c('Bab', 'Bobby')] <- 'Bob' # replace 'Bab' and 'Bobby' with 'Bob'
      table(d$name)
      d$name[d$name %in% 'jim'] <- 'Jim' # replace 'jim' with 'Jim'
      table(d$name)
      d$name[d$name == '???'] <- NA
      table(d$name)
      table(d$name, useNA='ifany') # force table to also count the NA values
      Note that there is one ‘empty’ value, d$name[9]
      d$name[d$name == ''] <- NA # replace the ‘empty’ value with NA (missing value)
      table(d[ c('name', 'score2')]) # use table() to make a contingency table of two variables
   2. Quantile, range, and mean
      quantile(d$score1)
      range(d$score1)
      mean(d$score1)
      You may need to use na.rm=TRUE if there are NA values
      quantile(d$score2) # gives an error because score2 contains NA values
      range(d$score2) # returns NA NA
      quantile(d$score2, na.rm=TRUE)
      range(d$score2, na.rm=TRUE)
   3. Make plots
      par(mfrow=c(2,2)) # set up the canvas for two rows and two columns
      plot(d$score1, d$score2)
      boxplot(d[, c('score1', 'score2')])
      plot(sort(d$score1))
      hist(d$score2)
Presentation
R Markdown File
R Scripts
Numeric values
a <- 7 # one element
show(a)
print(a)
a
class(a)
length(a) # to see how many elements or observations in the vector
rm(a) # Remove the object a from the workspace. Now show(a) gives an error because a no longer exists
Integer values
b <- 7L
b
class(b)
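The difference between numeric and integer values can be checked with a short sketch such as the following. The names b_num and a_int are hypothetical and only serve the illustration.

```r
a <- 7      # numeric (double precision) by default
b <- 7L     # the L suffix creates an integer
class(a)    # "numeric"
class(b)    # "integer"

a == b            # TRUE: the values are equal
identical(a, b)   # FALSE: the storage types differ

b_num <- as.numeric(b)     # convert integer to numeric
a_int <- as.integer(7.9)   # conversion truncates toward zero, giving 7
```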
Character values
x <- "Proloy"
x
class(x)
Logical values
x <- FALSE
y <- TRUE
x
y
class(x)
class(y)
Factors
countries <- c('Bangladesh', 'Bangladesh', 'India', 'Afghanistan', 'India')
countries
class(countries)
f1 <- as.factor(countries) # converting character values into factor values
f1
class(f1)
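A factor stores each distinct category once and keeps integer codes that point to those categories. The sketch below, which reuses the countries vector from the script, shows how to inspect both parts.

```r
countries <- c("Bangladesh", "Bangladesh", "India", "Afghanistan", "India")
f1 <- as.factor(countries)

levels(f1)       # the distinct categories, sorted alphabetically
nlevels(f1)      # number of categories: 3
as.integer(f1)   # integer codes pointing into levels(f1): 2 2 3 1 3
table(f1)        # how often each category occurs
```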
Missing values
m <- c(2, NA, 5, 2, NA, 2) # NA (“Not Available”) (e.g. missing value = .)
is.na(m) # To check NA or missing values
class(m)
which(is.na(m)) # Get positions of NA
n <- c(5, 9, NaN, 3, 8, NA, NaN) # NaN (“Not a Number”) (e.g. 0 / 0)
is.nan(n) # To check NaN values
class(n)
which(is.nan(n)) # Get positions of NaN
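One detail that is easy to miss is that is.na() is also TRUE for NaN, while is.nan() is TRUE only for NaN. The following sketch, using the vector n from the script, shows the difference and how na.rm = TRUE skips both kinds of value.

```r
n <- c(5, 9, NaN, 3, 8, NA, NaN)

is.na(n)         # TRUE for NA and for NaN
is.nan(n)        # TRUE only for NaN
sum(is.na(n))    # 3 missing entries in total
sum(is.nan(n))   # 2 of them are NaN

mean(n)                 # not a usable number, because of the missing values
mean(n, na.rm = TRUE)   # 6.25, the mean of 5, 9, 3 and 8
```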
Time
d<- Sys.Date()
d
class(d)
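Dates can also be created from text and used in arithmetic. The sketch below assumes two made-up dates (start and end) purely for illustration.

```r
today <- Sys.Date()
class(today)                      # "Date"

start <- as.Date("2024-01-15")    # hypothetical dates in year-month-day format
end   <- as.Date("2024-03-01")
days_between <- as.numeric(end - start)   # 46 days
print(days_between)

format(start, "%d/%m/%Y")         # "15/01/2024"
```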
Matrix
A two-dimensional rectangular layout is called a matrix. We can create a matrix with
two rows and three columns using the following code
m <- matrix(ncol=3, nrow=2)
m
Note that all values in the matrix above are missing (NA). Let’s make a matrix with the
values 1 to 6
m <- matrix(data=c(1:6), ncol=3, nrow=2, byrow = TRUE) # Arguments, like parameters, are information passed to functions.
m
m <- matrix(data=c(1:6), ncol=3, nrow=2, byrow = FALSE) # By default (byrow = FALSE) elements are arranged sequentially by column.
m
t(m) # the t (transpose) function switches the rows and columns
A matrix can only store a single data type. If you try to mix character and numeric
values, all values will become character values (as the other way around may not be
possible)
vchar <- c("a", "b")
class(vchar)
vnumb <- c(1,2)
class(vnumb)
matrix(c(vchar,vnumb), ncol=2, nrow=2, byrow = FALSE)
m <- matrix(data=c(1:6), ncol=3, nrow=2, byrow = FALSE)
m
# Define the row and column names of matrix m
rownames(m) <- c("row1", "row2") # Row names are less important.
colnames(m) <- c("ID", "X", "Y")
m
class(m)
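The points made in this section (filling by row, transposing, and the single data type of a matrix) can be checked with a compact sketch like this one. The names tm and mixed are illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
dim(m)        # 2 rows, 3 columns

tm <- t(m)    # transpose: rows become columns
dim(tm)       # 3 rows, 2 columns

# Mixing character and numeric values turns everything into character
mixed <- matrix(c("a", "b", 1, 2), nrow = 2, ncol = 2)
mixed[1, 2]          # "1" (a character string, no longer a number)
is.character(mixed)  # TRUE
```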
List
A list in R is similar to your to-do list at work or school: it can hold items of different
kinds. A list is a kind of super data type whose elements can be vectors, matrices, character
strings, or other lists
v <- c(1:10)
m <- matrix(data=c(1:6), ncol = 3, nrow=2)
c <- "abc"
l <- list(v, m, c)
names(l) <- c("first", "second", "third") # Naming of list elements
print(l)
class(l)
Data frame
It is rectangular like a matrix, but unlike matrices a data.frame can have columns
(variables) of different data types such as numeric, character, factor. Let’s create a data
frame with the following four variables or vectors
ID <- as.integer(c(1,2,3,4))
name <- c("name1", "name2", "name3", "name4")
sex <- as.factor(c("Female","Male","Male","Female"))
age <- as.numeric(c(36, 27, 37, 32))
df <- data.frame(ID, name, sex, age, stringsAsFactors=FALSE)
print(df)
class(df)
str(df) # to see the data structure
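Once a data frame exists, a few functions give a quick overview of it. The sketch below rebuilds the same four variables inside a single data.frame() call; the name mean_age is an assumption made for the example.

```r
df <- data.frame(
  ID   = 1:4,
  name = c("name1", "name2", "name3", "name4"),
  sex  = factor(c("Female", "Male", "Male", "Female")),
  age  = c(36, 27, 37, 32),
  stringsAsFactors = FALSE
)

nrow(df)            # 4 observations
ncol(df)            # 4 variables
sapply(df, class)   # data type of every column
mean_age <- mean(df$age)   # 33
print(mean_age)
```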
Vector
Access element(s) of a vector
b <- c(10:15)
b
b[1] # Get the first element of a vector
b[-2] # Get all elements except the second
b[1] <- 11 # use an index to change values
b[3:6] <- -99 # use an index to change values
b
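Besides single positions, a vector can be indexed with several positions, with negative positions, or with a logical condition. A minimal sketch of these variants:

```r
b <- 10:15

b[c(1, 3)]     # first and third elements: 10 12
b[-(1:2)]      # everything except the first two: 12 13 14 15
b[length(b)]   # last element: 15
b[b > 12]      # elements that satisfy a condition: 13 14 15

b[b > 12] <- 0 # replace all values above 12
b              # 10 11 12 0 0 0
```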
Matrix
Values of a matrix can be accessed through indexing
m <- matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
colnames(m) <- c('a', 'b', 'c')
m
Use two numbers in a double index, the first for the row number(s) and the second for
the column number(s).
m[2,2]
m[ ,2] # entire column
m[, c('a', 'c')] # two columns
m[1,1] <- 5 # setting values
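A few more matrix indexing patterns, sketched on the same 3 by 3 matrix (before the value in m[1,1] is changed):

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
colnames(m) <- c("a", "b", "c")

m[2, 3]                  # row 2, column 3: 6
m[2, ]                   # entire second row: 4 5 6
m[, "b"]                 # column b as a vector: 2 5 8
m[1:2, c("a", "c")]      # a 2 x 2 sub-matrix
m[m > 5]                 # values above 5, read column by column: 7 8 6 9
m[, "b", drop = FALSE]   # column b kept as a 3 x 1 matrix
```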
List
v <- c(1:10)
m <- matrix(data=c(1:6), ncol = 3, nrow=2)
c <- "abc"
l <- list(v, m, c)
names(l) <- c("first", "second", "third") # Naming of list elements
print(l)
class(l)
l$first # extract the element named first with the $ (dollar) operator
l$second
l$third
l[["first"]] # double brackets also extract the element named first
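Single and double brackets behave differently on a list: single brackets return a smaller list, while double brackets (and $) return the element itself. The sketch below builds the same list with names given directly in list(); the added element fourth is hypothetical.

```r
l <- list(first = 1:10,
          second = matrix(1:6, ncol = 3, nrow = 2),
          third = "abc")

class(l["first"])     # "list": single brackets keep the list wrapper
class(l[["first"]])   # "integer": double brackets return the element itself
l[[1]][3]             # third value of the first element: 3
l$second[2, 3]        # row 2, column 3 of the matrix: 6

l$fourth <- TRUE      # add a new (hypothetical) element
names(l)
```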
Data frame
m <- matrix(1:9, nrow=3, ncol=3, byrow=TRUE)
colnames(m) <- c('a', 'b', 'c')
d <- data.frame(m) # create a data.frame from matrix m
class(d)
d[,2] # extract a column by column number
d[, 'b'] # use the column name to get values
d[ , 'b', drop=FALSE] # keep the result as a one-column data.frame instead of a vector
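The effect of drop=FALSE, and selecting rows by a condition, can be seen in the following sketch. It reuses the data frame d built from the matrix; the name sub is an illustrative assumption.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
colnames(m) <- c("a", "b", "c")
d <- data.frame(m)

d$b                             # column b as a vector: 2 5 8
class(d[, "b"])                 # "integer": a plain vector
class(d[, "b", drop = FALSE])   # "data.frame": still a table with one column

d[2, ]                          # second row, returned as a data frame
sub <- d[d$a > 1, c("a", "c")]  # rows where a is above 1, columns a and c
print(sub)
```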
Which
Suppose we need to find the indices of the elements in a vector that have values above 15.
The function which() returns the positions of the TRUE entries of a logical vector.
x <- c(10:20)
i <- which(x > 15)
print(i)
x[i]
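It can help to look at the intermediate logical vector that which() works on. A small sketch with the same vector x:

```r
x <- 10:20

x > 15               # logical vector, TRUE where the condition holds
i <- which(x > 15)   # positions of the TRUE entries: 7 8 9 10 11
x[i]                 # the values themselves: 16 17 18 19 20
x[x > 15]            # same result without which()

which(x %% 5 == 0)   # positions of multiples of 5: 1 6 11
which.max(x)         # position of the largest value: 11
```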
%in%
A very useful operator that allows you to ask whether a set of values is present in a
vector is %in%.
x <- c(10:20)
j <- c(7,9,11,13)
j %in% x
which(j %in% x)
Match
The function match() looks up each entry of its first argument in its second argument and
returns the position of the first match, or NA when there is no match
match(j, x) # positions in x of the values in j
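The difference between %in% and match() is easiest to see side by side: %in% answers yes or no for each value, while match() says where the value was found. A sketch with the same x and j:

```r
x <- 10:20
j <- c(7, 9, 11, 13)

j %in% x          # is each value of j present in x?  FALSE FALSE TRUE TRUE
which(j %in% x)   # positions within j that were found: 3 4
match(j, x)       # positions within x, NA if absent:  NA NA 2 4
x[match(j, x)]    # look the values up again:          NA NA 11 13
```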
Logical comparisons
a == 2
b > 6 & b < 8
b > 9 | a < 2
b >= 9
a <= 2
b >= 9 | a <= 2
b >= 9 & a <= 2
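The comparisons above rely on objects a and b created earlier in the script. The self-contained sketch below uses its own made-up values for a and b so that the results are easy to predict.

```r
a <- 2                # hypothetical single value
b <- c(5, 7, 10)      # hypothetical vector

a == 2                # TRUE
b > 6 & b < 8         # element by element: FALSE TRUE FALSE
b >= 9 | a <= 2       # a is recycled for every element of b: TRUE TRUE TRUE
b >= 9 & a <= 2       # FALSE FALSE TRUE

any(b > 9)            # is at least one element above 9?  TRUE
all(b > 9)            # are all elements above 9?         FALSE
```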
Functions
sqrt(a)
exp(a)
min(a)
max(a)
range(a)
sum(a)
mean(a)
median(a)
prod(a)
sd(a)
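To see what each function returns, it helps to apply them to a vector with more than one value. The vector a below is a made-up example.

```r
a <- c(4, 9, 16, 25)   # hypothetical example vector

sqrt(a)     # element by element: 2 3 4 5
min(a)      # 4
max(a)      # 25
range(a)    # smallest and largest value: 4 25
sum(a)      # 54
mean(a)     # 13.5
median(a)   # 12.5
prod(a)     # 14400
sd(a)       # sample standard deviation
exp(0)      # 1
```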
Random numbers
r <- runif(10) # uniformly distributed numbers between 0 and 1
r <- rnorm(10, mean=10, sd =2) # normally distributed numbers
To be able to exactly reproduce examples or data analyses we often want to ensure that
we take exactly the same “random” sample each time we run our code.
set.seed(n) # n is any integer of your choice, for example set.seed(12)
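The effect of set.seed() can be verified by drawing the same sample twice. In the sketch below the seed value 12 and the names r1, r2, r3 and u are arbitrary choices.

```r
set.seed(12)
r1 <- rnorm(5, mean = 10, sd = 2)

set.seed(12)                        # reset the generator to the same state
r2 <- rnorm(5, mean = 10, sd = 2)
identical(r1, r2)                   # TRUE: the same "random" numbers

r3 <- rnorm(5, mean = 10, sd = 2)   # without resetting, the numbers differ
identical(r1, r3)                   # FALSE

u <- runif(10)                      # uniform numbers always lie between 0 and 1
all(u >= 0 & u <= 1)                # TRUE
```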
Matrices
m <- matrix(1:6, ncol=3, nrow=2, byrow=TRUE) # Create an example matrix
print(m)
m*5 # multiply all values of m by 5
m*m # multiply two matrices element by element (use %*% for the matrix product)
m * 1:2 # We can also do math with a matrix and a vector
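The sketch below illustrates recycling of a shorter vector and the difference between element-wise multiplication (*) and the matrix product (%*%). The vector v and the name mp are illustrative assumptions.

```r
v <- c(1, 2, 3, 4)
v * c(10, 100)     # the shorter vector is recycled: 10 200 30 400

m <- matrix(1:6, ncol = 3, nrow = 2, byrow = TRUE)
m * m              # element by element: each value squared
m * 1:2            # 1:2 is recycled down the columns, so row 2 is doubled

mp <- m %*% t(m)   # matrix product of a 2 x 3 and a 3 x 2 matrix gives 2 x 2
print(mp)
```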
d$name[d$name == '???'] <- NA # to replace '???' with NA
table(d$name)
table(d$name, useNA='ifany') # To force table to also count the NA values.
d$name[9]
Note that there is one ‘empty’ value in the dataset, d$name[9]. Replace this ‘empty’ value
with NA (missing value)
d$name[d$name == ''] <- NA # to replace the empty value '' with NA
table(d[ c('name', 'score2')]) # to see a frequency table of two variables
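The cleaning steps can be run from start to finish with the sketch below. It recreates the data frame d from the course outline and, as a small variation that is not in the original script, replaces '???' and the empty string in one step using %in%.

```r
d <- data.frame(id = 1:10,
                name = c('Bob', 'Bobby', '???', 'Bob', 'Bab', 'Jim', 'Jim', 'jim', '', 'Jim'),
                score1 = c(8, 10, 7, 9, 2, 5, 1, 6, 3, 4),
                score2 = c(3, 4, 5, -999, 5, 5, -999, 2, 3, 4),
                stringsAsFactors = FALSE)

d$score2[d$score2 == -999] <- NA                 # -999 is a code for "missing"
d$name[d$name %in% c('Bab', 'Bobby')] <- 'Bob'   # harmonise spellings
d$name[d$name %in% 'jim'] <- 'Jim'
d$name[d$name %in% c('???', '')] <- NA           # unknown and empty names become NA

table(d$name, useNA = 'ifany')    # Bob 4, Jim 4, NA 2
mean(d$score2, na.rm = TRUE)      # 3.875
range(d$score2, na.rm = TRUE)     # 2 5
```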
df$major <- as.factor(df$major) # converting character to factor
df$laptop <- as.factor(df$laptop) # converting character to factor
df <- df[-1] # removing the first variable (column)
df
df$ID <- 1:nrow(df) # creating a new variable called ID
df
data.table::setcolorder(df, neworder = "ID") # move ID to the first column (requires the data.table package)
df
str(df)
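The steps above can be tried without the original dataset using the sketch below. The data frame and its values are entirely hypothetical (only the column names major and laptop follow the script), and the column reordering uses base R instead of data.table::setcolorder so that no extra package is needed.

```r
# Hypothetical data; only the column names major and laptop come from the script
df <- data.frame(row = c("r1", "r2", "r3"),
                 major = c("Stats", "Econ", "Stats"),
                 laptop = c("yes", "no", "yes"),
                 stringsAsFactors = FALSE)

df$major  <- as.factor(df$major)    # character to factor
df$laptop <- as.factor(df$laptop)
df <- df[-1]                        # drop the first column
df$ID <- 1:nrow(df)                 # new ID variable, added as the last column

# Base R alternative to setcolorder: put ID first, keep the rest in order
df <- df[c("ID", setdiff(names(df), "ID"))]
str(df)
```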