
R Language Workshop


R Programming

• R is a programming language for data analysis and statistics.
• It is free and very widely used by professional statisticians.
• R is a dynamically typed, interpreted language and is typically used interactively.
• It has many built-in functions and libraries and is extensible: users can define their own functions and procedures in R.
Features of R

• It is an open-source tool
• R supports Object-oriented as well as Procedural programming.
• It provides an environment for statistical computation and software
development.
• Provides extensive packages & libraries
• R has a wonderful community for people to share and learn from experts
• Numerous data sources to connect.

➢ R is built on the S programming language. It was originally intended as a language that would help students learn to program while working with data.
➢ R was developed by Ross Ihaka and Robert Gentleman in 1993.
➢ R is trusted not only in academia; many large companies also use it, including Uber, Google, Airbnb and Facebook.
➢ R is a case sensitive language

Applications of R Programming in Real World

• Data Science
• Statistical computing
• Machine Learning

Installation of R

Step 1 : Go to the link- https://cran.r-project.org/

Step 2 : Download and install R 3.3.3 on your system.

To download RStudio IDE, follow the below steps:

Step 1: Go to the link- https://www.rstudio.com/

Step 2: Download and install Rstudio on your system.


R Reserved Words

Reserved words in R programming are a set of words that have special meaning
and cannot be used as an identifier (variable name, function name etc.).
Here is a list of the reserved words in R's parser.

Reserved words in R

if else repeat while function

for in next break TRUE

FALSE NULL Inf NaN NA

NA_integer_ NA_real_ NA_complex_ NA_character_ …

Variables in R
Variables are used to store data whose value can be changed as needed. The unique name given to a variable (or to a function or other object) is called an identifier.

Rules for writing Identifiers in R

1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed
by a digit.
3. Reserved words in R cannot be used as identifiers.
Valid identifiers in R

total, Sum, .fine.with.dot, this_is_acceptable, Number5

Invalid identifiers in R

tot@l, 5um, _fine, TRUE, .0ne
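
The rules above can be checked programmatically. The sketch below is one possible approach, not part of the original workshop: it relies on base R's make.names(), which returns a name unchanged only when the name is already syntactically valid. The variable names candidates and is_valid are illustrative.

```r
# Candidate names taken from the valid and invalid lists above.
candidates <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable",
                "Number5", "tot@l", "5um", "_fine", "TRUE", ".0ne")

# make.names() repairs invalid names (and reserved words), so a name is
# valid exactly when make.names() leaves it untouched.
is_valid <- make.names(candidates) == candidates

# Show each candidate next to the result of the check.
data.frame(name = candidates, valid = is_valid)
```

The first five names come back as valid and the last five as invalid, matching the two lists.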

Data structures are very important to understand because these are the objects you will manipulate on a day-to-day basis in R.
Everything in R is an object.
R has 6 basic data types:

• character: "a", 'swc'
• numeric (real or decimal): 2, 15.5
• integer: 2L (the L tells R to store this as an integer)
• logical: TRUE, FALSE
• complex: 1+4i (complex numbers with real and imaginary parts)
• raw: as.raw(255) (raw bytes, rarely needed in everyday work)

Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. Basic
types of constant are numeric constants and character constants.
Numeric Constants

All numbers fall under this category. They can be of type integer, double or complex.
It can be checked with the typeof() function.
Numeric constants followed by L are regarded as integer and those followed by i are regarded as complex.

> typeof(5)

[1] "double"

> typeof(5L)

[1] "integer"

> typeof(5i)

[1] "complex"

Character Constants
Character constants can be represented using either single quotes (') or double
quotes (") as delimiters.

> 'example'

[1] "example"

> typeof("5")

[1] "character"
Built-in Constants

Some of the built-in constants defined in R, along with their values, are shown below.

> LETTERS

[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"

[20] "T" "U" "V" "W" "X" "Y" "Z"

> letters

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"

[20] "t" "u" "v" "w" "x" "y" "z"

> pi

[1] 3.141593

> month.name

[1] "January" "February" "March" "April" "May" "June"

[7] "July" "August" "September" "October" "November" "December"

> month.abb

[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
R Operators

R has many operators to carry out different mathematical and logical operations.
Operators in R can mainly be classified into the following categories.

R Arithmetic Operators
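These operators are used to carry out mathematical operations like addition and
multiplication. Here is a list of arithmetic operators available in R.

Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponent
%% Modulus (remainder from division)
%/% Integer division

An example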

> x <- 5

> y <- 16

> x+y

[1] 21

> x-y

[1] -11

> x*y

[1] 80

> y/x

[1] 3.2
> y%/%x

[1] 3

> y%%x

[1] 1

> y^x

[1] 1048576

R Relational Operators
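Relational operators are used to compare values. Here is a list of relational
operators available in R.

Operator Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

An example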

> x <- 5

> y <- 16
> x<y

[1] TRUE

> x>y

[1] FALSE

> x<=5

[1] TRUE

> y>=20

[1] FALSE

> y == 16

[1] TRUE

> x != 5

[1] FALSE

R Logical Operators
Logical operators are used to carry out Boolean operations like AND, OR etc.
Logical Operators in R

Operator Description

! Logical NOT

& Element-wise logical AND

&& Logical AND

| Element-wise logical OR

|| Logical OR

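The operators & and | perform element-wise operations, producing a result with the length of the longer operand.
The operators && and || work on single values and return a single-element logical vector. Older versions of R examined only the first element of longer operands, which is the behavior shown in the example below; R 4.3.0 and later raise an error when an operand has more than one element.
Zero is considered FALSE and non-zero numbers are taken as TRUE. An example run: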

> x <- c(TRUE,FALSE,0,6)


> y <- c(FALSE,TRUE,FALSE,TRUE)

> !x

[1] FALSE TRUE TRUE FALSE

> x&y

[1] FALSE FALSE FALSE TRUE

> x&&y

[1] FALSE

> x|y

[1] TRUE TRUE FALSE TRUE

> x||y

[1] TRUE

R Assignment Operators
These operators are used to assign values to variables.
The operators <- and = can be used, almost interchangeably, to assign to variable
in the same environment.
The <<- operator is used for assigning to variables in the parent environments
(more like global assignments). The rightward assignments, although available are
rarely used.

> x <- 5

> x

[1] 5

> x = 9

> x

[1] 9

> 10 -> x

> x

[1] 10
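
The difference between <- and <<- only shows up inside a function. The following is a minimal sketch, with illustrative names (counter, increment_local, increment_global) that are not from the original material, assuming the code is run at the top level of a session or script.

```r
counter <- 0

increment_local <- function() {
  # <- creates a new variable that is local to the function call;
  # the outer counter is left untouched.
  counter <- counter + 1
  counter
}

increment_global <- function() {
  # <<- looks for an existing counter in the enclosing environments
  # and modifies that one instead of creating a local copy.
  counter <<- counter + 1
  counter
}

increment_local()    # returns 1, outer counter is still 0
increment_global()   # returns 1, outer counter is now 1
counter
```

Because <<- changes state outside the function, it is usually reserved for cases such as counters or caches where that side effect is intended.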

Operator Precedence

(2 + 6) * 5

[1] 40

Precedence and associativity of different operators in R, from highest to lowest

Operator Precedence in R

Operator                 Description                 Associativity
^                        Exponent                    Right to Left
-x, +x                   Unary minus, Unary plus     Left to Right
%%                       Modulus                     Left to Right
*, /                     Multiplication, Division    Left to Right
+, -                     Addition, Subtraction       Left to Right
<, >, <=, >=, ==, !=     Comparisons                 Left to Right
!                        Logical NOT                 Left to Right
&, &&                    Logical AND                 Left to Right
|, ||                    Logical OR                  Left to Right
->, ->>                  Rightward assignment        Left to Right
<-, <<-                  Leftward assignment         Right to Left
=                        Leftward assignment         Right to Left

R Program to Take Input From User

When we are working with R in an interactive session, we can use the readline() function to take input from the user (terminal).
This function will return a single element character vector.
So, if we want numbers, we need to do appropriate conversions.

Example: Take input from user

my.name <- readline(prompt="Enter name: ")

my.age <- readline(prompt="Enter age: ")

# convert character into integer

my.age <- as.integer(my.age)


print(paste("Hi,", my.name, "next year you will be", my.age+1, "years old."))

Output

Enter name: Mary

Enter age: 17

[1] "Hi, Mary next year you will be 18 years old."

R provides many functions to examine features of vectors and other objects, for
example

• class() - what kind of object is it (high-level)?


• typeof() - what is the object’s data type (low-level)?
• length() - how long is it? What about two dimensional objects?
• attributes() - does it have any metadata?
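
As a short illustration of these four functions, the sketch below inspects a named vector and a matrix. The objects v and m are made up for this example. For a two-dimensional object, length() counts all elements, while dim() gives the number of rows and columns.

```r
# A named double vector.
v <- c(a = 1.5, b = 2.5, c = 3.5)

class(v)        # high-level kind of object: "numeric"
typeof(v)       # low-level storage type: "double"
length(v)       # number of elements: 3
attributes(v)   # metadata: here only the names "a", "b", "c"

# A 2 x 3 integer matrix.
m <- matrix(1:6, nrow = 2)

length(m)       # total number of elements: 6
dim(m)          # rows and columns: 2 3
typeof(m)       # "integer"
```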

Objects Attributes

Objects can have attributes. Attributes are part of the object. These include:

• names
• dimnames
• dim
• class
• attributes (contain metadata)
R DATA STRUCTURES

• R Vectors
• R Matrix
• R array
• List in R
• R Data Frame
• R Factor

R Vector

A vector is the basic data structure in R. It contains elements of the same type. The data types can be logical, integer, double, complex, character or raw; these are the six types of atomic vectors.
A vector's type can be checked with the typeof() function.
Another important property of a vector is its length. This is the number of elements in the vector and can be checked with the function length().
How to Create Vector in R?
Vectors are generally created using the c() function.
Since a vector must have elements of the same type, this function will try to coerce the elements to a common type if they differ.
Coercion is from lower to higher types: from logical to integer to double to character.

> x <- c(1, 5, 4, 9, 0)


> typeof(x)

[1] "double"

> length(x)

[1] 5

> x <- c(1, 5.4, TRUE, "hello")

>x

[1] "1" "5.4" "TRUE" "hello"

> typeof(x)

[1] "character"

Example 1: Creating a vector using : operator

> x <- 1:7; x

[1] 1 2 3 4 5 6 7

> y <- 2:-2; y

[1] 2 1 0 -1 -2

Example 2: Creating a vector using seq() function

> seq(1, 3, by=0.2) # specify step size

[1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
> seq(1, 5, length.out=4) # specify length of the vector

[1] 1.000000 2.333333 3.666667 5.000000

How to access Elements of a Vector?


Elements of a vector can be accessed using vector indexing. The vector used for
indexing can be logical, integer or character vector.

Using integer vector as index

Vector indices in R start from 1, unlike most programming languages, where indices start from 0.
We can use a vector of integers as an index to access specific elements.
We can also use negative integers to return all elements except those specified.
But we cannot mix positive and negative integers while indexing, and real numbers, if used, are truncated to integers.

>x

[1] 0 2 4 6 8 10

> x[3] # access 3rd element

[1] 4

> x[c(2, 4)] # access 2nd and 4th element

[1] 2 6

> x[-1] # access all but 1st element

[1] 2 4 6 8 10
> x[c(2, -4)] # cannot mix positive and negative integers

Error in x[c(2, -4)] : only 0's may be mixed with negative subscripts

> x[c(2.4, 3.54)] # real numbers are truncated to integers

[1] 2 4

Using logical vector as index


When we use a logical vector for indexing, the position where the logical vector
is TRUE is returned.
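
A minimal sketch of logical indexing, using the same vector as in the integer-index example; the name keep is illustrative. The logical vector is usually produced by a comparison rather than typed by hand.

```r
x <- c(0, 2, 4, 6, 8, 10)

# Positions marked TRUE are returned: elements 1, 4 and 6.
x[c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)]   # 0 6 10

# A comparison produces a logical vector of the same length as x.
keep <- x > 4
x[keep]                                       # 6 8 10

# Conditions can be combined with the element-wise operators & and |.
x[x < 3 | x > 8]                              # 0 2 10
```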

Using character vector as index


This type of indexing is useful when dealing with named vectors. We can name each element of a vector.

> x <- c("first"=3, "second"=0, "third"=9)

> names(x)

[1] "first" "second" "third"

> x["second"]

second
     0

> x[c("first", "third")]

first third
    3     9

How to modify a vector in R?


We can modify a vector using the assignment operator.
We can use the techniques discussed above to access specific elements and modify
them.
If we want to truncate the elements, we can use reassignments.

>x

[1] -3 -2 -1 0 1 2

> x[2] <- 0; x # modify 2nd element

[1] -3 0 -1 0 1 2

> x[x<0] <- 5; x # modify elements less than 0

[1] 5 0 5 0 1 2

> x <- x[1:4]; x # truncate x to first 4 elements

[1] 5 0 5 0
How to delete a Vector?
We can delete the contents of a vector by simply assigning NULL to it. (To remove the variable itself from the workspace, use rm(x).)

>x

[1] -3 -2 -1 0 1 2

> x <- NULL

>x

NULL

> x[4]

NULL

Missing Data

R supports missing data in vectors. Missing values are represented as NA (Not Available) and can be used for all the vector types covered in this lesson:
x <- c(0.5, NA, 0.7)
x <- c(TRUE, FALSE, NA)
x <- c("a", NA, "c", "d", "e")
x <- c(1+5i, 2-3i, NA)
The function is.na() indicates the elements of the vectors that represent missing
data, and the function anyNA() returns TRUE if the vector contains any missing
values:
x <- c("a", NA, "c", "d", NA)
y <- c("a", "b", "c", "d", "e")
is.na(x)
[1] FALSE TRUE FALSE FALSE TRUE
is.na(y)
[1] FALSE FALSE FALSE FALSE FALSE
anyNA(x)
[1] TRUE
anyNA(y)
[1] FALSE

Example: Vector Elements Arithmetic

> sum(2,7,5)

[1] 14

>x

[1] 2 NA 3 1 4

> sum(x) # if any element is NA or NaN, result is NA or NaN

[1] NA

> sum(x, na.rm=TRUE) # this way we can ignore NA and NaN values

[1] 10

> mean(x, na.rm=TRUE)

[1] 2.5

> prod(x, na.rm=TRUE)

[1] 24
Other Special Values

Inf is infinity. You can have either positive or negative infinity.


1/0
[1] Inf
NaN means Not a Number. It’s an undefined value.
0/0
[1] NaN
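
The special values can be told apart with base R's test functions. The sketch below is illustrative; the vector x is made up for this example. Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.

```r
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE

-1 / 0          # -Inf
Inf - Inf       # NaN: the result is undefined
```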

R Matrix

A matrix is a two-dimensional data structure in R.
A matrix is similar to a vector but additionally contains the dimension attribute. All attributes of an object can be checked with the attributes() function (the dimensions can be checked directly with the dim() function).
We can check whether a variable is a matrix with the class() function. The class() outputs in this section are from R 3.x; in R 4.0.0 and later, class() of a matrix returns "matrix" "array".

>a

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> class(a)

[1] "matrix"
> attributes(a)

$dim

[1] 3 3

> dim(a)

[1] 3 3

How to create a matrix in R programming?


A matrix can be created using the matrix() function.
The dimensions of the matrix can be defined by passing appropriate values for the arguments nrow and ncol.
Providing values for both dimensions is not necessary. If one of the dimensions is provided, the other is inferred from the length of the data.

> matrix(1:9, nrow = 3, ncol = 3)

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> # same result is obtained by providing only one dimension

> matrix(1:9, nrow = 3)


[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

We can see that the matrix is filled column-wise. This can be reversed to row-wise
filling by passing TRUE to the argument byrow.

> matrix(1:9, nrow=3, byrow=TRUE) # fill matrix row-wise

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

[3,] 7 8 9

In all cases, however, a matrix is stored in column-major order internally, as we will see in the subsequent sections.
It is possible to name the rows and columns of a matrix during creation by passing a two-element list to the argument dimnames.

> x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))

> x

  A B C
X 1 4 7
Y 2 5 8
Z 3 6 9

These names can be accessed or changed with two helpful functions, colnames() and rownames().

> colnames(x)

[1] "A" "B" "C"

> rownames(x)

[1] "X" "Y" "Z"

> # It is also possible to change names

> colnames(x) <- c("C1","C2","C3")

> rownames(x) <- c("R1","R2","R3")

>x

C1 C2 C3

R1 1 4 7

R2 2 5 8

R3 3 6 9

Another way of creating a matrix is by using the functions cbind() and rbind(), as in column bind and row bind.

> cbind(c(1,2,3),c(4,5,6))

[,1] [,2]

[1,] 1 4

[2,] 2 5

[3,] 3 6

> rbind(c(1,2,3),c(4,5,6))

[,1] [,2] [,3]

[1,] 1 2 3

[2,] 4 5 6

Finally, you can also create a matrix from a vector by setting its dimension
using dim().

> x <- c(1,2,3,4,5,6)

>x

[1] 1 2 3 4 5 6

> class(x)

[1] "numeric"

> dim(x) <- c(2,3)


>x

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> class(x)

[1] "matrix"

How to access Elements of a matrix?


We can access elements of a matrix using the square bracket [ indexing method.
Elements can be accessed as var[row, column]. Here rows and columns are
vectors.
Using integer vector as index

We specify the row numbers and column numbers as vectors and use it for
indexing.
If any field inside the bracket is left blank, it selects all.
We can use negative integers to specify rows or columns to be excluded.

>x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8
[3,] 3 6 9

> x[c(1,2),c(2,3)] # select rows 1 & 2 and columns 2 & 3

[,1] [,2]

[1,] 4 7

[2,] 5 8

> x[c(3,2),] # leaving column field blank will select entire columns

[,1] [,2] [,3]

[1,] 3 6 9

[2,] 2 5 8

> x[,] # leaving row as well as column field blank will select entire matrix

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> x[-1,] # select all rows except first

[,1] [,2] [,3]

[1,] 2 5 8
[2,] 3 6 9

One thing to notice here is that, if the matrix returned after indexing is a row
matrix or column matrix, the result is given as a vector.

> x[1,]

[1] 1 4 7

> class(x[1,])

[1] "integer"

This behavior can be avoided by using the argument drop = FALSE while
indexing.

> x[1,,drop=FALSE] # now the result is a 1X3 matrix rather than a vector

[,1] [,2] [,3]

[1,] 1 4 7

> class(x[1,,drop=FALSE])

[1] "matrix"

It is possible to index a matrix with a single vector.


While indexing in such a way, it acts like a vector formed by stacking columns of
the matrix one after another. The result is returned as a vector.

>x

[,1] [,2] [,3]


[1,] 4 8 3

[2,] 6 0 7

[3,] 1 2 9

> x[1:4]

[1] 4 6 1 8

> x[c(3,5,7)]

[1] 1 0 3

Using logical vector as index

Two logical vectors can be used to index a matrix. In such situation, rows and
columns where the value is TRUE is returned. These indexing vectors are recycled
if necessary and can be mixed with integer vectors.

>x

[,1] [,2] [,3]

[1,] 4 8 3

[2,] 6 0 7

[3,] 1 2 9

> x[c(TRUE,FALSE,TRUE),c(TRUE,TRUE,FALSE)]
[,1] [,2]

[1,] 4 8

[2,] 1 2

> x[c(TRUE,FALSE),c(2,3)] # the 2 element logical vector is recycled to a 3 element vector

[,1] [,2]

[1,] 8 3

[2,] 2 9

It is also possible to index using a single logical vector where recycling takes place
if necessary.

> x[c(TRUE, FALSE)]

[1] 4 1 0 3 9

In the above example, the matrix x is treated as vector formed by stacking columns
of the matrix one after another, i.e., (4,6,1,8,0,2,3,7,9).
The indexing logical vector is also recycled and thus alternating elements are
selected. This property is utilized for filtering of matrix elements as shown below.

> x[x>5] # select elements greater than 5

[1] 6 8 7 9

> x[x%%2 == 0] # select even elements


[1] 4 6 8 0 2

Using character vector as index

Indexing with character vector is possible for matrix with named row or column.
This can be mixed with integer or logical indexing.

> x

     A B C
[1,] 4 8 3
[2,] 6 0 7
[3,] 1 2 9

> x[,"A"]

[1] 4 6 1

> x[TRUE,c("A","C")]

     A C
[1,] 4 3
[2,] 6 7
[3,] 1 9

> x[2:3,c("A","C")]

     A C
[1,] 6 7
[2,] 1 9

How to modify a matrix in R?


We can combine assignment operator with the above learned methods for
accessing elements of a matrix to modify it.

>x

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 5 8

[3,] 3 6 9

> x[2,2] <- 10; x # modify a single element

[,1] [,2] [,3]

[1,] 1 4 7

[2,] 2 10 8

[3,] 3 6 9

> x[x<5] <- 0; x # modify elements less than 5


[,1] [,2] [,3]

[1,] 0 0 7

[2,] 0 10 8

[3,] 0 6 9

A common operation with matrix is to transpose it. This can be done with the
function t().

> t(x) # transpose a matrix

[,1] [,2] [,3]

[1,] 0 0 0

[2,] 0 10 6

[3,] 7 8 9

We can add row or column using rbind() and cbind() function respectively.
Similarly, it can be removed through reassignment.

> cbind(x, c(1, 2, 3)) # add column

[,1] [,2] [,3] [,4]

[1,] 0 0 7 1

[2,] 0 10 8 2

[3,] 0 6 9 3
> rbind(x,c(1,2,3)) # add row

[,1] [,2] [,3]

[1,] 0 0 7

[2,] 0 10 8

[3,] 0 6 9

[4,] 1 2 3

> x <- x[1:2,]; x # remove last row

[,1] [,2] [,3]

[1,] 0 0 7

[2,] 0 10 8

Dimension of matrix can be modified as well, using the dim() function.

>x

[,1] [,2] [,3]

[1,] 1 3 5

[2,] 2 4 6

> dim(x) <- c(3,2); x # change to 3X2 matrix

[,1] [,2]
[1,] 1 4

[2,] 2 5

[3,] 3 6

> dim(x) <- c(1,6); x # change to 1X6 matrix

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 1 2 3 4 5 6

Arrays:

Arrays are R data objects that can store data in more than two dimensions. The array() function takes vectors as input and uses the values in the dim parameter to create an array.

R Array Syntax
Array_NAME <- array(data, dim = c(row_Size, column_Size, matrices), dimnames = dimnames)

vector1 <- c(5,9,3)

vector2 <- c(10,11,12,13,14,15)

result <- array(c(vector1,vector2),dim = c(3,3,2))

print(result)

Output

, , 1

     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15

, , 2

     [,1] [,2] [,3]
[1,]    5   10   13
[2,]    9   11   14
[3,]    3   12   15
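
Array elements are accessed with one index per dimension, in the order row, column, matrix. The sketch below rebuilds the same array with dimnames so that names can be used as well; the labels r1..r3, c1..c3 and m1, m2 are made up for this illustration.

```r
# Same data as above: 9 values, recycled to fill both 3 x 3 matrices.
result <- array(c(5, 9, 3, 10:15), dim = c(3, 3, 2),
                dimnames = list(c("r1", "r2", "r3"),
                                c("c1", "c2", "c3"),
                                c("m1", "m2")))

result[2, 3, 1]         # row 2, column 3 of the first matrix: 14
result["r1", , "m2"]    # first row of the second matrix: 5 10 13
dim(result[, , 1])      # a single slice is a 3 x 3 matrix
```

Leaving an index blank selects everything along that dimension, just as with matrices.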

R Lists

A list is a data structure having components of mixed data types.
A vector whose elements all have the same type is called an atomic vector, while a vector whose elements can have different types is called a list.
We can check whether an object is a list with the typeof() function and find its length using length().
Here is an example of a list having three components, each of a different data type.

>x

$a

[1] 2.5

$b

[1] TRUE

$c

[1] 1 2 3

> typeof(x)

[1] "list"
> length(x)

[1] 3

How to create a list in R programming?


List can be created using the list() function.

> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)

Here, we create a list x of three components with data types double, logical and integer vector respectively.
Its structure can be examined with the str() function.

> str(x)

List of 3

$ a: num 2.5

$ b: logi TRUE

$ c: int [1:3] 1 2 3

In this example, a, b and c are called tags which makes it easier to reference the
components of the list.
However, tags are optional. We can create the same list without the tags as follows.
In such scenario, numeric indices are used by default.

> x <- list(2.5,TRUE,1:3)

>x
[[1]]

[1] 2.5

[[2]]

[1] TRUE

[[3]]

[1] 1 2 3

How to access components of a list?


Lists can be accessed in similar fashion to vectors. Integer, logical or character
vectors can be used for indexing. Let us consider a list as follows.

>x

$name

[1] "John"

$age

[1] 19

$speaks

[1] "English" "French"

> x[c(1:2)] # index using integer vector


$name

[1] "John"

$age

[1] 19

> x[-2] # using negative integer to exclude second component

$name

[1] "John"

$speaks

[1] "English" "French"

> x[c(T,F,F)] # index using logical vector

$name

[1] "John"

> x[c("age","speaks")] # index using character vector

$age

[1] 19

$speaks

[1] "English" "French"


Indexing with [ as shown above will give us sublist not the content inside the
component. To retrieve the content, we need to use [[.
However, this approach will allow us to access only a single component at a time.

> x["age"]

$age

[1] 19

> typeof(x["age"]) # single [ returns a list

[1] "list"

> x[["age"]] # double [[ returns the content

[1] 19

> typeof(x[["age"]])

[1] "double"

An alternative to [[, which is often used while accessing the content of a list, is the $ operator. They are both the same except that $ can do partial matching on tags.

> x$name # same as x[["name"]]

[1] "John"

> x$a # partial matching, same as x$ag or x$age

[1] 19
> x[["a"]] # cannot do partial match with [[

NULL

> # indexing can be done recursively

> x$speaks[1]

[1] "English"

> x[["speaks"]][2]

[1] "French"

How to modify a list in R?


We can change components of a list through reassignment. We can choose any of
the component accessing techniques discussed above to modify it.
Notice below that modifying an existing component keeps it in its original position.

> x[["name"]] <- "Clair"; x

$name
[1] "Clair"

$age
[1] 19

$speaks
[1] "English" "French"

How to add components to a list?

Adding new components is easy. We simply assign a value using a new tag, and the component is appended to the end of the list.

> x[["married"]] <- FALSE

> x

$name
[1] "Clair"

$age
[1] 19

$speaks
[1] "English" "French"

$married
[1] FALSE

How to delete components from a list?

We can delete a component by assigning NULL to it.

> x[["age"]] <- NULL

> str(x)

List of 3
 $ name   : chr "Clair"
 $ speaks : chr [1:2] "English" "French"
 $ married: logi FALSE

> x$married <- NULL

> str(x)

List of 2
 $ name  : chr "Clair"
 $ speaks: chr [1:2] "English" "French"

R Data Frame

• A data frame is a two-dimensional data structure in R. It is a special case of a list in which every component has equal length.
• Each component forms a column, and the contents of the components form the rows.

Useful Data Frame Functions

• head() - shows first 6 rows


• tail() - shows last 6 rows
• dim() - returns the dimensions of data frame (i.e. number of rows and
number of columns)
• nrow() - number of rows
• ncol() - number of columns
• str() - structure of data frame - name, type and preview of data in each
column
• names() or colnames() - both show the names attribute for a data frame
• sapply(dataframe, class) - shows the class of each column in the data frame

How to create a Data Frame in R?

We can create a data frame using the data.frame() function.
For example, a small data frame can be created as follows.

> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))

> str(x) # structure of x

'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1

Notice above that the third column, Name, is of type factor instead of a character vector.
In R versions before 4.0.0, the data.frame() function converts character vectors into factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE and Name stays a character vector.
To control this behavior explicitly, we can pass the argument stringsAsFactors.

> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John", "Dora"), stringsAsFactors = FALSE)

> str(x) # now the third column is a character vector

'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: chr "John" "Dora"

Check if a variable is a data frame or not

We can check whether a variable is a data frame using the class() function.

> x

  SN Age Name
1  1  21 John
2  2  15 Dora

> typeof(x) # data frame is a special case of list

[1] "list"

> class(x)

[1] "data.frame"

In this example, x can be considered as a list of 3 components, each component being a two-element vector. Some useful functions to know more about a data frame are given below.

Functions of data frame

> names(x)

[1] "SN" "Age" "Name"

> ncol(x)

[1] 3

> nrow(x)

[1] 2

> length(x) # returns length of the list, same as ncol()

[1] 3

How to access Components of a Data Frame?

Components of a data frame can be accessed like a list or like a matrix.

Accessing like a list

We can use either the [, [[ or $ operator to access columns of a data frame.

> x["Name"]

  Name
1 John
2 Dora

> x$Name

[1] "John" "Dora"

> x[["Name"]]

[1] "John" "Dora"

> x[[3]]

[1] "John" "Dora"

Accessing with [[ or $ is similar. However, [ differs in that indexing with [ returns a data frame, while the other two reduce the result to a vector.

Accessing like a matrix

Data frames can be accessed like a matrix by providing an index for row and column.
To illustrate this, we use datasets already available in R. The available datasets can be listed with the command library(help = "datasets").
We will use the trees dataset, which contains Girth, Height and Volume for Black Cherry Trees.
A data frame can be examined using functions like str() and head().

> str(trees)

'data.frame': 31 obs. of 3 variables:
 $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
 $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
 $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

> head(trees,n=3)

  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2

We can see that trees is a data frame with 31 rows and 3 columns. We also display the first 3 rows of the data frame.
Now we proceed to access the data frame like a matrix.

> trees[2:3,] # select 2nd and 3rd row

  Girth Height Volume
2   8.6     65   10.3
3   8.8     63   10.2

> trees[trees$Height > 82,] # selects rows with Height greater than 82

   Girth Height Volume
6   10.8     83   19.7
17  12.9     85   33.8
18  13.3     86   27.4
31  20.6     87   77.0

> trees[10:12,2]

[1] 75 79 76

We can see in the last case that the returned type is a vector, since we extracted data from a single column.
This behavior can be avoided by passing the argument drop = FALSE as follows.

> trees[10:12,2, drop = FALSE]

   Height
10     75
11     79
12     76
How to modify a Data Frame in R?

Data frames can be modified through reassignment, just as we modified matrices.

> x

  SN Age Name
1  1  21 John
2  2  15 Dora

> x[1,"Age"] <- 20; x

  SN Age Name
1  1  20 John
2  2  15 Dora

Adding Components

Rows can be added to a data frame using the rbind() function.

> rbind(x,list(1,16,"Paul"))

  SN Age Name
1  1  20 John
2  2  15 Dora
3  1  16 Paul

Similarly, we can add columns using cbind().

> cbind(x,State=c("NY","FL"))

  SN Age Name State
1  1  20 John    NY
2  2  15 Dora    FL

Since data frames are implemented as lists, we can also add new columns through simple list-like assignments.

> x

  SN Age Name
1  1  20 John
2  2  15 Dora

> x$State <- c("NY","FL"); x

  SN Age Name State
1  1  20 John    NY
2  2  15 Dora    FL

Deleting Component

A data frame column can be deleted by assigning NULL to it.

> x$State <- NULL

> x

  SN Age Name
1  1  20 John
2  2  15 Dora

Similarly, rows can be deleted through reassignment.

> x <- x[-1,]

> x

  SN Age Name
2  2  15 Dora

R Factors

A factor is a data structure used for fields that take only a predefined, finite number of values (categorical data). For example, a data field such as marital status may contain only values from single, married, separated, divorced, or widowed.
In such a case, we know the possible values beforehand, and these predefined, distinct values are called levels. The following is an example of a factor in R.

>x

[1] single married married single


Levels: married single

Here, we can see that factor x has four elements and two levels. We can check if a
variable is a factor or not using class() function.
Similarly, levels of a factor can be checked using the levels() function.

> class(x)

[1] "factor"

> levels(x)

[1] "married" "single"

How to create a factor in R?


We can create a factor using the function factor(). Levels of a factor are inferred
from the data if not provided.

> x <- factor(c("single", "married", "married", "single"))

>x

[1] single married married single

Levels: married single

> x <- factor(c("single", "married", "married", "single"), levels = c("single",


"married", "divorced"));

>x
[1] single married married single

Levels: single married divorced

We can see from the above example that levels may be predefined even if not used.
Factors are closely related with vectors. In fact, factors are stored as integer
vectors. This is clearly seen from its structure.

> x <- factor(c("single","married","married","single"))

> str(x)

Factor w/ 2 levels "married","single": 2 1 1 2

We see that levels are stored in a character vector and the individual elements are
actually stored as indices.
Factors can also be created when we read non-numerical columns into a data frame.
In R versions before 4.0.0, the data.frame() function converts character vectors into factors by default. To suppress this behavior, we have to pass the argument stringsAsFactors = FALSE; since R 4.0.0 this is the default.
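
The integer storage described above can be inspected directly. This is a small illustrative sketch; the variable name status is not from the original material.

```r
status <- factor(c("single", "married", "married", "single"))

# The underlying integer codes index into the levels vector.
as.integer(status)                   # 2 1 1 2
levels(status)                       # "married" "single"

# Looking the codes up in the levels recovers the original labels.
levels(status)[as.integer(status)]   # "single" "married" "married" "single"

# table() counts how often each level occurs.
table(status)
```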

How to access components of a factor?

Accessing components of a factor is very similar to accessing components of a vector.

>x

[1] single married married single

Levels: married single

> x[3] # access 3rd element


[1] married

Levels: married single

> x[c(2, 4)] # access 2nd and 4th element

[1] married single

Levels: married single

> x[-1] # access all but 1st element

[1] married married single

Levels: married single

> x[c(TRUE, FALSE, FALSE, TRUE)] # using logical vector

[1] single single

Levels: married single

How to modify a factor?


Components of a factor can be modified using simple assignments. However, we
cannot choose values outside of its predefined levels.

>x

[1] single married married single

Levels: single married divorced


> x[3] <- "widowed" # cannot assign values outside levels

Warning message:

In `[<-.factor`(`*tmp*`, 3, value = "widowed") :

invalid factor level, NA generated

>x

[1] single married <NA> single

Levels: single married divorced

A workaround to this is to add the value to the level first.

> levels(x) <- c(levels(x), "widowed") # add new level

> x[3] <- "widowed"

>x

[1] single married widowed single

Levels: single married divorced widowed


R Flow Control

R if statement
The syntax of if statement is:

if (test_expression) {

statement

If test_expression is TRUE, the statement is executed. If it is FALSE,
nothing happens.
Here, test_expression can be a logical or numeric value. In older versions of R, if a
vector with more than one element is supplied, only the first element is used (with a
warning); since R 4.2.0 this is an error, so the condition should have length one.
For a numeric value, zero is taken as FALSE and any other number as TRUE.
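As a brief illustration of these rules, the sketch below tests a numeric condition and then reduces a vector condition to a single value with any() and all() before using it in if. The variable names are illustrative and not part of the workshop material.

```r
# A numeric condition: zero counts as FALSE, any other number as TRUE
n_items <- 0
if (n_items) {
  status <- "has items"
} else {
  status <- "empty"
}
status

# A vector condition should be reduced to one TRUE/FALSE value first
readings <- c(3, -1, 4)
has_negative <- any(readings < 0)   # TRUE if at least one element is negative
all_positive <- all(readings > 0)   # TRUE only if every element is positive
if (has_negative) print("At least one reading is negative")
```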
Flowchart of if statement

Example: if statement

x <- 5

if(x > 0){
  print("Positive number")
}

Output
[1] "Positive number"

if…else statement
The syntax of if…else statement is:

if (test_expression) {
  statement1
} else {
  statement2
}

The else part is optional and is only evaluated if test_expression is FALSE.


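It is important to note that else must be on the same line as the closing brace of the
if block. Otherwise, at the top level, R treats the if statement as complete and
reports an error when it reaches else.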
Flowchart of if…else statement
Example of if…else statement

x <- -5

if(x >= 0){
  print("Non-negative number")
} else {
  print("Negative number")
}

Output

[1] "Negative number"

The above conditional can also be written in a single line as follows.

if(x >= 0) print("Non-negative number") else print("Negative number")


Because if…else returns a value in R, we can also write constructs such as the one shown below.

> x <- -5

> y <- if(x > 0) 5 else 6

> y

[1] 6

if…else Ladder
The if…else ladder (if…else if…else) statement allows you to execute one block of code
among more than two alternatives.
The syntax of the if…else ladder is:

if ( test_expression1) {

statement1

} else if ( test_expression2) {

statement2

} else if ( test_expression3) {

statement3

} else {

statement4
}

Only one of the statements is executed: the test expressions are checked from top to
bottom, and the first one that is TRUE selects its block.
Example of if…else ladder

x <- 0

if (x < 0) {

print("Negative number")

} else if (x > 0) {

print("Positive number")

} else {
  print("Zero")
}
Output

[1] "Zero"

R ifelse() Function

Syntax of ifelse() function

ifelse(test_expression, x, y)

Here, test_expression must be a logical vector (or an object that can be coerced to
logical). The return value is a vector with the same length as test_expression.
Each element of the returned vector is taken from x where the corresponding value
of test_expression is TRUE, and from y where it is FALSE.
That is, the i-th element of the result is x[i] if test_expression[i] is TRUE;
otherwise it is y[i].
The vectors x and y are recycled whenever necessary.

Example: ifelse() function

> a = c(5,7,2,9)

> ifelse(a %% 2 == 0,"even","odd")

[1] "odd" "odd" "even" "odd"

In the above example, the test_expression is a %% 2 == 0, which evaluates to the
vector (FALSE, FALSE, TRUE, FALSE).
The other two arguments are recycled
to ("even","even","even","even") and ("odd","odd","odd","odd") respectively,
and the result is picked from them element by element.
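The element-by-element selection also works when x and y are full-length vectors rather than single values. The following sketch uses made-up expressions for the two branches and also shows that an NA in the test gives an NA in the result.

```r
a <- c(5, 7, 2, 9)

# Both branches are vectors of the same length as the test:
# halve the even numbers, map the odd numbers to 3 * a + 1
stepped <- ifelse(a %% 2 == 0, a / 2, a * 3 + 1)
stepped

# An NA in the test produces an NA in the corresponding position
with_na <- ifelse(c(TRUE, NA, FALSE), "yes", "no")
with_na
```

Note that ifelse() works on whole vectors, whereas if…else expects a single TRUE or FALSE.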

Switch Statements:

• These control statements are used to compare an expression against a set of
known values. Refer to the flowchart below to get a better understanding:

In this switch case flowchart, the code responds in the following steps:

1. First of all it will enter the switch case which has an expression.
2. Next it will go to Case 1 condition, checks the value passed to the condition.
If it is true, Statement block will execute. After that, it will break from that
switch case.
3. In case it is false, then it will switch to the next case. If Case 2 condition is
true, it will execute the statement and break from that case, else it will again
jump to the next case.
4. Now let’s say you have not specified any case or there is some wrong input
from the user, then it will go to the default case where it will print your default
statement.

Below is an example of a switch statement in R. Try running this example in RStudio.

vtr <- c(150,200,250,300,350,400)

option <- "mean"
switch(option,
  "mean" = print(mean(vtr)),
  # note: mode() reports the storage mode ("numeric"), not the statistical mode
  "mode" = print(mode(vtr)),
  "median" = print(median(vtr))
)

Output :

[1] 275
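The default case described in the steps above is written in R as a final unnamed argument to switch(). The sketch below wraps switch() in a hypothetical helper, summarise_by() (not part of the workshop), and also shows that an empty alternative falls through to the next one. In R, switch() is a function that returns the value of the selected alternative, so no explicit break is needed.

```r
# Hypothetical helper: pick a summary statistic by name
summarise_by <- function(option, values) {
  switch(option,
         avg    = ,                  # empty alternative: falls through to "mean"
         mean   = mean(values),
         median = median(values),
         "unknown option")           # unnamed last argument acts as the default
}

vtr <- c(150, 200, 250, 300, 350, 400)

summarise_by("mean", vtr)
summarise_by("avg", vtr)      # same result as "mean"
summarise_by("median", vtr)
summarise_by("range", vtr)    # no match, so the default is returned
```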

For loop:
A for loop is used to iterate over a vector in R programming.

Syntax of for loop

for (val in sequence) {
  statement
}

Here, sequence is a vector and val takes on each of its values in turn during the
loop. In each iteration, statement is evaluated.

Flowchart of for loop

Example: for loop


Below is an example to count the number of even numbers in a vector.

x <- c(2,5,3,9,8,11,6)

count <- 0

for (val in x) {

if(val %% 2 == 0) count = count+1

}
print(count)

Output

[1] 3
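For comparison, the same count can be obtained without an explicit loop, because arithmetic and comparison operators in R work on whole vectors. This is a sketch of an alternative, not a replacement for the loop shown above; is_even is an illustrative name.

```r
x <- c(2, 5, 3, 9, 8, 11, 6)

is_even <- x %% 2 == 0   # logical vector, one entry per element of x
count <- sum(is_even)    # TRUE counts as 1, FALSE as 0
count

x[is_even]               # the even elements themselves
```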

Example: Find the factorial of a number

# take input from the user
num = as.integer(readline(prompt="Enter a number: "))

factorial = 1

# check whether the number is negative, positive or zero
if(num < 0) {
  print("Sorry, factorial does not exist for negative numbers")
} else if(num == 0) {
  print("The factorial of 0 is 1")
} else {
  for(i in 1:num) {
    factorial = factorial * i
  }
  print(paste("The factorial of", num ,"is",factorial))
}

Output

Enter a number: 8

[1] "The factorial of 8 is 40320"

Example: Multiplication Table

# R Program to find the multiplication table (from 1 to 10)

# take input from the user
num = as.integer(readline(prompt = "Enter a number: "))

# use for loop to iterate 10 times
for(i in 1:10) {
  print(paste(num,'x', i, '=', num*i))
}

Output

Enter a number: 7

[1] "7 x 1 = 7"

[1] "7 x 2 = 14"

[1] "7 x 3 = 21"


[1] "7 x 4 = 28"

[1] "7 x 5 = 35"

[1] "7 x 6 = 42"

[1] "7 x 7 = 49"

[1] "7 x 8 = 56"

[1] "7 x 9 = 63"

[1] "7 x 10 = 70"

Example: Check Prime Number

# Program to check if the input number is prime or not

# take input from the user
num = as.integer(readline(prompt="Enter a number: "))

flag = 0

# prime numbers are greater than 1
if(num > 1) {
  # check for factors
  flag = 1
  for(i in 2:(num-1)) {
    if ((num %% i) == 0) {
      flag = 0
      break
    }
  }
}

# 2 is prime, but 2:(num-1) counts down as 2:1 and would clear the flag
if(num == 2) flag = 1

if(flag == 1) {
  print(paste(num,"is a prime number"))
} else {
  print(paste(num,"is not a prime number"))
}

Output 1

Enter a number: 25

[1] "25 is not a prime number"

Output 2

Enter a number: 19
[1] "19 is a prime number"

R while Loop

Loops are used in programming to repeat a specific block of code.

In R programming, a while loop repeats its body for as long as a specific condition
remains TRUE.

Syntax of while loop

while (test_expression) {
  statement
}

Here, test_expression is evaluated and the body of the loop is entered if the result
is TRUE.
The statements inside the loop are executed and the flow returns to evaluate
the test_expression again.
This is repeated each time until test_expression evaluates to FALSE, in which case,
the loop exits.

Flowchart of while Loop


Example of while Loop

i <- 1

while (i < 6) {
  print(i)
  i = i+1
}

Output

[1] 1

[1] 2

[1] 3
[1] 4

[1] 5

Example: Check Armstrong number

# take input from the user
num = as.integer(readline(prompt="Enter a number: "))

# initialize sum
sum = 0

# find the sum of the cube of each digit
temp = num
while(temp > 0) {
  digit = temp %% 10
  sum = sum + (digit ^ 3)
  temp = floor(temp / 10)
}

# display the result
if(num == sum) {
  print(paste(num, "is an Armstrong number"))
} else {
  print(paste(num, "is not an Armstrong number"))
}

Output 1

Enter a number: 23

[1] "23 is not an Armstrong number"

Output 2

Enter a number: 370

[1] "370 is an Armstrong number"

Example: Print Fibonacci Sequence

# take input from the user
nterms = as.integer(readline(prompt="How many terms? "))

# first two terms
n1 = 0
n2 = 1
count = 2

# check if the number of terms is valid
if(nterms <= 0) {
  print("Please enter a positive integer")
} else {
  if(nterms == 1) {
    print("Fibonacci sequence:")
    print(n1)
  } else {
    print("Fibonacci sequence:")
    print(n1)
    print(n2)
    while(count < nterms) {
      nth = n1 + n2
      print(nth)
      # update values
      n1 = n2
      n2 = nth
      count = count + 1
    }
  }
}

Output

How many terms? 7

[1] "Fibonacci sequence:"

[1] 0

[1] 1

[1] 1

[1] 2

[1] 3

[1] 5

[1] 8

Example: Find sum of natural numbers without formula

# take input from the user
num = as.integer(readline(prompt = "Enter a number: "))

if(num < 0) {
  print("Enter a positive number")
} else {
  sum = 0
  # use while loop to iterate until zero
  while(num > 0) {
    sum = sum + num
    num = num - 1
  }
  print(paste("The sum is", sum))
}

Output

Enter a number: 10

[1] "The sum is 55"

R repeat loop

A repeat loop is used to execute a block of code repeatedly.
There is no condition check in a repeat loop to exit the loop.
We must put a condition explicitly inside the body of the loop and use
the break statement to exit the loop. Failing to do so will result in an infinite loop.
Syntax of repeat loop

repeat {
  statement
}

In the statement block, we must use the break statement to exit the loop.

Flowchart of repeat loop

Example: repeat loop

x <- 1

repeat {
  print(x)
  x = x+1
  if (x == 6){
    break
  }
}

Output

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5

Control statements

R has the following loop control statements: break and next.

R break and next Statement

In R programming, a normal looping sequence can be altered using the break or
the next statement.

break statement
A break statement is used inside a loop (repeat, for, while) to stop the iterations and
transfer control to the first statement after the loop.
In a nested looping situation, where there is a loop inside another loop, this statement
exits only from the innermost loop that is being evaluated.
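The behaviour in nested loops can be checked with a short sketch. Here break leaves the inner loop whenever j reaches 2, while the outer loop keeps running; the vector visited is an illustrative name used only to record which iterations ran.

```r
visited <- character(0)

for (i in 1:3) {
  for (j in 1:3) {
    if (j == 2) {
      break                              # leaves only the inner loop
    }
    visited <- c(visited, paste(i, j))   # record the (i, j) pairs that ran
  }
}

visited
```

The outer loop still completes all three iterations, so one pair is recorded for each value of i.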

The syntax of break statement is:

if (test_expression) {
  break
}

Note: the break statement can also be used inside the else branch
of if...else statement.
Flowchart of break statement

Example 1: break statement

x <- 1:5

for (val in x) {
  if (val == 3){
    break
  }
  print(val)
}

Output

[1] 1

[1] 2

In this example, we iterate over the vector x, which has consecutive numbers from
1 to 5.
Inside the for loop we have used an if condition to break if the current value is equal
to 3.
As we can see from the output, the loop terminates when it encounters
the break statement.

next statement
A next statement is useful when we want to skip the current iteration of a loop
without terminating it. On encountering next, the R interpreter skips the rest of the
loop body and starts the next iteration of the loop.

The syntax of next statement is:

if (test_condition) {
  next
}

Note: the next statement can also be used inside the else branch
of if...else statement.
Flowchart of next statement

Example 2: next statement

x <- 1:5

for (val in x) {
  if (val == 3){
    next
  }
  print(val)
}

Output

[1] 1

[1] 2

[1] 4

[1] 5

R Functions

Functions are used to logically break our code into simpler parts which become easy
to maintain and understand.
It’s pretty straightforward to create your own function in R programming.

Syntax for Writing Functions in R


func_name <- function (argument) {
  statement
}

• Here, we can see that the reserved word function is used to declare a function in R.
• The statements within the curly braces form the body of the function. These braces
are optional if the body contains only a single expression.
• Finally, this function object is given a name by assigning it to a variable, func_name.
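The points above can be illustrated with two tiny functions, one written without braces and one with them. The names square and cube are illustrative only.

```r
# Single-expression body: the braces can be left out
square <- function(x) x^2

# The same style of function written with braces
cube <- function(x) {
  x^3
}

square(4)
cube(2)

# A function is an ordinary object bound to a name
is.function(square)
```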

Example of a Function

pow <- function(x, y) {
  # function to print x raised to the power y
  result <- x^y
  print(paste(x,"raised to the power", y, "is", result))
}

Here, we created a function called pow().


It takes two arguments, finds the first argument raised to the power of second
argument and prints the result in appropriate format.
We have used a built-in function paste() which is used to concatenate strings.

How to call a function?


We can call the above function as follows.

> pow(8, 2)

[1] "8 raised to the power 2 is 64"

> pow(2, 8)

[1] "2 raised to the power 8 is 256"

Here, the arguments used in the function declaration (x and y) are called formal
arguments and those used while calling the function are called actual arguments.

Named Arguments
In the above function calls, the argument matching of formal argument to the actual
arguments takes place in positional order.
This means that, in the call pow(8,2), the formal arguments x and y are assigned 8
and 2 respectively.
We can also call the function using named arguments.
When calling a function in this way, the order of the actual arguments doesn’t matter.
For example, all of the function calls given below are equivalent.

> pow(8, 2)

[1] "8 raised to the power 2 is 64"

> pow(x = 8, y = 2)

[1] "8 raised to the power 2 is 64"

> pow(y = 2, x = 8)
[1] "8 raised to the power 2 is 64"

Furthermore, we can use named and unnamed arguments in a single call.


In such case, all the named arguments are matched first and then the remaining
unnamed arguments are matched in a positional order.

> pow(x=8, 2)

[1] "8 raised to the power 2 is 64"

> pow(2, x=8)

[1] "8 raised to the power 2 is 64"

In all the examples above, x gets the value 8 and y gets the value 2.

Default Values for Arguments


We can assign default values to arguments in a function in R.
This is done by providing an appropriate value to the formal argument in the function
declaration.
Here is the above function with a default value for y.

pow <- function(x, y = 2) {

# function to print x raised to the power y

result <- x^y

print(paste(x,"raised to the power", y, "is", result))


}

The use of default value to an argument makes it optional when calling the function.

> pow(3)

[1] "3 raised to the power 2 is 9"

> pow(3,1)

[1] "3 raised to the power 1 is 3"

Here, y is optional and will take the value 2 when not provided.
R Return Value from Function

Often, we need a function to do some processing and return
the result. This is accomplished with the return() function in R.

Syntax of return()

return(expression)

The value returned from a function can be any valid object.

Example: return()
Let us look at an example which will return whether a given number is positive,
negative or zero.

check <- function(x) {
  if (x > 0) {
    result <- "Positive"
  } else if (x < 0) {
    result <- "Negative"
  } else {
    result <- "Zero"
  }
  return(result)
}

Here are some sample runs.

> check(1)

[1] "Positive"

> check(-10)

[1] "Negative"

> check(0)
[1] "Zero"

Functions without return()


If there are no explicit returns from a function, the value of the last evaluated
expression is returned automatically in R.
For example, the following is equivalent to the above function.

check <- function(x) {
  if (x > 0) {
    result <- "Positive"
  } else if (x < 0) {
    result <- "Negative"
  } else {
    result <- "Zero"
  }
  result
}
We generally use explicit return() functions to return a value immediately from a
function.
If it is not the last statement of the function, it will prematurely end the function
bringing the control to the place from which it was called.

check <- function(x) {
  if (x>0) {
    return("Positive")
  } else if (x<0) {
    return("Negative")
  } else {
    return("Zero")
  }
}

In the above example, if x > 0, the function immediately returns "Positive" without
evaluating rest of the body.

Multiple Returns
The return() function can return only a single object. If we want to return multiple
values in R, we can use a list (or other objects) and return it.
Following is an example.

multi_return <- function() {
  my_list <- list("color" = "red", "size" = 20, "shape" = "round")
  return(my_list)
}

Here, we create a list my_list with multiple elements and return this single list.

> a <- multi_return()

> a$color

[1] "red"

> a$size

[1] 20

> a$shape

[1] "round"
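The same pattern is handy when a function computes several related results. The sketch below uses a hypothetical function range_stats() (not part of the workshop) that returns three values in one named list.

```r
# Hypothetical helper returning several values in one list
range_stats <- function(v) {
  list(min = min(v), max = max(v), spread = max(v) - min(v))
}

stats <- range_stats(c(22, 27, 26, 24, 23, 26, 28))

stats$min        # access by name with $
stats[["max"]]   # or with double brackets
stats$spread
names(stats)     # names of all returned components
```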

Example: Simple Calculator in R

# Program to make a simple calculator that can add, subtract, multiply and divide
# using functions

add <- function(x, y) {
  return(x + y)
}

subtract <- function(x, y) {
  return(x - y)
}

multiply <- function(x, y) {
  return(x * y)
}

divide <- function(x, y) {
  return(x / y)
}

# take input from the user

print("Select operation.")

print("1.Add")

print("2.Subtract")

print("3.Multiply")
print("4.Divide")

choice = as.integer(readline(prompt="Enter choice[1/2/3/4]: "))

num1 = as.integer(readline(prompt="Enter first number: "))

num2 = as.integer(readline(prompt="Enter second number: "))

operator <- switch(choice,"+","-","*","/")

result <- switch(choice, add(num1, num2), subtract(num1, num2),
                 multiply(num1, num2), divide(num1, num2))

print(paste(num1, operator, num2, "=", result))

Output

[1] "Select operation."

[1] "1.Add"

[1] "2.Subtract"

[1] "3.Multiply"

[1] "4.Divide"

Enter choice[1/2/3/4]: 4

Enter first number: 20

Enter second number: 4


[1] "20 / 4 = 5"

R Graphs & Charts

• R Programming Bar Plot


• R Programming Histogram
• R Programming Pie Chart
• R Box Plot
• R plot

R Bar Plot

Bar plots can be created in R using the barplot() function. We can supply a vector or
matrix to this function. If we supply a vector, the plot will have bars with their
heights equal to the elements in the vector.
Let us suppose, we have a vector of maximum temperatures (in degree Celsius) for
seven days as follows.

max.temp <- c(22, 27, 26, 24, 23, 26, 28)

Now we can make a bar plot out of this data.

barplot(max.temp)
This function can take many arguments to control the way our data is plotted. You
can read about them in the help section ?barplot.
Some of the frequently used ones are main to give the title, xlab and ylab to provide
labels for the axes, names.arg for naming each bar, and col to define the color.
We can also plot bars horizontally by providing the argument horiz = TRUE.

# barchart with added parameters

barplot(max.temp,

main = "Maximum Temperatures in a Week",

xlab = "Degree Celsius",

ylab = "Day",

names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),

col = "darkred",

horiz = TRUE)
Plotting Categorical Data
Sometimes we have to plot the count of each item in categorical data as a bar plot.
For example, here is a vector of the ages of 10 college freshmen.

age <- c(17,18,18,17,18,19,18,16,18,18)

Simply doing barplot(age) will not give us the required plot. It will plot 10 bars with
heights equal to the students' ages. But we want to know the number of students in each
age category.
This count can be quickly found using the table() function, as shown below.

> table(age)

age

16 17 18 19

1 2 6 1

Now plotting this data will give our required bar plot. Note below, that we define
the argument density to shade the bars.

barplot(table(age),

main="Age Count of 10 Students",

xlab="Age",

ylab="Count",

border="red",

col="blue",

density=10

)
How to plot higher dimensional tables?
Sometimes the data is in the form of a contingency table. For example, let us take
the built-in Titanic dataset.
"This data set provides information on the fate of passengers on the fatal maiden
voyage of the ocean liner 'Titanic', summarized according to economic status
(class), sex, age and survival." (R documentation)

> Titanic

, , Age = Child, Survived = No

Sex

Class Male Female


1st 0 0

2nd 0 0

3rd 35 17

Crew 0 0

, , Age = Adult, Survived = No

Sex

Class Male Female

1st 118 4

2nd 154 13

3rd 387 89

Crew 670 3

, , Age = Child, Survived = Yes

Sex

Class Male Female

1st 5 1

2nd 11 13

3rd 13 14
Crew 0 0

, , Age = Adult, Survived = Yes

Sex

Class Male Female

1st 57 140

2nd 14 80

3rd 75 76

Crew 192 20

We can see that this data has 4 dimensions, class, sex, age and survival. Suppose we
wanted to bar plot the count of males and females.
In this case we can use the margin.table() function. This function sums up the table
entries according to the given index.

> margin.table(Titanic,1) # count according to class

Class

1st 2nd 3rd Crew

325 285 706 885

> margin.table(Titanic,4) # count according to survival

Survived

No Yes
1490 711

> margin.table(Titanic) # gives total count if index is not provided

[1] 2201

Now that we have our data in the required format, we can plot, survival for example,
as barplot(margin.table(Titanic,4)) or plot male vs female count
as barplot(margin.table(Titanic,2)).
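margin.table() also accepts more than one index, which keeps several dimensions at once. The following sketch, using only the built-in Titanic table, counts passengers by sex and by sex and survival together; the variable names are illustrative.

```r
# Count by sex only (dimension 2)
sex_counts <- margin.table(Titanic, 2)
sex_counts

# Keep two dimensions: sex (2) and survival (4)
sex_by_survival <- margin.table(Titanic, c(2, 4))
sex_by_survival

# Either result can be passed to barplot()
barplot(sex_counts, main = "Passengers by Sex", ylab = "Count")
```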

How to plot barplot with matrix?


As mentioned before, barplot() function can take in vector as well as matrix. If the
input is matrix, a stacked bar is plotted. Each column of the matrix will be
represented by a stacked bar.
Let us consider the following matrix which is derived from our Titanic dataset.

> titanic.data

Class

Survival 1st 2nd 3rd Crew

No 122 167 528 673

Yes 203 118 178 212
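The text does not show how titanic.data was built. One way to derive such a matrix, assuming the built-in Titanic table, is to sum over the Sex and Age dimensions with apply(). With this approach the row dimension carries the dataset's own name, Survived, rather than Survival.

```r
# Sum the 4-dimensional table over Sex and Age,
# keeping Survived (dimension 4) as rows and Class (dimension 1) as columns
titanic.data <- apply(Titanic, c(4, 1), sum)
titanic.data

dim(titanic.data)   # 2 rows (No/Yes) by 4 columns (1st/2nd/3rd/Crew)
```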

This data is plotted as follows.

barplot(titanic.data,
  main = "Survival of Each Class",
  xlab = "Class",
  col = c("red","green")
)

legend("topleft",
  c("Not survived","Survived"),
  fill = c("red","green")
)
We have used the legend() function to appropriately display the legend.
Instead of a stacked bar, we can draw the elements of each column as separate bars
placed side by side by specifying the parameter beside = TRUE.
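A minimal sketch of a grouped bar plot follows. It rebuilds the survival-by-class matrix from the built-in Titanic table with apply() (an assumption, since the text does not show how titanic.data was created) so that the block runs on its own.

```r
# Survival counts per class, derived from the built-in Titanic table
titanic.data <- apply(Titanic, c(4, 1), sum)

# beside = TRUE draws one bar per matrix element, grouped by column
bar_positions <- barplot(titanic.data,
  main = "Survival of Each Class",
  xlab = "Class",
  col = c("red", "green"),
  beside = TRUE
)

legend("topleft",
  c("Not survived", "Survived"),
  fill = c("red", "green")
)

# barplot() returns the bar midpoints: one row per group member, one column per class
dim(bar_positions)
```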
R Histograms

A histogram can be created using the hist() function in R. This
function takes in a vector of values for which the histogram is plotted.
Let us use the built-in dataset airquality, which has "Daily air quality measurements
in New York, May to September 1973" (R documentation).

> str(airquality)

'data.frame': 153 obs. of 6 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...


$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

We will use the temperature parameter, which has 153 observations in degrees
Fahrenheit.

Example 1: Simple histogram

Temperature <- airquality$Temp

hist(Temperature)
We can see above that there are 9 cells with equally spaced breaks. In this case, the
height of a cell is equal to the number of observation falling in that cell.
We can pass in additional parameters to control the way our plot looks. You can read
about them in the help section ?hist.
Some of the frequently used ones are, main to give the title, xlab and ylab to provide
labels for the axes, xlim and ylim to provide range of the axes, col to define color
etc.
Additionally, with the argument freq=FALSE we can plot the probability density
instead of the frequency.

Example 2: Histogram with added parameters

# histogram with added parameters

hist(Temperature,

main="Maximum daily temperature at La Guardia Airport",

xlab="Temperature in degrees Fahrenheit",

xlim=c(50,100),

col="darkmagenta",

freq=FALSE

)
Note that the y axis is labelled density instead of frequency. In this case, the total
area of the histogram is equal to 1.
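The claim about the total area can be verified numerically. hist() returns an object holding the breaks, counts and densities, and with plot = FALSE it computes them without drawing anything. The sketch below multiplies each density by its cell width and adds the results; the variable names are illustrative.

```r
Temperature <- airquality$Temp

# Compute the histogram without plotting it
h <- hist(Temperature, plot = FALSE)

bin_widths <- diff(h$breaks)               # width of each cell
total_area <- sum(h$density * bin_widths)  # area = height x width, summed over cells
total_area

sum(h$counts)                              # total number of observations
```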

Defining the Number of Breaks


With the breaks argument we can specify the number of cells we want in the
histogram. However, this number is just a suggestion.
R calculates the best number of cells, keeping this suggestion in mind. Following
are two histograms on the same data with different number of cells.

Example 4: Histogram with different breaks

hist(Temperature, breaks=4, main="With breaks=4")

hist(Temperature, breaks=20, main="With breaks=20")


In the above figure we see that the actual number of cells plotted is greater than we
had specified.
We can also define breakpoints between the cells as a vector. This makes it possible
to plot a histogram with unequal intervals. In such case, the area of the cell is
proportional to the number of observations falling inside that cell.

Example 5: Histogram with non-uniform width

hist(Temperature,
  main="Maximum daily temperature at La Guardia Airport",
  xlab="Temperature in degrees Fahrenheit",
  xlim=c(50,100),
  col="chocolate",
  border="brown",
  breaks=c(55,60,70,75,80,100)
)

R Pie Chart

A pie chart is drawn using the pie() function in R programming. This function takes
in a vector of non-negative numbers.

> expenditure

Housing Food Cloths Entertainment Other

600 300 150 100 200

Suppose the above data represents the monthly expenditure breakdown of an
individual.
Example: Simple pie chart using pie()
Now let us draw a simple pie chart out of this data using the pie() function.
expenditure <- c(Housing = 600, Food = 300, Cloths = 150, Entertainment = 100, Other = 200)

pie(expenditure)

We can see above that a pie chart was plotted with 5 slices. The chart was drawn in
anti-clockwise direction using pastel colors.
We can pass in additional parameters to affect the way pie chart is drawn. You can
read about it in the help section ?pie.
Some of the frequently used ones are labels to give names to slices, main to add a
title, col to define colors for the slices and border to color the borders.
We can also pass the argument clockwise=TRUE to draw the chart in clockwise
fashion.

Example 2: Pie chart with additional parameters

pie(expenditure,
  labels=as.character(expenditure),
  main="Monthly Expenditure Breakdown",
  col=c("red","orange","yellow","blue","green"),
  border="brown",
  clockwise=TRUE
)

As seen in the above figure, we have used the actual amount as labels. Also, the chart
is drawn in clockwise fashion.

R Box Plot

In R, a boxplot (box-and-whisker plot) is created using the boxplot() function.

The boxplot() function takes in any number of numeric vectors, drawing a boxplot
for each vector.
You can also pass in a list (or data frame) with numeric vectors as its components.
Let us use the built-in dataset airquality which has “Daily air quality measurements
in New York, May to September 1973.”-R documentation.

> str(airquality)
'data.frame': 153 obs. of 6 variables:

$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...

$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...

$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...

$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...

$ Month : int 5 5 5 5 5 5 5 5 5 5 ...

$ Day : int 1 2 3 4 5 6 7 8 9 10 ...

Let us make a boxplot for the ozone readings.

boxplot(airquality$Ozone)
We can see that data above the median is more dispersed. We can also notice two
outliers at the higher extreme.
We can pass in additional parameters to control the way our plot looks. You can read
about them in the help section ?boxplot.
Some of the frequently used ones are main to give the title, xlab and ylab to provide
labels for the axes, and col to define the color.
Additionally, with the argument horizontal = TRUE we can plot it horizontally and
with notch = TRUE we can add a notch to the box.

boxplot(airquality$Ozone,

main = "Mean ozone in parts per billion at Roosevelt Island",

xlab = "Parts Per Billion",

ylab = "Ozone",

col = "orange",

border = "brown",

horizontal = TRUE,

notch = TRUE

)
Multiple Boxplots
We can draw multiple boxplots in a single plot, by passing in a list, data frame or
multiple vectors.
Let us consider the Ozone and Temp fields of the airquality dataset.

# extract the two columns to compare
ozone <- airquality$Ozone
temp <- airquality$Temp

boxplot(ozone, temp,
  main = "Multiple boxplots for comparison",
  names = c("ozone", "temp"),
  col = c("orange","red"),
  border = "brown",
  horizontal = TRUE,
  notch = TRUE
)

Boxplot from Formula

The function boxplot() can also take in formulas of the form y~x, where y is a
numeric vector which is grouped according to the value of x.
For example, in our dataset airquality, Temp can be our numeric vector. Month
can be our grouping variable, so that we get the boxplot for each month separately.
In our dataset, month is in the form of a number (1 = January, 2 = February and so on).

boxplot(Temp~Month,
  data=airquality,
  main="Different boxplots for each month",
  xlab="Month Number",
  ylab="Degree Fahrenheit",
  col="orange",
  border="brown"
)

It is clear from the above figure that the month number 7 (July) is relatively hotter
than the rest.
R Plot Function

The most used plotting function in R programming is the plot() function. It is a
generic function, meaning it has many methods, which are called according to the
type of object passed to plot().

In the simplest case, we can pass in a vector and we will get a scatter plot of
magnitude vs index. But generally, we pass in two vectors, and a scatter plot of
these points is drawn.
For example, the command plot(c(1,2),c(3,5)) would plot the points (1,3) and (2,5).
Here is a more concrete example where we plot a sine function over the range -pi to pi.

x <- seq(-pi,pi,0.1)

plot(x, sin(x))

Adding Titles and Labeling Axes


We can add a title to our plot with the parameter main. Similarly, xlab and ylab can
be used to label the x-axis and y-axis respectively.

plot(x, sin(x),

main="The Sine Function",

ylab="sin(x)")
Changing Color and Plot Type
We can see above that the plot is of circular points and black in color. This is the
default color.
We can change the plot type with the argument type. It accepts the following strings
and has the given effect.

"p" - points

"l" - lines

"b" - both points and lines

"c" - empty points joined by lines

"o" - overplotted points and lines

"s" and "S" - stair steps


"h" - histogram-like vertical lines

"n" - does not produce any points or lines

Similarly, we can define the color using col.

plot(x, sin(x),

main="The Sine Function",

ylab="sin(x)",

type="l",

col="blue")
Overlaying Plots Using legend() function
Calling plot() multiple times will have the effect of plotting the current graph on the
same window replacing the previous one.
However, sometimes we wish to overlay the plots in order to compare the results.
This is made possible with the functions lines() and points() to add lines and points
respectively, to the existing plot.

plot(x, sin(x),

main="Overlaying Graphs",

ylab="",

type="l",

col="blue")

lines(x,cos(x), col="red")

legend("topleft",

c("sin(x)","cos(x)"),

fill=c("blue","red")

)
We have used the function legend() to appropriately display the legend.

Importing Data

The first step to any data analysis process is to get the data. Data can come from
many sources but two of the most common include text and Excel files.

Text file formats use delimiters to separate the different elements in a line, and
each record is on its own line in the text file. Therefore, importing different
kinds of text files can follow a fairly consistent process once you’ve identified the
delimiter.

What is a CSV file?


CSV files are file formats that contain plain text values separated by commas.

CSV files can be opened by any spreadsheet program: Microsoft Excel, Open Office,
Google Sheets, etc. You can open a CSV file in a simple text editor as well. It is a very
widespread and popular file format for storing and reading data because it is simple
and it’s compatible with most platforms. But this simplicity has some disadvantages.
CSV is only capable of storing a single sheet in a file, without any formatting and
formulas.
Here is an example CSV file as it appears in a text editor:

Rank,Movie,Director,Year,Gross profit

1,The Shawshank Redemption,Frank Darabont,1994,28341469

2,The Godfather,Francis Ford Coppola,1972,134821952

3,The Dark Knight,Christopher Nolan,2008,533316061

4,12 Angry Men,Sidney Lumet,1957,0

Base R functions
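read.table() is a multipurpose work-horse function in base R for importing data.
The functions read.csv() and read.delim() are special cases of read.table() in which
the defaults have been adjusted for comma-separated and tab-delimited files respectively.
As an example, suppose the file mydata.csv contains the following: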

variable 1,variable 2,variable 3


10,beer,TRUE
25,wine,TRUE
8,cheese,FALSE

To read in the CSV file we can use read.csv().

mydata <- read.csv("mydata.csv")

View(mydata)
str(mydata)
## 'data.frame': 3 obs. of 3 variables:
## $ variable.1: int 10 25 8
## $ variable.2: Factor w/ 3 levels "beer","cheese",..: 1 3 2
## $ variable.3: logi TRUE TRUE FALSE

However, we may want to read in variable.2 as a character variable rather than a
factor. We can take care of this with the stringsAsFactors argument. In R versions
before 4.0.0 the default is stringsAsFactors = TRUE (from R 4.0.0 onward it is FALSE);
setting it to FALSE reads in the variable as a character variable.

mydata_2 <- read.csv("mydata.csv", stringsAsFactors = FALSE)


mydata_2
## variable.1 variable.2 variable.3
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE

str(mydata_2)
## 'data.frame': 3 obs. of 3 variables:
## $ variable.1: int 10 25 8
## $ variable.2: chr "beer" "wine" "cheese"
## $ variable.3: logi TRUE TRUE FALSE

As previously stated read.csv is just a wrapper function for read.table but with
adjusted default arguments. Therefore, we can use read.table to read in this same
data. The two arguments we need to be aware of are the field separator (sep) and
the argument indicating whether the file contains the names of the variables as its
first line (header). In read.table the defaults are sep = "" and header =
FALSE whereas in read.csv the defaults are sep = "," and header = TRUE.

# provides same results as read.csv above
read.table("mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
## variable.1 variable.2 variable.3
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE

# set column and row names
read.table("mydata.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE,
           col.names = c("Var 1", "Var 2", "Var 3"),
           row.names = c("Row 1", "Row 2", "Row 3"))
## Var.1 Var.2 Var.3
## Row 1 10 beer TRUE
## Row 2 25 wine TRUE
## Row 3 8 cheese FALSE
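
To see why the header argument matters, here is a small sketch that reads the same text twice, once with header = FALSE (the read.table default) and once with header = TRUE. The input uses simplified column names (var1, var2, var3) and is supplied as a string; both are assumptions made to keep the example self-contained.

```r
csv_text <- "var1,var2,var3
10,beer,TRUE
25,wine,TRUE
8,cheese,FALSE"

# header = FALSE: the first line is treated as data, columns are named
# V1, V2, V3, and every column becomes character because the header text
# is mixed in with the values.
no_header <- read.table(text = csv_text, sep = ",", header = FALSE,
                        stringsAsFactors = FALSE)
names(no_header)
nrow(no_header)
class(no_header$V1)

# header = TRUE: the first line supplies the names and the types are
# detected from the data rows only.
with_header <- read.table(text = csv_text, sep = ",", header = TRUE,
                          stringsAsFactors = FALSE)
names(with_header)
nrow(with_header)
class(with_header$var1)
```

A column that unexpectedly comes back as character is often a sign that the header line was read as data.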

In addition to .csv files, there are other text files that read.table works with. The
primary difference is the character that separates the elements. For example,
tab-delimited text files typically end with the .txt or .tsv extension. You can also
use the read.delim() function: similar to read.csv(), read.delim() is a wrapper
of read.table() with defaults set specifically for tab-delimited files. We can read in
a tab-delimited file such as mydata.txt with the following:

# reading in tab delimited text files
read.delim("mydata.txt")
## variable.1 variable.2 variable.3
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE

# provides same results as read.delim
read.table("mydata.txt", sep = "\t", header = TRUE)
## variable.1 variable.2 variable.3
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE
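
The equivalence of the two calls can be checked with a short, self-contained sketch. It writes a temporary tab-delimited file (a stand-in for mydata.txt, created with tempfile() so nothing in the working directory is touched) and reads it both ways; the object names are illustrative only.

```r
# Create a small tab-delimited file in a temporary location.
tmp <- tempfile(fileext = ".txt")
writeLines(c("variable 1\tvariable 2\tvariable 3",
             "10\tbeer\tTRUE",
             "25\twine\tTRUE",
             "8\tcheese\tFALSE"),
           tmp)

# read.delim() already assumes sep = "\t" and header = TRUE.
from_delim <- read.delim(tmp, stringsAsFactors = FALSE)

# read.table() needs both arguments spelled out.
from_table <- read.table(tmp, sep = "\t", header = TRUE,
                         stringsAsFactors = FALSE)

# Compare the two results.
all.equal(from_delim, from_table)

unlink(tmp)  # remove the temporary file
```

Besides sep and header, read.delim() also sets fill = TRUE and comment.char = "", so the two calls can differ on files with ragged rows or # characters.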

Readr package: read_csv function

library(readr)
mydata_3 <- read_csv("mydata.csv")
mydata_3
## variable 1 variable 2 variable 3
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE

str(mydata_3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 3 variables:
## $ variable 1: int 10 25 8
## $ variable 2: chr "beer" "wine" "cheese"
## $ variable 3: logi TRUE TRUE FALSE

From Excel

You can use the xlsx package to access Excel files. The first row should contain
variable/column names.
# read in the first worksheet from the workbook myexcel.xlsx

# first row contains variable names

library(xlsx)

mydata <- read.xlsx("c:/myexcel.xlsx", 1)

# read in the worksheet named mysheet

mydata <- read.xlsx("c:/myexcel.xlsx", sheetName = "mysheet")

readxl to import Excel data

library(readxl)

mydata <- read_excel("mydata.xlsx", sheet = "PICK_ME_FIRST!")


mydata
## # A tibble: 3 x 3
## `variable 1` `variable 2` `variable 3`
## <dbl> <chr> <lgl>
## 1 10 beer TRUE
## 2 25 wine TRUE
## 3 8 cheese FALSE

str(mydata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3 obs. of 3 variables:
## $ variable 1: num 10 25 8
## $ variable 2: chr "beer" "wine" "cheese"
## $ variable 3: logi TRUE TRUE FALSE

Exporting Data
This section covers how to export data to text files and to Excel files (the Excel
tools also offer some additional formatting capabilities).

Base R functions
write.table() is the multipurpose work-horse function in base R for exporting data.
The functions write.csv() and write.csv2() are special cases of write.table() in
which the defaults have been set for comma-separated and semicolon-separated files,
respectively; base R has no write.delim() function. To illustrate these functions
let’s work with a data frame that we wish to export to a CSV file in our working
directory.

df <- data.frame(var1 = c(10, 25, 8),
                 var2 = c("beer", "wine", "cheese"),
                 var3 = c(TRUE, TRUE, FALSE),
                 row.names = c("billy", "bob", "thornton"))

df
## var1 var2 var3
## billy 10 beer TRUE
## bob 25 wine TRUE
## thornton 8 cheese FALSE
To export df to a .csv file we can use write.csv(). Additional arguments allow you
to exclude row and column names, specify what to use for missing values, add or
remove quotations around character strings, etc.

# write to a csv file
write.csv(df, file = "export_csv")

# write to a csv and save in a different directory
write.csv(df, file = "folder/subfolder/subsubfolder/export_csv")

# write to a csv file with added arguments
write.csv(df, file = "export_csv", row.names = FALSE, na = "MISSING!")
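
To make the effect of row.names and na visible, the following sketch writes a small data frame with one missing value to a temporary file and reads it back. The file name export_example.csv, the use of tempdir(), and the NA placed in var1 are assumptions for the illustration.

```r
# A data frame like the one above, but with one missing value.
df_na <- data.frame(var1 = c(10, NA, 8),
                    var2 = c("beer", "wine", "cheese"),
                    var3 = c(TRUE, TRUE, FALSE),
                    stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "export_example.csv")

# No row names in the file; missing values are written as MISSING!
write.csv(df_na, file = out_file, row.names = FALSE, na = "MISSING!")

# Inspect the raw lines that were written.
readLines(out_file)

# When reading back, tell read.csv() which string marks a missing value.
back <- read.csv(out_file, na.strings = "MISSING!", stringsAsFactors = FALSE)
str(back)

unlink(out_file)  # clean up
```

Whatever string is chosen for na on export has to be passed as na.strings on import, otherwise the column is read back as character.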

In addition to .csv files, we can also write to other text files using write.table().
Base R has no write.delim() function; for tab-delimited output, call write.table()
with sep = "\t".

# write to a tab delimited text file
write.table(df, file = "export_txt", sep = "\t")

# same, but without quotes around character strings
write.table(df, file = "export_txt", sep = "\t", quote = FALSE)
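
One detail worth illustrating is how row names travel through a tab-delimited file. In the sketch below (file name and tempdir() location are assumptions), col.names = NA asks write.table() to put an empty header cell above the row names so that the header and the data rows have the same number of fields, and row.names = 1 tells read.delim() to restore them.

```r
df_rn <- data.frame(var1 = c(10, 25, 8),
                    var2 = c("beer", "wine", "cheese"),
                    var3 = c(TRUE, TRUE, FALSE),
                    row.names = c("billy", "bob", "thornton"),
                    stringsAsFactors = FALSE)

out_file <- file.path(tempdir(), "export_example.txt")

# Tab-separated, unquoted, with a blank header cell for the row names.
write.table(df_rn, file = out_file, sep = "\t", quote = FALSE,
            col.names = NA)

# Read it back, using the first column as row names.
back <- read.delim(out_file, row.names = 1, stringsAsFactors = FALSE)
back

unlink(out_file)  # clean up
```

Without col.names = NA the header line has one field fewer than the data rows, which R itself can read but many spreadsheet programs misalign.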

readr package
The readr package uses write functions similar to base R. However, readr write
functions are about twice as fast and they do not write row names. One thing to
note: older readr versions named the output argument path =, whereas readr 1.4.0
and later use file =, as the base R write functions do (path = is deprecated).

library(readr)

# write to a csv file
write_csv(df, file = "export_csv2")

# write to a csv and save in a different directory
write_csv(df, file = "folder/subfolder/subsubfolder/export_csv2")

# write to a csv file without column names
write_csv(df, file = "export_csv2", col_names = FALSE)

# write to a txt file without column names (fields are separated by a space by default)
write_delim(df, file = "export_txt2", col_names = FALSE)

Exporting to Excel files

As previously mentioned, many organizations still rely on Excel to hold and share
data.

xlsx package
The xlsx package provides exporting and formatting capabilities for Excel 2007
and Excel 97/2000/XP/2003 file formats. Although these file formats are a bit
outdated, this package provides some nice formatting options. Saving a data frame
to a .xlsx file is as easy as saving to a .csv file:

library(xlsx)

# write to a .xlsx file


write.xlsx(df, file = "output_example.xlsx")

# write to a .xlsx file without row names


write.xlsx(df, file = "output_example.xlsx", row.names = FALSE)
