
Da Unit-1


DATA ANALYTICS

Unit 1: An overview of R, Vectors, factors, univariate time series, Data frames, matrices, Functions, operators, loops,
Graphics, Revealing views of the data, Data summary, Statistical analysis questions, aims, and strategies; Statistical
models, Distributions: models for the random component, Simulation of random numbers and random samples,
Model assumptions

Text Books:
1. Data Analysis and Graphics Using R – an Example-Based Approach, John Maindonald, W. John Braun, Third
Edition, 2010

An overview of R
R is a programming language.
R is often used for statistical computing and graphical presentation to analyze and visualize data.

Why Use R?
- It is a great resource for data analysis, data visualization, data science and machine learning
- It provides many statistical techniques (such as statistical tests, classification, clustering and data
reduction)
- It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
- It works on different platforms (Windows, Mac, Linux)
- It is open-source and free
- It has large community support
- It has many packages (libraries of functions) that can be used to solve different problems

How to Install R
To install R, go to https://cloud.r-project.org/ and download the latest version of R for Windows, Mac or Linux.
When you have downloaded and installed R, you can run R on your computer.
The screenshot below shows what R may look like when you run it on a Windows PC:

If you type 5 + 5, and press enter, you will see that R outputs 10.

The command line prompt (>) is an invitation to type commands or expressions. Once the command or expression
is complete, and the Enter key is pressed, R evaluates and prints the result in the console window. This allows the
use of R as a calculator.
For example, type 2+2 and press the Enter key. Here is what appears on the screen:
> 2+2
[1] 4
>

The first element is labeled [1] even when, as here, there is just one element! The final > prompt indicates that R is
ready for another command.

> 2*3*4*5 # * denotes 'multiply'
[1] 120
> sqrt(10) # the square root of 10
[1] 3.162278
> pi # R knows about pi
[1] 3.141593
> 2*pi*6378 # Circumference of earth at equator (km)
# (radius at equator is 6378 km)
[1] 40074.16

Anything that follows a # on the command line is taken as comment and ignored by R.
A continuation prompt, by default +, appears following a carriage return when the command is not yet complete.
For example, an interruption of the calculation of 3*4^2 by a carriage return could appear as
> 3*4^
+ 2
[1] 48

Multiple commands may appear on one line, with a semicolon (;) as the separator. For example,
> 3*4^2; (3*4)^2
[1] 48
[1] 144

To output text in R, use single or double quotes:


Example
"Hello World!"

To output numbers, just type the number (without quotes):


Example
5
10
25

To do simple calculations, add numbers together:

Example
5+5

Print
Unlike many other programming languages, you can output values in R without using a print function:
Example
"Hello World!"

However, R does have a print() function available if you want to use it. This might be useful if you are familiar with
other programming languages, such as Python, which often uses the print() function to output code.
Example
print("Hello World!")

And there are times you must use the print() function to output code, for example when working with for loops
(which you will learn more about in a later chapter):
Example
for (x in 1:10)
{
print(x)
}

Conclusion: It is up to you whether you want to use the print() function to output code. However, when your code
is inside an R expression (for example inside curly braces {} like in the example above), use the print() function to
output the result.

Creating Variables in R
Variables are containers for storing data values.
R does not have a command for declaring a variable. A variable is created the moment you first assign a value to it.
To assign a value to a variable, use the <- sign. To output (or print) the variable value, just type the variable name:

Example
name <- "John"
age <- 40
name # output "John"
age # output 40

From the example above, name and age are variables, while "John" and 40 are values.
In other programming languages, it is common to use = as an assignment operator. In R, both = and <- work as
assignment operators at the top level.
However, <- is preferred in most cases, because inside a function call = names an argument instead of assigning a
variable.
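
The difference between the two operators shows up inside a function call. The sketch below is illustrative only: the function total and the variable names values and nums are assumptions made for this example, not part of the original notes.

```r
# Hypothetical helper used only for this illustration.
total <- function(values) sum(values)

# Inside a call, `=` only names the argument `values`;
# no variable called `values` is created in the calling environment.
r1 <- total(values = c(1, 2, 3))
created_by_equals <- exists("values", inherits = FALSE)

# Inside a call, `<-` assigns `nums` first and then passes its value on.
r2 <- total(nums <- c(4, 5, 6))
created_by_arrow <- exists("nums", inherits = FALSE)

print(r1)                 # 6
print(created_by_equals)  # FALSE
print(r2)                 # 15
print(created_by_arrow)   # TRUE
```

Writing nums <- c(4, 5, 6) on its own line before the call is usually clearer than assigning inside the call.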

Print / Output Variables


Compared to many other programming languages, you do not have to use a function to print/output variables in R.
You can just type the name of the variable:
Example
name <- "John Doe"

name # auto-print the value of the name variable


However, R does have a print() function available if you want to use it. This might be useful if you are familiar with
other programming languages, such as Python, which often use a print() function to output variables.
Example
name <- "John Doe"
print(name) # print the value of the name variable

Data Types
In programming, data type is an important concept.
Variables can store data of different types, and different types can do different things.
In R, variables do not need to be declared with any particular type, and can even change type after they have been
set:
Example
my_var <- 30 # my_var is of type numeric
my_var <- "Sally" # my_var is now of type character (aka string)

R has a variety of data types and object classes.

Basic Data Types


Basic data types in R can be divided into the following types:
- numeric - (10.5, 55, 787)
- integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
- complex - (9 + 3i, where "i" is the imaginary part)
- character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
- logical (a.k.a. boolean) - (TRUE or FALSE)

We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)

# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)
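
For reference, the sketch below collects what class() reports for one value of each basic type. It also calls typeof(), which is not used elsewhere in these notes, to show that a numeric value is stored as a double. The variable names are illustrative.

```r
# One value of each basic type (names are illustrative).
values <- list(10.5, 1000L, 9i + 3, "R is exciting", TRUE)

# class() reports the data type of each value, in order:
# numeric, integer, complex, character, logical
classes <- sapply(values, class)
print(classes)

# typeof() reports the internal storage type; numeric values are doubles.
storage <- typeof(10.5)
print(storage)
```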

Numbers
There are three number types in R:
- numeric
- integer
- complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex

Numeric
A numeric data type is the most common type in R, and contains any number with or without a decimal, like: 10.5,
55, 787:
Example
x <- 10.5
y <- 55

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)

Integer
Integers are numeric data without decimals. Use this type when you are certain that the variable will never need to
hold decimals. To create an integer variable, you must use the letter L after the integer value:
Example
x <- 1000L
y <- 55L

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)

Complex
A complex number is written with an "i" as the imaginary part:
Example
x <- 3+5i
y <- 5i

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)

Type Conversion
You can convert from one type to another with the following functions:
- as.numeric()
- as.integer()
- as.complex()
Example
x <- 1L # integer
y <- 2 # numeric

# convert from integer to numeric:


a <- as.numeric(x)

# convert from numeric to integer:


b <- as.integer(y)

# print values of x and y


x
y

# print the class name of a and b


class(a)
class(b)
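
Conversions can change the value as well as the class. A short sketch of some common cases (the variable names are illustrative):

```r
# as.integer() drops the fractional part (truncates toward zero).
i_pos <- as.integer(2.9)    # 2
i_neg <- as.integer(-2.9)   # -2

# Character strings that look like numbers can be converted too.
n_chr <- as.numeric("11.5") # 11.5

# A real number becomes a complex number with a zero imaginary part.
z <- as.complex(2)          # 2+0i

print(i_pos); print(i_neg); print(n_chr); print(z)
```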

Simple Math
In R, you can use operators to perform common mathematical operations on numbers.
The + operator is used to add together two values:
Example
10 + 5
And the - operator is used for subtraction:
Example
10 - 5

Built-in Math Functions


R also has many built-in math functions that allow you to perform mathematical tasks on numbers.
For example, the min() and max() functions can be used to find the lowest or highest number in a set:
Example
max(5, 10, 15)

min(5, 10, 15)

sqrt()
The sqrt() function returns the square root of a number:
Example
sqrt(16)

abs()
The abs() function returns the absolute (positive) value of a number:
Example
abs(-4.7)

ceiling() and floor()


The ceiling() function rounds a number upwards to its nearest integer, and the floor() function rounds a number
downwards to its nearest integer, and returns the result:
Example
ceiling(1.4)
floor(1.4)
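
The direction of rounding matters most for negative numbers. The sketch below compares ceiling() and floor() on 1.4 and -1.4, and adds round() for comparison; the variable names are illustrative.

```r
up_pos   <- ceiling(1.4)    # 2
down_pos <- floor(1.4)      # 1

# For negative numbers, "upwards" means toward zero and "downwards" away from it.
up_neg   <- ceiling(-1.4)   # -1
down_neg <- floor(-1.4)     # -2

# round() goes to the nearest integer instead.
nearest  <- round(1.4)      # 1

print(c(up_pos, down_pos, up_neg, down_neg, nearest))
```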

String Literals
Strings (values of type character) are used for storing text. A string is surrounded by either single quotation
marks or double quotation marks, so "hello" is the same as 'hello':
Example
"hello"
'hello'

Assign a String to a Variable


Assigning a string to a variable is done with the variable followed by the <- operator and the string:
Example
str <- "Hello"
str # print the value of str

Multiline Strings
You can assign a multiline string to a variable like this:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."

str # print the value of str


However, note that when the value is auto-printed, R shows a "\n" at each line break. This is an escape sequence:
the backslash is the escape character, and the n character indicates a new line.

If you want the line breaks to be inserted at the same position as in the code, use the cat() function:
Example
str <- "Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua."

cat(str)

String Length
There are many useful string functions in R.
For example, to find the number of characters in a string, use the nchar() function:
Example
str <- "Hello World!"

nchar(str)

Check a String
Use the grepl() function to check if a character or a sequence of characters is present in a string:
Example
str <- "Hello World!"

grepl("H", str)
grepl("Hello", str)
grepl("X", str)
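
Two details are easy to miss: grepl() is case-sensitive, and it treats the pattern as a regular expression unless told otherwise. A small sketch, using the standard fixed and ignore.case arguments (the variable names are illustrative):

```r
str <- "Hello World!"

has_world   <- grepl("World", str)                     # TRUE
has_lower   <- grepl("world", str)                     # FALSE: matching is case-sensitive
ignore_case <- grepl("world", str, ignore.case = TRUE) # TRUE

# "." is a regular-expression wildcard that matches any character ...
any_char    <- grepl(".", "abc")                       # TRUE
# ... unless fixed = TRUE asks for a literal match.
literal_dot <- grepl(".", "abc", fixed = TRUE)         # FALSE

print(c(has_world, has_lower, ignore_case, any_char, literal_dot))
```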

Combine Two Strings


Use the paste() function to merge/concatenate two strings:
Example
str1 <- "Hello"
str2 <- "World"

paste(str1, str2)
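
By default paste() puts a single space between the pieces. A brief sketch of how the separator can be changed, using the sep argument and the related paste0() function (the result variable names are illustrative):

```r
str1 <- "Hello"
str2 <- "World"

with_space <- paste(str1, str2)             # "Hello World"
with_dash  <- paste(str1, str2, sep = "-")  # "Hello-World"
no_sep     <- paste0(str1, str2)            # "HelloWorld"

print(with_space)
print(with_dash)
print(no_sep)
```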

Escape Characters
To insert characters that are illegal in a string, you must use an escape character.
An escape character is a backslash \ followed by the character you want to insert.
An example of an illegal character is a double quote inside a string that is surrounded by double quotes:
Example
str <- "We are the so-called "Vikings", from the north."

str
Result:
Error: unexpected symbol in "str <- "We are the so-called "Vikings"

To fix this problem, use the escape character \":


Example
The escape character allows you to use double quotes when you normally would not be allowed:
str <- "We are the so-called \"Vikings\", from the north."

str
cat(str)
Note that auto-printing the str variable will print the backslash in the output. You can use the cat() function to
print it without backslash.
Other escape characters in R:
Code   Result
\\     Backslash
\n     New Line
\r     Carriage Return
\t     Tab
\b     Backspace

Booleans (Logical Values)


In programming, you often need to know if an expression is true or false.
You can evaluate any comparison in R and get one of two answers, TRUE or FALSE.
When you compare two values, the expression is evaluated and R returns the logical answer:
Example
10 > 9 # TRUE because 10 is greater than 9
10 == 9 # FALSE because 10 is not equal to 9
10 < 9 # FALSE because 10 is greater than 9

You can also compare two variables:


Example
a <- 10
b <- 9

a>b
You can also run a condition in an if statement, which you will learn much more about in the if..else chapter.
Example
a <- 200
b <- 33

if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}

Operators
Operators are used to perform operations on variables and values.
In the example below, we use the + operator to add together two values:
Example
10 + 5

R divides the operators into the following groups:
- Arithmetic operators
- Assignment operators
- Comparison operators
- Logical operators
- Miscellaneous operators

R Arithmetic Operators
Arithmetic operators are used with numeric values to perform common mathematical operations:
Operator   Name                                Example
+          Addition                            x + y
-          Subtraction                         x - y
*          Multiplication                      x * y
/          Division                            x / y
^          Exponent                            x ^ y
%%         Modulus (remainder from division)   x %% y
%/%        Integer division                    x %/% y
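
The last two operators are the least familiar. A short worked example with x = 17 and y = 5 (values chosen only for illustration):

```r
x <- 17
y <- 5

remainder <- x %% y    # 2, because 17 = 3 * 5 + 2
quotient  <- x %/% y   # 3, the whole number of times 5 fits into 17
power     <- x ^ 2     # 289

# Integer division and modulus together recover the original value.
recombined <- quotient * y + remainder

print(c(remainder, quotient, power, recombined))
```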

R Assignment Operators
Assignment operators are used to assign values to variables:
Example
my_var <- 3
my_var <<- 3
3 -> my_var
3 ->> my_var
my_var # print my_var
Note: <<- is the global assignment operator.

It is also possible to reverse the direction of the assignment operator:
x <- 3 is equal to 3 -> x

R Comparison Operators
Comparison operators are used to compare two values:
Operator   Name                       Example
==         Equal                      x == y
!=         Not equal                  x != y
>          Greater than               x > y
<          Less than                  x < y
>=         Greater than or equal to   x >= y
<=         Less than or equal to      x <= y

R Logical Operators
Logical operators are used to combine conditional statements:
Operator   Description
&          Element-wise logical AND. Returns TRUE for each position where both elements are TRUE
&&         Logical AND for single (length-one) values. Returns TRUE if both statements are TRUE
|          Element-wise logical OR. Returns TRUE for each position where at least one element is TRUE
||         Logical OR for single (length-one) values. Returns TRUE if at least one of the statements is TRUE
!          Logical NOT. Returns FALSE if the statement is TRUE, and TRUE if it is FALSE
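
The difference between the single and double forms is easiest to see side by side. In the sketch below (variable names are illustrative), & and | work position by position on vectors, while && works on single values and skips its right-hand side when the left-hand side already decides the answer.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise_and <- a & b   # TRUE FALSE FALSE
elementwise_or  <- a | b   # TRUE TRUE  TRUE
negated         <- !a      # FALSE TRUE FALSE

# && combines two single conditions, as in an if statement.
x <- 5
in_range <- x > 0 && x < 10   # TRUE

# The right-hand side is never evaluated here, so stop() does not run.
short_circuit <- FALSE && stop("not evaluated")   # FALSE

print(elementwise_and)
print(elementwise_or)
print(negated)
print(in_range)
print(short_circuit)
```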

R Miscellaneous Operators
Miscellaneous operators are used to manipulate data:
Operator   Description                                  Example
:          Creates a series of numbers in a sequence    x <- 1:10
%in%       Find out if an element belongs to a vector   x %in% y
%*%        Matrix multiplication                        x <- Matrix1 %*% Matrix2
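
A small sketch of the three operators in use. The matrices m1 and identity are assumptions for this example; multiplying by the identity matrix is used because the expected result is simply the original matrix.

```r
x <- 1:10                                   # the integers 1 to 10

is_member  <- 3 %in% x                      # TRUE
membership <- c(0, 5, 11) %in% x            # FALSE TRUE FALSE

m1 <- matrix(1:4, nrow = 2)                 # 2 x 2 matrix, filled column by column
identity <- diag(2)                         # 2 x 2 identity matrix
product <- m1 %*% identity                  # matrix product, equal to m1

print(is_member)
print(membership)
print(product)
```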

Conditions and If Statements


R supports the usual logical conditions from mathematics:
Operator   Name                       Example
==         Equal                      x == y
!=         Not equal                  x != y
>          Greater than               x > y
<          Less than                  x < y
>=         Greater than or equal to   x >= y
<=         Less than or equal to      x <= y


These conditions can be used in several ways, most commonly in "if statements" and loops.
An "if statement" is written with the if keyword, and it is used to specify a block of code to be executed if a
condition is TRUE:
Example
a <- 33
b <- 200

if (b > a) {
print("b is greater than a")
}

In this example we use two variables, a and b, which are used as a part of the if statement to test whether b is
greater than a. As a is 33, and b is 200, we know that 200 is greater than 33, and so we print to screen that "b is
greater than a".
R uses curly brackets { } to define the scope in the code.

Else If
The else if keyword is R's way of saying "if the previous conditions were not true, then try this condition":
Example
a <- 33
b <- 33

if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}

In this example a is equal to b, so the first condition is not true, but the else if condition is true, so we print to
screen that "a and b are equal".
You can use as many else if statements as you want in R.

If Else
The else keyword catches anything which isn't caught by the preceding conditions:
Example
a <- 200
b <- 33

if (b > a) {
print("b is greater than a")
} else if (a == b) {
print("a and b are equal")
} else {
print("a is greater than b")
}

In this example, a is greater than b, so the first condition is not true, also the else if condition is not true, so we go to
the else condition and print to screen that "a is greater than b".
You can also use else without else if:
Example
a <- 200
b <- 33
if (b > a) {
print("b is greater than a")
} else {
print("b is not greater than a")
}

Nested If Statements
You can also have if statements inside if statements. These are called nested if statements.
Example
x <- 41

if (x > 10) {
  print("Above ten")
  if (x > 20) {
    print("and also above 20!")
  } else {
    print("but not above 20.")
  }
} else {
  print("below 10.")
}

AND
The & symbol (and) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, AND if c is greater than a:
a <- 200
b <- 33
c <- 500

if (a > b & c > a) {
  print("Both conditions are true")
}

OR
The | symbol (or) is a logical operator, and is used to combine conditional statements:
Example
Test if a is greater than b, or if c is greater than a:
a <- 200
b <- 33
c <- 500

if (a > b | a > c) {
  print("At least one of the conditions is true")
}
Loops
Loops can execute a block of code repeatedly for as long as a specified condition holds.
Loops are handy because they save time, reduce errors, and they make code more readable.
This unit covers two loop commands (R also has a repeat loop, which is not covered here):
- while loops
- for loops

R While Loops
With the while loop we can execute a set of statements as long as a condition is TRUE:
Example
Print i as long as i is less than 6:
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
In the example above, the loop will continue to produce numbers ranging from 1 to 5. The loop will stop at 6
because 6 < 6 is FALSE.
The while loop requires the relevant variables to be ready. In this example we need to define an indexing variable, i,
which we set to 1.
Note: remember to increment i, or else the loop will continue forever.

Break
With the break statement, we can stop the loop even if the while condition is TRUE:
Example
Exit the loop if i is equal to 4.
i <- 1
while (i < 6) {
print(i)
i <- i + 1
if (i == 4) {
break
}
}
The loop will stop at 3 because we have chosen to finish the loop by using the break statement when i is equal to 4
(i == 4).

Next
With the next statement, we can skip an iteration without terminating the loop:
Example
Skip the value of 3:
i <- 0
while (i < 6) {
i <- i + 1
if (i == 3) {
next
}
print(i)
}
When the loop passes the value 3, it will skip it and continue to loop.

Yahtzee!
If .. Else Combined with a While Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!
Example
Print "Yahtzee!" if the dice number is 6:
dice <- 1
while (dice <= 6) {
if (dice < 6) {
print("No Yahtzee")
} else {
print("Yahtzee!")
}
dice <- dice + 1
}
If the loop passes the values ranging from 1 to 5, it prints "No Yahtzee". Whenever it passes the value 6, it prints
"Yahtzee!".

For Loops
A for loop is used for iterating over a sequence:

Example
for (x in 1:10) {
print(x)
}
This is less like the for keyword in other programming languages, and works more like an iterator method as found
in other object-orientated programming languages.

With the for loop we can execute a set of statements once for each item in a vector, array, list, etc.

You will learn more about vectors and lists later.

Example
Print every item in a list:

fruits <- list("apple", "banana", "cherry")

for (x in fruits) {
print(x)
}
Example
Print each number on a die:

dice <- c(1, 2, 3, 4, 5, 6)

for (x in dice) {
print(x)
}
Unlike the while loop, the for loop does not require an indexing variable to be set beforehand.

Break
With the break statement, we can stop the loop before it has looped through all the items:

Example
Stop the loop at "cherry":

fruits <- list("apple", "banana", "cherry")

for (x in fruits) {
if (x == "cherry") {
break
}
print(x)
}
The loop will stop at "cherry" because we have chosen to finish the loop by using the break statement when x is
equal to "cherry" (x == "cherry").

Next
With the next statement, we can skip an iteration without terminating the loop:

Example
Skip "banana":

fruits <- list("apple", "banana", "cherry")

for (x in fruits) {
if (x == "banana") {
next
}
print(x)
}
When the loop passes "banana", it will skip it and continue to loop.

Yahtzee!
If .. Else Combined with a For Loop
To demonstrate a practical example, let us say we play a game of Yahtzee!

Example
Print "Yahtzee!" if the dice number is 6:

dice <- 1:6

for(x in dice) {
if (x == 6) {
print(paste("The dice number is", x, "Yahtzee!"))
} else {
print(paste("The dice number is", x, "Not Yahtzee"))
}
}
If the loop reaches the values ranging from 1 to 5, it prints the number followed by "Not Yahtzee". When it reaches
the value 6, it prints the number followed by "Yahtzee!".

Nested Loops
You can also have a loop inside of a loop:

Example
Print the adjective of each fruit in a list:

adj <- list("red", "big", "tasty")

fruits <- list("apple", "banana", "cherry")


for (x in adj) {
for (y in fruits) {
print(paste(x, y))
}
}

Functions
A function is a block of code which only runs when it is called.
You can pass data, known as parameters, into a function.
A function can return data as a result.

Creating a Function
To create a function, use the function() keyword:

Example
my_function <- function() { # create a function with the name my_function
print("Hello World!")
}
Call a Function
To call a function, use the function name followed by parentheses, like my_function():

Example
my_function <- function() {
print("Hello World!")
}

my_function() # call the function named my_function


Arguments
Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you
want, just separate them with a comma.
The following example has a function with one argument (fname). When the function is called, we pass along a first
name, which is used inside the function to print the full name:

Example
my_function <- function(fname) {
paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")
Parameters or Arguments?
The terms "parameter" and "argument" can be used for the same thing: information that is passed into a function.

From a function's perspective:

A parameter is the variable listed inside the parentheses in the function definition.

An argument is the value that is sent to the function when it is called.

Number of Arguments
By default, a function must be called with the correct number of arguments. This means that if your function
expects 2 arguments, you have to call it with 2 arguments, not more and not fewer:

Example
This function expects 2 arguments, and gets 2 arguments:

my_function <- function(fname, lname) {


paste(fname, lname)
}

my_function("Peter", "Griffin")
If you try to call the function with 1 or 3 arguments, you will get an error:

Example
This function expects 2 arguments, and gets 1 argument:

my_function <- function(fname, lname) {


paste(fname, lname)
}

my_function("Peter")
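
To see the error without stopping a script, the failing call can be wrapped in tryCatch(), a base R function not otherwise used in these notes. This is only a sketch; the variable name msg is an assumption for the example.

```r
my_function <- function(fname, lname) {
  paste(fname, lname)
}

# Capture the error message instead of halting execution.
msg <- tryCatch(
  my_function("Peter"),                    # lname is missing
  error = function(e) conditionMessage(e)  # return the text of the error
)

# The message names the missing argument, lname.
print(msg)
```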
Default Parameter Value
The following example shows how to use a default parameter value.

If we call the function without an argument, it uses the default value:

Example
my_function <- function(country = "Norway") {
paste("I am from", country)
}

my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")

Return Values
To let a function return a result, use the return() function:

Example
my_function <- function(x) {
return (5 * x)
}

print(my_function(3))
print(my_function(5))
print(my_function(9))
The output of the code above will be:

[1] 15
[1] 25
[1] 45
Nested Functions
There are two ways to create a nested function:
- Call a function within another function.
- Write a function within a function.
Example
Call a function within another function:

Nested_function <- function(x, y) {


a <- x + y
return(a)
}

Nested_function(Nested_function(2,2), Nested_function(3,3))
Example Explained
The function tells x to add y.

The first input Nested_function(2,2) is "x" of the main function.

The second input Nested_function(3,3) is "y" of the main function.

The output is therefore (2+2) + (3+3) = 10.


Example
Write a function within a function:

Outer_func <- function(x) {


Inner_func <- function(y) {
a <- x + y
return(a)
}
return (Inner_func)
}
output <- Outer_func(3) # To call the Outer_func
output(5)
Example Explained
You cannot call Inner_func directly, because it has been defined (nested) inside Outer_func.

We need to call Outer_func first in order to call Inner_func as a second step.

Calling Outer_func(3) sets x to 3 and returns Inner_func, which we store in a new variable called output.

We then call output with the desired value of "y", which in this case is 5.

The output is therefore 8 (3 + 5).
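
A function that returns another function which remembers a value, as Outer_func does with x, is often called a closure. The sketch below shows the same idea in a more compact form; make_adder, add3 and add10 are hypothetical names introduced for this illustration.

```r
# Returns a new function that adds x to its argument.
make_adder <- function(x) {
  function(y) x + y
}

add3  <- make_adder(3)    # remembers x = 3
add10 <- make_adder(10)   # remembers x = 10

result3  <- add3(5)       # 8
result10 <- add10(5)      # 15

print(result3)
print(result10)
```

Each call to make_adder creates its own copy of x, so add3 and add10 do not interfere with each other.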

Recursion
R also accepts function recursion, which means a defined function can call itself.

Recursion is a common mathematical and programming concept. It has the benefit that you can loop through data to
reach a result.

The developer should be very careful with recursion as it can be quite easy to slip into writing a function which
never terminates, or one that uses excess amounts of memory or processor power. However, when written
correctly, recursion can be a very efficient and mathematically-elegant approach to programming.

In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We use the k variable as
the data, which decrements (-1) every time we recurse. The recursion ends when k is no longer greater than 0
(i.e. when it is 0).

To a new developer it can take some time to work out exactly how this works. The best way to find out is by testing
and modifying it.

Example
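tri_recursion <- function(k) {
  if (k > 0) {
    # Recursive case: add k to the result for k - 1
    result <- k + tri_recursion(k - 1)
    print(result)  # prints the running total and returns it invisibly
  } else {
    # Base case: stop recursing
    result <- 0
    return(result)
  }
}
tri_recursion(6)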
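
To separate the recursion from the printing, here is a sketch of a variant that only returns the final total. The name tri_sum is hypothetical; the logic is the same sum 1 + 2 + ... + k as in tri_recursion().

```r
# Sum of the integers from 1 to k, computed recursively.
tri_sum <- function(k) {
  if (k <= 0) {
    return(0)            # base case: nothing left to add
  }
  k + tri_sum(k - 1)     # recursive case
}

total <- tri_sum(6)
print(total)             # 21, i.e. 6 + 5 + 4 + 3 + 2 + 1
```

The result can be checked against the closed form k * (k + 1) / 2, which gives 6 * 7 / 2 = 21.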

Global Variables
Variables that are created outside of a function are known as global variables.
Global variables can be used by everyone, both inside of functions and outside.
Example
Create a variable outside of a function and use it inside the function:
txt <- "awesome"
my_function <- function() {
paste("R is", txt)
}

my_function()

If you create a variable with the same name inside a function, this variable will be local, and can only be used inside
the function. The global variable with the same name will remain as it was, global and with the original value.
Example
Create a variable inside of a function with the same name as the global variable:
txt <- "global variable"
my_function <- function() {
txt = "fantastic"
paste("R is", txt)
}

my_function()

txt # print txt

If you try to print txt, it will return "global variable" because we are printing txt outside the function.

The Global Assignment Operator


Normally, when you create a variable inside a function, that variable is local, and can only be used inside that
function.
To create a global variable inside a function, you can use the global assignment operator <<-
Example
If you use the assignment operator <<-, the variable belongs to the global scope:
my_function <- function() {
txt <<- "fantastic"
paste("R is", txt)
}

my_function()

print(txt)

Also, use the global assignment operator if you want to change a global variable inside a function:
Example
To change the value of a global variable inside a function, refer to the variable by using the global assignment
operator <<-:
txt <- "awesome"
my_function <- function() {
txt <<- "fantastic"
paste("R is", txt)
}

my_function()

paste("R is", txt)
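
The contrast between <- and <<- inside a function can be checked directly. In this sketch, counter, bump_local and bump_global are hypothetical names used only for illustration.

```r
counter <- 0

# `<-` creates a local copy; the outer counter is untouched.
bump_local <- function() {
  counter <- counter + 1
  counter
}

# `<<-` updates the counter defined outside the function.
bump_global <- function() {
  counter <<- counter + 1
  counter
}

local_result <- bump_local()    # 1
after_local  <- counter         # still 0

global_result <- bump_global()  # 1
after_global  <- counter        # now 1

print(c(local_result, after_local, global_result, after_global))
```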

Vectors
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c() function and separate the items by a comma.
In the example below, we create a vector variable called fruits that combines strings:
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits

In this example, we create a vector that combines numerical values:


Example
# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers

numbers
To create a vector with numerical values in a sequence, use the : operator:
Example
# Vector with numerical values in a sequence
numbers <- 1:10

numbers
You can also create numerical values with decimals in a sequence, but note that if the last element does not belong
to the sequence, it is not used:
Example
# Vector with numerical decimals in a sequence
numbers1 <- 1.5:6.5
numbers1

# Vector with numerical decimals in a sequence where the last element is not used
numbers2 <- 1.5:6.3
numbers2
Result:
[1] 1.5 2.5 3.5 4.5 5.5 6.5
[1] 1.5 2.5 3.5 4.5 5.5

In the example below, we create a vector of logical values:


Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)

log_values

Vector Length
To find out how many items a vector has, use the length() function:
Example
fruits <- c("banana", "apple", "orange")

length(fruits)

Sort a Vector
To sort items in a vector alphabetically or numerically, use the sort() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

sort(fruits) # Sort a string


sort(numbers) # Sort numbers

Access Vectors
You can access vector items by referring to their index numbers inside brackets []. The first item has index 1, the
second item has index 2, and so on:
Example
fruits <- c("banana", "apple", "orange")

# Access the first item (banana)


fruits[1]

You can also access multiple elements by referring to different index positions with the c() function:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)


fruits[c(1, 3)]

You can also use negative index numbers to access all items except the ones specified:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access all items except for the first item


fruits[c(-1)]
Change an Item
To change the value of a specific item, refer to the index number:
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Change "banana" to "pear"


fruits[1] <- "pear"

# Print fruits
fruits

Repeat Vectors
To repeat vectors, use the rep() function:
Example
Repeat each value:
repeat_each <- rep(c(1,2,3), each = 3)

repeat_each
Example
Repeat the sequence of the vector:
repeat_times <- rep(c(1,2,3), times = 3)

repeat_times
Example
Repeat each value independently:
repeat_indepent <- rep(c(1,2,3), times = c(5,2,1))

repeat_indepent

Generating Sequenced Vectors


An earlier example showed how to create a vector with numerical values in a sequence using
the : operator:
Example
numbers <- 1:10

numbers
To make bigger or smaller steps in a sequence, use the seq() function:
Example
numbers <- seq(from = 0, to = 100, by = 20)
numbers

Note: The seq() call above uses three parameters: from is where the sequence starts, to is where the sequence stops,
and by is the interval of the sequence.
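
A few more ways to call seq(), shown as a sketch. The length.out argument and the seq_len() function are standard base R but are not mentioned elsewhere in these notes; the variable names are illustrative.

```r
# Ask for a number of points instead of a step size.
five_points <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# A negative step counts downwards.
countdown <- seq(from = 10, to = 0, by = -2.5)         # 10.0 7.5 5.0 2.5 0.0

# seq_len(n) gives the integers 1, 2, ..., n.
first_four <- seq_len(4)                               # 1 2 3 4

print(five_points)
print(countdown)
print(first_four)
```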

Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double,
complex, character and raw.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector
types.
# Atomic vector of type character.
print("abc");

# Atomic vector of type double.


print(12.5)

# Atomic vector of type integer.


print(63L)

# Atomic vector of type logical.


print(TRUE)

# Atomic vector of type complex.


print(2+3i)

# Atomic vector of type raw.


print(charToRaw('hello'))
When we execute the above code, it produces the following result −
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f

Multiple Elements Vector


Using colon operator with numeric data
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)

# Creating a sequence from 6.6 to 12.6.


v <- 6.6:12.6
print(v)

# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
When we execute the above code, it produces the following result −
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8

Using the seq() function


# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
When we execute the above code, it produces the following result −
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0

Using the c() function


The non-character values are coerced to character type if one of the elements is a character.
# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)
When we execute the above code, it produces the following result −
[1] "apple" "red" "5" "TRUE"
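
Coercion in c() is not limited to character data: when types are mixed, R converts every element to the most general type present (logical, then integer, then double, then character). A minimal sketch with illustrative variable names:

```r
# A logical, an integer and a double: everything becomes double, TRUE becomes 1.
mixed_numeric <- c(TRUE, 2L, 3.5)
print(mixed_numeric)
print(typeof(mixed_numeric))

# Adding a single string turns every element into a string.
mixed_text <- c(1.5, "a")
print(mixed_text)
```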

Accessing Vector Elements


Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing. Indexing starts at
position 1. A negative index drops that element from the result. A logical vector (TRUE/FALSE) can also be used as
an index; it keeps the elements whose position is TRUE. In a numeric index, 0 selects nothing and 1 selects the
first element.
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)

# Accessing vector elements using logical indexing.


v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)

# Accessing vector elements using negative indexing.


x <- t[c(-2,-5)]
print(x)

# Numeric indexing with 0 and 1: each 0 selects nothing, so only element 1 is returned.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
When we execute the above code, it produces the following result −
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
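
Logical indexing is most often used with a condition rather than a hand-written TRUE/FALSE vector. The following sketch uses a hypothetical scores vector to show the idea:

```r
scores <- c(12, 7, 25, 3, 18)

# The comparison yields a logical vector, which is then used as the index.
above_ten <- scores[scores > 10]
print(above_ten)

# which() returns the positions where the condition holds.
positions <- which(scores > 10)
print(positions)
```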

Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)

# Vector addition.
add.result <- v1+v2
print(add.result)

# Vector subtraction.
sub.result <- v1-v2
print(sub.result)

# Vector multiplication.
multi.result <- v1*v2
print(multi.result)

# Vector division.
divi.result <- v1/v2
print(divi.result)
When we execute the above code, it produces the following result −
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are
recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)

add.result <- v1+v2


print(add.result)

sub.result <- v1-v2


print(sub.result)
When we execute the above code, it produces the following result −
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
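
Recycling also explains why a vector can be combined with a single number, and R warns when the longer length is not a multiple of the shorter one. A short sketch (the tryCatch() wrapper is only there to make the warning visible as a value):

```r
v1 <- c(3, 8, 4, 5, 0, 11)

# A single number is a vector of length 1, so it is recycled for every element.
doubled <- v1 * 2
print(doubled)

# Lengths 6 and 4: R still recycles, but raises a warning.
warned <- tryCatch({
  v1 + c(1, 2, 3, 4)
  FALSE
}, warning = function(w) TRUE)
print(warned)
```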
Vector Element Sorting
Elements in a vector can be sorted using the sort() function.
v <- c(3,8,4,5,0,11, -9, 304)

# Sort the elements of the vector.


sort.result <- sort(v)
print(sort.result)

# Sort the elements in the reverse order.


revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)

# Sorting character vectors.


v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)

# Sorting character vectors in reverse order.


revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
When we execute the above code, it produces the following result −
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
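
When one vector has to be sorted according to the values of another, order() is the usual companion to sort(): it returns the positions that would sort its argument. The data below are illustrative only.

```r
emp_names <- c("Rick", "Dan", "Michelle")
salaries <- c(623.3, 515.2, 611.0)

# order() gives the index positions from lowest to highest salary.
idx <- order(salaries)
print(idx)

# Use those positions to reorder the names.
by_salary <- emp_names[idx]
print(by_salary)
```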

Lists are R objects which can contain elements of different types, such as numbers, strings, vectors and even
another list. A list can also contain a matrix or a function as an element. A list is created using the list() function.

Creating a List
Following is an example to create a list containing strings, numbers, a vector and a logical value.

# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
When we execute the above code, it produces the following result −
[[1]]
[1] "Red"

[[2]]
[1] "Green"

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE

[[5]]
[1] 51.23

[[6]]
[1] 119.1

Naming List Elements


The list elements can be given names and they can be accessed using these names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Show the list.
print(list_data)
When we execute the above code, it produces the following result −
$`1st Quarter`
[1] "Jan" "Feb" "Mar"

$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8

$`A Inner list`
$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]
[1] 12.3

Accessing List Elements


Elements of the list can be accessed by the index of the element in the list. In case of named lists it can also be
accessed using the names.
We continue to use the list in the above example −
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Access the first element of the list.


print(list_data[1])

# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])

# Access the list element using the name of the element.


print(list_data$A_Matrix)
When we execute the above code, it produces the following result −
$`1st Quarter`
[1] "Jan" "Feb" "Mar"

$`A Inner list`
$`A Inner list`[[1]]
[1] "green"

$`A Inner list`[[2]]
[1] 12.3

[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
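
Single brackets return a sub-list, while double brackets (or $) return the element itself. The sketch below uses a small hypothetical list to show the difference:

```r
quarter <- list(months = c("Jan", "Feb", "Mar"), total = 42)

# [ ] keeps the list wrapper.
sub_list <- quarter["months"]
print(class(sub_list))

# [[ ]] and $ extract the element stored inside.
months_vec <- quarter[["months"]]
print(class(months_vec))
print(quarter$total)
```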

Manipulating List Elements


We can add, delete and update list elements as shown below. Elements can be added, removed or updated at any
position; the example adds and removes an element at the end of the list and updates the third element.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))

# Give names to the elements in the list.


names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# Add element at the end of the list.


list_data[4] <- "New element"
print(list_data[4])

# Remove the last element.


list_data[4] <- NULL

# Print the 4th Element.


print(list_data[4])

# Update the 3rd Element.


list_data[3] <- "updated element"
print(list_data[3])
When we execute the above code, it produces the following result −
[[1]]
[1] "New element"

$<NA>
NULL

$`A Inner list`


[1] "updated element"
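
Insertion and deletion also work at positions other than the end. The following sketch, with an illustrative colours list, uses base R's append() with its after argument and a NULL assignment:

```r
colours <- list("red", "green", "blue")

# Insert a new element after position 1.
colours <- append(colours, list("yellow"), after = 1)
print(unlist(colours))

# Remove the element that is now at position 3 ("green").
colours[3] <- NULL
print(unlist(colours))
```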

Merging Lists
You can merge many lists into one list by passing all of them to the c() function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")

# Merge the two lists.


merged.list <- c(list1,list2)

# Print the merged list.


print(merged.list)

When we execute the above code, it produces the following result −


[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] "Sun"

[[5]]
[1] "Mon"

[[6]]
[1] "Tue"
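
The choice of c() over list() matters here. As a quick check, c() produces one flat list, whereas list() would nest the two lists as two elements:

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

# c() concatenates the elements: six elements in one list.
flat <- c(list1, list2)
print(length(flat))

# list() wraps the two lists: two elements, each itself a list.
nested <- list(list1, list2)
print(length(nested))
```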

Converting List to Vector


A list can be converted to a vector so that the elements of the vector can be used for further manipulation. All the
arithmetic operations on vectors can be applied after the list is converted into vectors. To do this conversion, we
use the unlist() function. It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)

list2 <-list(10:14)
print(list2)

# Convert the lists to vectors.


v1 <- unlist(list1)
v2 <- unlist(list2)

print(v1)
print(v2)

# Now add the vectors


result <- v1+v2
print(result)
When we execute the above code, it produces the following result −
[[1]]
[1] 1 2 3 4 5

[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
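
Because a vector holds a single type, unlist() applies the same coercion rules as c() when the list is mixed, and it carries element names over. A minimal sketch with illustrative values:

```r
# Mixed types are coerced to character.
mixed <- unlist(list(1, "a", TRUE))
print(mixed)

# Names of list elements become names of the vector.
named <- unlist(list(a = 1, b = 2))
print(named)
```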

Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout. They
contain elements of the same atomic type. A matrix can hold only characters or only logical values, but matrices
of numeric elements are the most common because they are used in mathematical calculations.
A matrix is created using the matrix() function.

Syntax
The basic syntax for creating a matrix in R is −
matrix(data, nrow, ncol, byrow, dimnames)
Following is the description of the parameters used −
 data is the input vector which becomes the data elements of the matrix.
 nrow is the number of rows to be created.
 ncol is the number of columns to be created.
 byrow is a logical value. If TRUE then the input vector elements are arranged by row.
 dimnames is a list of the names assigned to the rows and columns.

Example
Create a matrix taking a vector of numbers as input.
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)

# Elements are arranged sequentially by column.


N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)

# Define the column and row names.


rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))


print(P)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] 3 4 5
[2,] 6 7 8
[3,] 9 10 11
[4,] 12 13 14
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14
col1 col2 col3
row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
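
The shape of a matrix can be inspected with dim(), and t() swaps rows and columns. A brief sketch based on the matrix M from the example:

```r
M <- matrix(3:14, nrow = 4, byrow = TRUE)

# dim() returns the number of rows and columns.
print(dim(M))

# t() transposes the matrix: 4 x 3 becomes 3 x 4.
M_t <- t(M)
print(dim(M_t))

# The element at row 2, column 3 moves to row 3, column 2.
print(M[2, 3])
print(M_t[3, 2])
```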

Accessing Elements of a Matrix


Elements of a matrix can be accessed by using the column and row index of the element. We consider the matrix P
above to find the specific elements below.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

# Create the matrix.


P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))

# Access the element at 3rd column and 1st row.


print(P[1,3])

# Access the element at 2nd column and 4th row.


print(P[4,2])

# Access only the 2nd row.


print(P[2,])

# Access only the 3rd column.


print(P[,3])
When we execute the above code, it produces the following result −
[1] 5
[1] 13
col1 col2 col3
6 7 8
row1 row2 row3 row4
5 8 11 14

Matrix Computations
Various mathematical operations are performed on matrices using the R arithmetic operators, which work element
by element. The result of the operation is also a matrix.
The dimensions (number of rows and columns) should be the same for the matrices involved in the operation.

Matrix Addition & Subtraction


# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)


print(matrix2)
# Add the matrices.
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)

# Subtract the matrices


result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of addition
[,1] [,2] [,3]
[1,] 8 -1 5
[2,] 11 13 10
Result of subtraction
[,1] [,2] [,3]
[1,] -2 -1 -1
[2,] 7 -5 2

Matrix Multiplication & Division


# Create two 2x3 matrices.
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)

matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)


print(matrix2)

# Multiply the matrices.


result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)

# Divide the matrices


result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] 3 -1 2
[2,] 9 4 6
[,1] [,2] [,3]
[1,] 5 0 3
[2,] 2 9 4
Result of multiplication
[,1] [,2] [,3]
[1,] 15 0 6
[2,] 18 36 24
Result of division
[,1] [,2] [,3]
[1,] 0.6 -Inf 0.6666667
[2,] 4.5 0.4444444 1.5000000
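
The * operator above multiplies matching elements. The algebraic matrix product uses the %*% operator instead and needs the column count of the first matrix to equal the row count of the second. The sketch below transposes matrix2 purely so that the shapes fit:

```r
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)   # 2 x 3
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)    # 2 x 3

# A 2 x 3 matrix times a 3 x 2 matrix gives a 2 x 2 matrix.
product <- matrix1 %*% t(matrix2)
print(product)
```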

Arrays are the R data objects which can store data in more than two dimensions. For example − If we create an
array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns. Arrays can
store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter
to create an array.
Example
The following example creates an array of two matrices, each with 3 rows and 3 columns. The nine input values
are recycled to fill both matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)

# Take these vectors as input to the array.


result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
When we execute the above code, it produces the following result −
, , 1

[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15

, , 2

[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15

Naming Columns and Rows


We can give names to the rows, columns and matrices in the array by using the dimnames parameter.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,column.names,
matrix.names))
print(result)
When we execute the above code, it produces the following result −
, , Matrix1

COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15

, , Matrix2

COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15

Accessing Array Elements


# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2","ROW3")
matrix.names <- c("Matrix1","Matrix2")

# Take these vectors as input to the array.


result <- array(c(vector1,vector2),dim = c(3,3,2),dimnames = list(row.names,
column.names, matrix.names))

# Print the third row of the second matrix of the array.


print(result[3,,2])

# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])

# Print the 2nd Matrix.


print(result[,,2])
When we execute the above code, it produces the following result −
COL1 COL2 COL3
3 12 15
[1] 13
COL1 COL2 COL3
ROW1 5 10 13
ROW2 9 11 14
ROW3 3 12 15
Manipulating Array Elements
As an array is made up of matrices in multiple dimensions, the operations on elements of an array are carried out by
accessing elements of the matrices.

# Create two vectors of different lengths.


vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)

# Take these vectors as input to the array.


array1 <- array(c(vector1,vector2),dim = c(3,3,2))

# Create two more vectors of different lengths.
vector3 <- c(9,1,0)
vector4 <- c(6,0,11,3,14,1,2,6,9)
array2 <- array(c(vector3,vector4),dim = c(3,3,2))

# Create matrices from these arrays.
matrix1 <- array1[,,2]
matrix2 <- array2[,,2]

# Add the matrices.
result <- matrix1+matrix2
print(result)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] 7 19 19
[2,] 15 12 14
[3,] 12 12 26
Calculations Across Array Elements
We can do calculations across the elements in an array using the apply() function.
Syntax
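apply(x, margin, fun)
Following is the description of the parameters used −
 x is an array.
 margin is a vector giving the dimensions over which the function is applied: 1 indicates rows, 2 indicates
columns, and c(1, 2) indicates rows and columns.
 fun is the function to be applied across the elements of the array.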
Example
We use the apply() function below to calculate the sum of the elements in the rows of an array across all the
matrices.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)

# Take these vectors as input to the array.


new.array <- array(c(vector1,vector2),dim = c(3,3,2))
print(new.array)

# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −
, , 1

[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15

, , 2

[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15

[1] 56 68 60
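
The second argument of apply() selects the dimension to keep. As a sketch on the same array, a margin of 2 gives one sum per column, and c(1, 2) gives one sum per cell across the two matrices:

```r
new.array <- array(c(5, 9, 3, 10, 11, 12, 13, 14, 15), dim = c(3, 3, 2))

# Margin 2: sum over rows and matrices, one value per column.
col.sums <- apply(new.array, 2, sum)
print(col.sums)

# Margin c(1, 2): sum over the third dimension, one value per row/column cell.
cell.sums <- apply(new.array, c(1, 2), sum)
print(cell.sums)
```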

Factors are the data objects which are used to categorize the data and store it as levels. They can store both strings
and integers. They are useful in columns which have a limited number of unique values, like "Male", "Female"
or TRUE, FALSE. They are useful in data analysis for statistical modeling.
Factors are created using the factor() function by taking a vector as input.
Example

# Create a vector as input.


data <- c("East","West","East","North","North","East","West","West","West","East","North")

print(data)
print(is.factor(data))

# Apply the factor function.


factor_data <- factor(data)

print(factor_data)
print(is.factor(factor_data))
When we execute the above code, it produces the following result −
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
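
A factor stores each value as an integer code that points into its set of levels. The sketch below reuses the example data to show the levels, the count per level and the underlying codes:

```r
data <- c("East","West","East","North","North","East","West","West","West","East","North")
factor_data <- factor(data)

# The distinct categories, sorted alphabetically by default.
level_names <- levels(factor_data)
print(level_names)

# Number of observations in each level.
level_counts <- table(factor_data)
print(level_counts)

# The integer codes stored internally (1 = East, 2 = North, 3 = West).
codes <- as.integer(factor_data)
print(codes)
```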
Factors in Data Frame
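On creating a data frame with a column of text data, R versions before 4.0.0 treat the text column as categorical
data and convert it to a factor by default. From R 4.0.0 onwards the default is stringsAsFactors = FALSE, so the
conversion has to be requested explicitly, as in the example below.

# Create the vectors for data frame.
height <- c(132,151,162,139,166,147,122)
weight <- c(48,49,66,53,67,52,40)
gender <- c("male","male","female","female","male","female","male")

# Create the data frame, converting the text column to a factor.
input_data <- data.frame(height,weight,gender,stringsAsFactors = TRUE)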
print(input_data)

# Test if the gender column is a factor.


print(is.factor(input_data$gender))

# Print the gender column to see the levels.


print(input_data$gender)
When we execute the above code, it produces the following result −
height weight gender
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
[1] TRUE
[1] male male female female male female male
Levels: female male

Changing the Order of Levels


The order of the levels in a factor can be changed by applying the factor function again with new order of the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)

# Apply the factor function with required order of the level.


new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
When we execute the above code, it produces the following result −
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
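
When the order of the levels is meaningful, the factor can also be declared as ordered, which allows comparisons between values. This is a sketch with a hypothetical sizes vector, not part of the original example:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# Comparisons follow the declared level order, not the alphabet.
is_below_large <- sizes < "large"
print(is_below_large)

# max() returns the highest level that occurs in the data.
largest <- max(sizes)
print(largest)
```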
Generating Factor Levels
We can generate factor levels by using the gl() function. It takes two integers as input, which indicate how many
levels there are and how many times each level is repeated.
Syntax
gl(n, k, labels)
Following is the description of the parameters used −
 n is an integer giving the number of levels.
 k is an integer giving the number of replications.
 labels is a vector of labels for the resulting factor levels.
Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
When we execute the above code, it produces the following result −
 [1] Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston
[10] Boston Boston Boston
Levels: Tampa Seattle Boston
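
gl() also has a length argument, which repeats the pattern until the requested length is reached. The sketch below uses illustrative labels to generate an alternating factor and a blocked one:

```r
# One replication per level, repeated up to length 6: the levels alternate.
treatment <- gl(2, 1, length = 6, labels = c("control", "treated"))
print(treatment)

# Two replications per level: each level appears in a block of two.
block <- gl(3, 2, labels = c("B1", "B2", "B3"))
print(block)
```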

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one
variable and each row contains one set of values from each column.
Following are the characteristics of a data frame.
 The column names should be non-empty.
 The row names should be unique.
 The data stored in a data frame can be of numeric, factor or character type.
 Each column should contain same number of data items.

Create Data Frame


# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27

Get the Structure of the Data Frame


The structure of the data frame can be seen by using str() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)
When we execute the above code, it produces the following result −
'data.frame': 5 obs. of 4 variables:
$ emp_id : int 1 2 3 4 5
$ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
$ salary : num 623 515 611 729 843
$ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying summary() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date
Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
Median :3 Mode :character Median :623.3 Median :2014-05-11
Mean :3 Mean :664.4 Mean :2014-01-14
3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
Max. :5 Max. :843.2 Max. :2015-03-27
Extract Data from Data Frame
Extract specific column from a data frame using column name.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
When we execute the above code, it produces the following result −
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
Extract the first two rows and then all columns

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date
1 1 Rick 623.3 2012-01-01
2 2 Dan 515.2 2013-09-23
Extract 3rd and 5th row with 2nd and 4th column

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result −
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
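
Rows can also be selected by a condition instead of by position. The following sketch, using a shortened copy of the example data frame, shows bracket indexing and the equivalent base R subset() call:

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# Keep the rows where salary exceeds 700, and only two of the columns.
high.paid <- emp.data[emp.data$salary > 700, c("emp_name", "salary")]
print(high.paid)

# subset() expresses the same selection without repeating the data frame name.
same.result <- subset(emp.data, salary > 700, select = c(emp_name, salary))
print(same.result)
```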

Expand Data Frame


A data frame can be expanded by adding columns and rows.

Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Add the "dept" column.


emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure
as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to create the
final data frame.

# Create the first data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Finance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.


emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)
When we execute the above code, it produces the following result −
emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance

R packages are collections of R functions, compiled code and sample data. They are stored under a directory
called "library" in the R environment. By default, R installs a set of packages during installation. More packages
can be added later, when they are needed for some specific purpose. When we start the R console, only the default
packages are loaded. Other packages which are already installed have to be loaded explicitly before the R program
can use them.
All the packages available in R language are listed at R Packages.
Below is a list of commands to be used to check, verify and use the R packages.

Check Available R Packages


Get library locations containing R packages
.libPaths()
When we execute the above code, it produces the following result. It may vary depending on the local settings of
your pc.
[2] "C:/Program Files/R/R-3.2.2/library"
Get the list of all the packages installed

library()
When we execute the above code, it lists the packages installed in each library. The list may vary depending on
the local settings of your PC.
Install a New Package
There are two ways to add new R packages. One is installing directly from the CRAN directory and another is
downloading the package to your local system and installing it manually.
Install directly from CRAN
The following command gets the packages directly from CRAN webpage and installs the package in the R
environment. You may be prompted to choose a nearest mirror. Choose the one appropriate to your location.
install.packages("Package Name")

# Install the package named "XML".


install.packages("XML")
Install package manually
Go to the link R Packages to download the package needed. Save the package file in a suitable location on the local
system. On Windows, a binary package is a .zip file; a source package is a .tar.gz file.
Now you can run the following command to install this package in the R environment. With repos = NULL, R installs
from the local file and infers the package type from the file extension.
install.packages(file_name_with_path, repos = NULL)

# Install the package named "XML" from a downloaded Windows binary (.zip).
install.packages("E:/XML_3.98-1.3.zip", repos = NULL)
Load Package to Library
Before a package can be used in the code, it must be loaded into the current R session. This also applies to
packages that were installed earlier: each new session has to load them again.
A package is loaded using the following command −
library("package Name", lib.loc = "path to library")

# Load the package named "XML"
library("XML")
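
A common pattern is to install a package only when it is missing and then load it. The sketch below uses the base package "stats" as a stand-in so that it runs without a network connection; replace the value of pkg with the package you actually need.

```r
# "stats" ships with R; it is used here only as a placeholder package name.
pkg <- "stats"

# requireNamespace() returns FALSE if the package is not installed.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable.
library(pkg, character.only = TRUE)

# search() lists the attached packages.
is_attached <- paste0("package:", pkg) %in% search()
print(is_attached)
```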

Data Reshaping in R is about changing the way data is organized into rows and columns. Most of the time data
processing in R is done by taking the input data as a data frame. It is easy to extract data from the rows and
columns of a data frame but there are situations when we need the data frame in a format that is different from
format in which we received it. R has many functions to split, merge and change the rows to columns and vice-
versa in a data frame.
Joining Columns and Rows in a Data Frame
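We can join multiple vectors column-wise using the cbind() function. Note that cbind() on plain vectors returns a
character matrix rather than a data frame, which is why the first result below is printed with quotes. We can then
combine the rows of this matrix and a data frame using the rbind() function, which returns a data frame.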

# Create vector objects.


city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
zipcode <- c(33602,98104,06161,80294)

# Combine above three vectors into one data frame.


addresses <- cbind(city,state,zipcode)

# Print a header.
cat("# # # # The First data frame\n")

# Print the data frame.


print(addresses)

# Create another data frame with similar columns


new.address <- data.frame(
city = c("Lowry","Charlotte"),
state = c("CO","FL"),
zipcode = c("80230","33949"),
stringsAsFactors = FALSE
)

# Print a header.
cat("# # # The Second data frame\n")

# Print the data frame.


print(new.address)

# Combine rows from both the data frames.


all.addresses <- rbind(addresses,new.address)

# Print a header.
cat("# # # The combined data frame\n")

# Print the result.


print(all.addresses)
When we execute the above code, it produces the following result −
# # # # The First data frame
city state zipcode
[1,] "Tampa" "FL" "33602"
[2,] "Seattle" "WA" "98104"
[3,] "Hartford" "CT" "6161"
[4,] "Denver" "CO" "80294"

# # # The Second data frame


city state zipcode
1 Lowry CO 80230
2 Charlotte FL 33949

# # # The combined data frame


city state zipcode
1 Tampa FL 33602
2 Seattle WA 98104
3 Hartford CT 6161
4 Denver CO 80294
5 Lowry CO 80230
6 Charlotte FL 33949
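
Two details of the output above are worth a closer look: cbind() turned every value into a string, and the zip code 06161 lost its leading zero because it was entered as a number. One way to avoid both, sketched here under the assumption that zip codes should be stored as text, is to build the table with data.frame():

```r
city <- c("Tampa", "Seattle", "Hartford", "Denver")
state <- c("FL", "WA", "CT", "CO")
# Zip codes are labels, not quantities, so they are kept as strings.
zipcode <- c("33602", "98104", "06161", "80294")

# data.frame() keeps each column's own type.
addresses <- data.frame(city, state, zipcode, stringsAsFactors = FALSE)
print(addresses)

# cbind() on the same vectors gives a character matrix instead.
as_matrix <- cbind(city, state, zipcode)
print(class(as_matrix))
```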
Merging Data Frames
We can merge two data frames by using the merge() function. By default the merge uses the columns whose names
appear in both data frames; the by.x and by.y arguments name the key columns explicitly.
In the example below, we consider the data sets about Diabetes in Pima Indian Women available in the library
named "MASS". We merge the two data sets based on the values of blood pressure ("bp") and body mass
index ("bmi"). On choosing these two columns for merging, the records where values of these two variables match
in both data sets are combined together to form a single data frame.

library(MASS)
merged.Pima <- merge(x = Pima.te, y = Pima.tr,
by.x = c("bp", "bmi"),
by.y = c("bp", "bmi")
)
print(merged.Pima)
nrow(merged.Pima)
When we execute the above code, it produces the following result −
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
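
The matching behaviour is easier to see on a small table. In the sketch below, the two data frames and their contents are made up for illustration; the default merge keeps only the keys found in both, while all = TRUE keeps every key and fills the gaps with NA:

```r
employees <- data.frame(emp_id = c(1, 2, 3),
                        emp_name = c("Rick", "Dan", "Michelle"),
                        stringsAsFactors = FALSE)
departments <- data.frame(emp_id = c(2, 3, 4),
                          dept = c("Operations", "IT", "HR"),
                          stringsAsFactors = FALSE)

# Default merge: only emp_id values present in both data frames.
inner <- merge(employees, departments, by = "emp_id")
print(inner)

# all = TRUE: every emp_id from either data frame, with NA where data is missing.
outer <- merge(employees, departments, by = "emp_id", all = TRUE)
print(outer)
```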
Melting and Casting
One of the most interesting aspects of R programming is changing the shape of the data in multiple steps to
get a desired shape. The functions used to do this are called melt() and cast(). They are provided by the reshape
package rather than by base R, so that package must be installed and loaded first.
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 60 127 0
2 A 60 75 63 0
3 A 65 60 1095 3
4 A 65 75 1095 4
5 A 70 60 1512 6
.............
.............
8 A 75 75 2244 11
9 B 60 60 44882 39
10 B 60 75 17176 29
11 B 65 60 28609 58
............
............
17 C 60 60 1179 1
18 C 60 75 552 1
19 C 65 60 781 0
............
............
Melt the Data
Now we melt the data to organize it, converting all columns other than type and year into multiple rows.
library(reshape)  # provides melt() and cast(); install it first if needed
molten.ships <- melt(ships, id = c("type","year"))
print(molten.ships)
When we execute the above code, it produces the following result −
type year variable value
1 A 60 period 60
2 A 60 period 75
3 A 65 period 60
4 A 65 period 75
............
............
9 B 60 period 60
10 B 60 period 75
11 B 65 period 60
12 B 65 period 75
13 B 70 period 60
...........
...........
41 A 60 service 127
42 A 60 service 63
43 A 65 service 1095
...........
...........
70 D 70 service 1208
71 D 75 service 0
72 D 75 service 2051
73 E 60 service 45
74 E 60 service 0
75 E 65 service 789
...........
...........
101 C 70 incidents 6
102 C 70 incidents 2
103 C 75 incidents 0
104 C 75 incidents 1
105 D 60 incidents 0
106 D 60 incidents 0
...........
...........
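To make the reshaping step concrete, the sketch below reproduces the idea of melting with base R only. It is not how melt() is implemented; it swaps in stack() on a small data frame. The name wide and the three rows in it are assumptions chosen for illustration (the values are copied from the ships listing above).

```r
# A small illustrative data frame shaped like ships (three rows only).
wide <- data.frame(
  type      = c("A", "A", "B"),
  year      = c(60, 60, 65),
  period    = c(60, 75, 60),
  service   = c(127, 63, 28609),
  incidents = c(0, 0, 58),
  stringsAsFactors = FALSE
)

id_cols      <- c("type", "year")
measure_cols <- setdiff(names(wide), id_cols)

# stack() puts every measured value into one column ("values") and
# records the name of the source column in a factor ("ind").
stacked <- stack(wide[measure_cols])

# Repeat the id columns once per measured column and attach them.
ids <- wide[rep(seq_len(nrow(wide)), times = length(measure_cols)), id_cols]
molten <- data.frame(ids, variable = stacked$ind, value = stacked$values)
rownames(molten) <- NULL

print(molten)
```

Each original row now appears once per measured column, which is the long layout shown in the melt() output above.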
Cast the Molten Data
We can cast the molten data into a new form in which the values are summed for each combination of ship type and
year. This is done using the cast() function.
recasted.ship <- cast(molten.ships, type + year ~ variable, sum)
print(recasted.ship)
When we execute the above code, it produces the following result −
type year period service incidents
1 A 60 135 190 0
2 A 65 135 2190 7
3 A 70 135 4865 24
4 A 75 135 2244 11
5 B 60 135 62058 68
6 B 65 135 48979 111
7 B 70 135 20163 56
8 B 75 135 7117 18
9 C 60 135 1731 2
10 C 65 135 1457 1
11 C 70 135 2731 8
12 C 75 135 274 1
13 D 60 135 356 0
14 D 65 135 480 0
15 D 70 135 1557 13
16 D 75 135 2051 4
17 E 60 135 45 0
18 E 65 135 1226 14
19 E 70 135 3318 17
20 E 75 135 542 1
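As a cross-check on the totals above, the same kind of per-group sum can be sketched in base R with aggregate(), working directly on the wide data. This is an alternative technique, not what cast() does internally. The data frame ships_small is an assumption: it holds six rows copied from the ships listing shown earlier.

```r
# Six rows taken from the ships listing (types A and B only).
ships_small <- data.frame(
  type      = c("A", "A", "A", "A", "B", "B"),
  year      = c(60, 60, 65, 65, 60, 60),
  period    = c(60, 75, 60, 75, 60, 75),
  service   = c(127, 63, 1095, 1095, 44882, 17176),
  incidents = c(0, 0, 3, 4, 39, 29),
  stringsAsFactors = FALSE
)

# Sum every measured column within each type/year combination.
totals <- aggregate(cbind(period, service, incidents) ~ type + year,
                    data = ships_small, FUN = sum)
print(totals)

# Look up one group to compare with the cast() result.
a60 <- totals[totals$type == "A" & totals$year == 60, ]
print(a60)
```

For type A in year 60 this gives period 135, service 190 and incidents 0, which agrees with the first row of the cast() output.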
In R, we can read data from files stored outside the R environment. We can also write data into files that are
stored and accessed by the operating system. R can read and write various file formats such as CSV, Excel and XML.

In this chapter we will learn to read data from a CSV file and then write data into a CSV file. The file should be
present in the current working directory so that R can find it. We can also set a different working directory and
read files from there.
Getting and Setting the Working Directory
You can check which directory the R workspace is pointing to using the getwd() function. You can also set a new
working directory using the setwd() function.

# Get and print current working directory.
print(getwd())

# Set current working directory.
setwd("/web/com")

# Get and print current working directory.
print(getwd())
When we execute the above code, it produces the following result −

[1] "/web/com/1441086124_2016"
[1] "/web/com"
The result depends on your operating system and on the directory you are working in.
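Because setwd() changes the directory for the rest of the session, it can help to remember the old location and restore it afterwards. The following is a minimal sketch of that pattern; tempdir() is used only as a directory that is known to exist, and the variable name old_dir is an assumption.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# Switch to a directory that is guaranteed to exist (the session temp dir).
setwd(tempdir())
print(getwd())

# Restore the original working directory.
setwd(old_dir)
print(getwd())
```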

Input as CSV File


A CSV file is a text file in which the values in each row are separated by commas. Let's consider the following
data present in the file named input.csv.

You can create this file in Windows Notepad by copying and pasting this data. Save the file as input.csv, choosing
the All Files (*.*) option in the Save As dialog.

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
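If you prefer not to create the file by hand, it can also be written from R. The sketch below is one way to do it with writeLines(); the location file.path(tempdir(), "input.csv") and the names csv_lines and csv_path are assumptions made so the example does not touch your working directory.

```r
# The contents of input.csv, one string per line.
csv_lines <- c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  "5,Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT",
  "7,Simon,632.8,2013-07-30,Operations",
  "8,Guru,722.5,2014-06-17,Finance"
)

# Write the file into the session's temporary directory.
csv_path <- file.path(tempdir(), "input.csv")
writeLines(csv_lines, csv_path)

# Read it back to confirm that it parses as expected.
data <- read.csv(csv_path)
print(nrow(data))
```

To follow the rest of the chapter unchanged, write to "input.csv" in the working directory instead of csv_path.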
Reading a CSV File
The following is a simple example of using the read.csv() function to read a CSV file in your current working
directory −

data <- read.csv("input.csv")
print(data)
When we execute the above code, it produces the following result −

  id     name salary start_date       dept
1  1     Rick 623.30 2012-01-01         IT
2  2      Dan 515.20 2013-09-23 Operations
3  3 Michelle 611.00 2014-11-15         IT
4  4     Ryan 729.00 2014-05-11         HR
5  5     Gary 843.25 2015-03-27    Finance
6  6     Nina 578.00 2013-05-21         IT
7  7    Simon 632.80 2013-07-30 Operations
8  8     Guru 722.50 2014-06-17    Finance
Analyzing the CSV File
The read.csv() function returns its result as a data frame. This can be checked with is.data.frame(). We can also
check the number of columns and rows.

data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
When we execute the above code, it produces the following result −

[1] TRUE
[1] 5
[1] 8
Once the data has been read into a data frame, we can apply all the functions applicable to data frames, as shown
in the sections that follow.
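Beyond the row and column counts, it is often useful to see which type each column was read as. The sketch below reads a shortened copy of the data from a string (using the text argument of read.csv()) and asks for start_date to be parsed as a Date through colClasses. The three-row sample and the name csv_text are assumptions for illustration.

```r
# A shortened copy of input.csv held in a string.
csv_text <- "id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT"

# Read from the string and parse start_date as a Date column.
data <- read.csv(text = csv_text, colClasses = c(start_date = "Date"))

# Compact overview of the structure, then dimensions and column classes.
str(data)
print(dim(data))
print(sapply(data, function(col) class(col)[1]))
```

With start_date stored as a Date, later comparisons do not need an explicit as.Date() call on the column.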

Get the maximum salary


# Create a data frame.
data <- read.csv("input.csv")

# Get the max salary from data frame.
sal <- max(data$salary)
print(sal)
When we execute the above code, it produces the following result −

[1] 843.25
Get the details of the person with max salary
We can fetch rows meeting specific filter criteria similar to a SQL where clause.

# Create a data frame.
data <- read.csv("input.csv")

# Get the details of the person having the max salary.
retval <- subset(data, salary == max(salary))
print(retval)
When we execute the above code, it produces the following result −

  id name salary start_date    dept
5  5 Gary 843.25 2015-03-27 Finance
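An alternative to comparing against max() is to ask for the position of the maximum with which.max() and index the row directly. The sketch below uses a small data frame built inline (an assumption, so that it runs without input.csv) and also shows how a missing value affects max().

```r
# A small stand-in for the data read from input.csv.
data <- data.frame(
  name   = c("Rick", "Dan", "Gary", "Nina"),
  salary = c(623.3, 515.2, 843.25, 578),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum.
top <- data[which.max(data$salary), ]
print(top)

# A missing value makes max() return NA unless na.rm = TRUE is given.
with_gap <- c(623.3, NA, 843.25)
print(max(with_gap))
print(max(with_gap, na.rm = TRUE))
```

Note that which.max() returns only the first row when several rows share the maximum, whereas the subset() form returns all of them.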
Get all the people working in IT department
# Create a data frame.
data <- read.csv("input.csv")

retval <- subset(data, dept == "IT")
print(retval)
When we execute the above code, it produces the following result −

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT
6  6     Nina  578.0 2013-05-21   IT
Get the persons in IT department whose salary is greater than 600
# Create a data frame.
data <- read.csv("input.csv")

info <- subset(data, salary > 600 & dept == "IT")
print(info)
When we execute the above code, it produces the following result −

  id     name salary start_date dept
1  1     Rick  623.3 2012-01-01   IT
3  3 Michelle  611.0 2014-11-15   IT
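The same filter can be written with logical indexing instead of subset(), which makes the row selection explicit. This is a sketch on a small inline data frame (an assumption, so it runs without input.csv); the names keep, info and info_safe are illustrative.

```r
# A small stand-in for the data read from input.csv.
data <- data.frame(
  name   = c("Rick", "Dan", "Michelle", "Nina"),
  salary = c(623.3, 515.2, 611, 578),
  dept   = c("IT", "Operations", "IT", "IT"),
  stringsAsFactors = FALSE
)

# Build the condition once; & combines the two tests element by element.
keep <- data$salary > 600 & data$dept == "IT"
print(keep)

# Index rows with the logical vector; the empty column slot keeps all columns.
info <- data[keep, ]
print(info)

# which() converts the condition to row numbers and drops any NA results.
info_safe <- data[which(keep), ]
```

subset() silently drops rows where the condition is NA, while plain logical indexing returns a row of NAs for them, so which() is the closer equivalent when missing values are possible.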
Get the people who joined in or after 2014
# Create a data frame.
data <- read.csv("input.csv")

# Use >= so that a start date of exactly 2014-01-01 is included.
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))
print(retval)
When we execute the above code, it produces the following result −

  id     name salary start_date    dept
3  3 Michelle 611.00 2014-11-15      IT
4  4     Ryan 729.00 2014-05-11      HR
5  5     Gary 843.25 2015-03-27 Finance
8  8     Guru 722.50 2014-06-17 Finance
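When the condition is really about the year, one option is to extract the year from the date and compare numbers. The sketch below does this with format() on a small inline data frame (an assumption, so it runs without input.csv) and also shows how a strict and a non-strict comparison differ on the boundary date.

```r
# The id and start_date columns from input.csv.
data <- data.frame(
  id = 1:8,
  start_date = c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                 "2015-03-27", "2013-05-21", "2013-07-30", "2014-06-17"),
  stringsAsFactors = FALSE
)

# Convert the text to Date, then pull out the four-digit year as an integer.
start      <- as.Date(data$start_date)
start_year <- as.integer(format(start, "%Y"))

recent <- data[start_year >= 2014, ]
print(recent)

# On the boundary date itself, > excludes the row and >= includes it.
boundary <- as.Date("2014-01-01")
print(boundary >  as.Date("2014-01-01"))
print(boundary >= as.Date("2014-01-01"))
```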
Writing into a CSV File
R can create a CSV file from an existing data frame. The write.csv() function is used to create the CSV file. The
file is created in the working directory.

# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))

# Write filtered data into a new file.
write.csv(retval, "output.csv")
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result −
  X id     name salary start_date    dept
1 3  3 Michelle 611.00 2014-11-15      IT
2 4  4     Ryan 729.00 2014-05-11      HR
3 5  5     Gary 843.25 2015-03-27 Finance
4 8  8     Guru 722.50 2014-06-17 Finance
Here the column X holds the row names of retval, which write.csv() writes to the file by default. It can be dropped
by passing row.names = FALSE when writing the file.

# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))

# Write filtered data into a new file.
write.csv(retval, "output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)
When we execute the above code, it produces the following result −

  id     name salary start_date    dept
1  3 Michelle 611.00 2014-11-15      IT
2  4     Ryan 729.00 2014-05-11      HR
3  5     Gary 843.25 2015-03-27 Finance
4  8     Guru 722.50 2014-06-17 Finance
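The effect of row.names can be seen directly by writing the same data frame twice and comparing what comes back. The sketch below uses temporary files from tempfile() and a two-row data frame; both, along with the names with_rn and without_rn, are assumptions so that the example leaves no files in the working directory.

```r
# Two rows standing in for the filtered result.
retval <- data.frame(
  id     = c(3L, 4L),
  name   = c("Michelle", "Ryan"),
  salary = c(611, 729),
  stringsAsFactors = FALSE
)

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

# Default: row names are written as an unnamed first column.
write.csv(retval, with_rn)

# row.names = FALSE leaves that column out.
write.csv(retval, without_rn, row.names = FALSE)

# read.csv() names the unnamed column "X" when reading it back.
print(names(read.csv(with_rn)))
print(names(read.csv(without_rn)))

# The raw header line shows the empty first field.
print(readLines(with_rn, n = 1))
```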
