R Programming

The document discusses various R data types and structures including vectors, factors, matrices and lists. It covers creating and manipulating vectors, performing operations on vectors, and sequencing values using colon operations. Functions, data types such as numeric, integer, character, logical and date are also explained.

Uploaded by

comedynights


1.1.R intro & Mathematical Operation
1.2 - Function
2.1. Data Type
3.1. Vectors
3.2. Factors in R
3.3. Missing data
4.1. Data Structure Intro
4.2. Data structure – Data Frame
4.3 Data Structure Matrices
4.4 Data Structure Array
4.5. Data Structures List
4.6. Reading data in R
4.7 Built in Data in R
5.1 Basic statistics – Mean, Median
5.2. Correlation Heatmap
5.3. Hypothesis Testing
6. Learning from Assignment

1.1.R intro & Mathematical Operation
16/12/2020
# Useful Shortcuts
# To Clear the R-Console - Ctrl + L
# To Execute a particular line of Code - Ctrl + Enter

# Intro and Mathematical operation

a = 5 # here a is a variable and value assigned to a is 5
# 1. Command Line Interface
a

## [1] 5 # execution

b = 10
b

## [1] 10

# 2. Objects do not need to be declared explicitly, whereas many other
# languages require a declaration first.
a = 5
class(a) # To know the data type of a.

## [1] "numeric"

a = "Hello"
class(a)

## [1] "character"

a = TRUE
class(a)

## [1] "logical"

a = FALSE
class (a)

## [1] "logical"

# Object Assignments and Simple Calculations

x = 10
y = 15
x+y

## [1] 25

x-y

## [1] -5

x*y

## [1] 150

x/y

## [1] 0.6666667

sqrt(x)

## [1] 3.162278

x^y

## [1] 1e+15

exp(x) # exponential

## [1] 22026.47

log(x, base=exp(1))

## [1] 2.302585

log10(x)

## [1] 1

factorial(x)

## [1] 3628800

cos(x)

## [1] -0.8390715

abs(x) # for absolute value of x

## [1] 10
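
A few related arithmetic operators are worth trying alongside the ones above. The sketch below is an illustrative addition: the values 17 and 5 and the names `m` and `n` are arbitrary choices, not part of the original session.

```r
# Illustrative values, chosen only for this sketch
m = 17
n = 5

m %/% n          # integer division: how many whole times n fits into m
m %% n           # modulo: the remainder left after integer division
round(m / n, 1)  # round the quotient to 1 decimal place
m^2              # power
```

Integer division and modulo always satisfy `(m %/% n) * n + (m %% n) == m`.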

1.2 - Function
17/12/2020

getwd() # Get Working Directory

## [1] "C:/Users/Vaibhav/Documents/R/1. R intro 15 Dec 2020"

# Functions in R
# to create a function with function name divider
divider = function(x,y) {
result = x/y
print(result)
}
divider(50,25) # x is assigned 50 and y is assigned 25

## [1] 2

divider(100,25) # just pass specific values for x and y to run the function

## [1] 4

#function for Multiplication

multiply = function(a,b){
result = a * b
print (result)
}
multiply(23,25)

## [1] 575

multiply (19,20)

## [1] 380
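
The functions above print their result. A common variation is to return the value so it can be stored and reused. The sketch below assumes that goal; the name `safe_divider`, the default `y = 1` and the zero check are hypothetical additions, not part of the original notes.

```r
# Hypothetical variant of divider(): returns the result instead of printing it
safe_divider = function(x, y = 1) {   # y defaults to 1 when it is not supplied
  if (y == 0) {
    return(NA)                        # avoid dividing by zero
  }
  x / y                               # the last expression is the return value
}

ratio = safe_divider(50, 25)  # store the result in a variable
ratio
safe_divider(10)              # uses the default y = 1
safe_divider(10, 0)           # the zero check returns NA
```

Returning a value rather than printing it lets the result feed into further calculations, for example `safe_divider(50, 25) + 1`.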

# Variables Names are CASE SENSITIVE (cannot use A for a)

A=10
a=24

# CONCATENATION AND ARRAYS (append and join values)

f <- c(1,2,3,4,5) # "<-" is the traditional assignment operator; "=" also works for top-level assignment.
# c - combine. Combine these values as a vector.

f = c(1,2,3,4,5)
f
## [1] 1 2 3 4 5

f+4 # 4 will be added to all values

## [1] 5 6 7 8 9

d = f / 4 # All values will be divided by 4
d

## [1] 0.25 0.50 0.75 1.00 1.25

f+d

## [1] 1.25 2.50 3.75 5.00 6.25

f = c(1,2,3,4,5)

# Listing and Deleting Objects (Variables)

ls() # what all objects we have created

## [1] "a" "A" "d" "divider" "f" "multiply"

rm (a) # to remove particular object

ls()

## [1] "A" "d" "divider" "f" "multiply" # "a" is removed.

rm(list = ls()) # to remove all variables (after executing this, the
# environment will be empty)
ls()

## character(0)

2.1. Data Type

17/12/2020

Last Topic revision

# Line-by-line execution of commands - Interpreter (R is interpreted, not compiled)
# Not explicitly declaring variables.

#A = 10
#Variable /Object -- > A (Case Sensitive)
#Value = 10
#Read from right to left.
# <- or = # Assignment.
# Simple Mathematical Operations.
# Remove the objects or variables created.

Current Topic
# 4 MEASUREMENT SCALES (Nominal, Ordinal, Interval and Ratio) - NOIR
# Two views ("two brains"): our own (NOIR) and the system's (Numeric, Character, Logical, Date, Vector).
# We have to map our view of the data onto the types R understands.
# DATA TYPES
x = 10
class(x)

## [1] "numeric"

# Numeric covers whole numbers and decimals. R distinguishes Integer (whole number) from
# Numeric (double-precision decimal).
i = 5L # the L suffix is needed to store the value as an integer
class(i)

## [1] "integer"

is.integer(i)

## [1] TRUE

is.numeric(x)

## [1] TRUE
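
To see the integer/numeric distinction more concretely, the following sketch checks how R stores each value. It is an illustrative addition that uses only base R functions (`typeof`, `is.numeric`, `is.integer`, `as.integer`).

```r
x = 10    # stored as a double by default
i = 5L    # the L suffix makes it an integer

typeof(x)         # "double"
typeof(i)         # "integer"
is.numeric(i)     # TRUE: integers also count as numeric
is.integer(x)     # FALSE: 10 written without L is not an integer
class(i * 2L)     # "integer": integer times integer stays integer
class(i / 2L)     # "numeric": division always gives a double
as.integer(7.9)   # 7: conversion truncates toward zero
```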

# Character - Categorical Variable - Words/String (Nominal),
# Classification (Gender - Male, Female)
s = "R_Studio"
class(s)

## [1] "character"

# Levels of Classification - Factor --- Involves levels.(Ordinal)

# Eg: Edu Quali - X, XII, Graduation, Post Graduation (4 Levels)

# Logical - TRUE (1) and FALSE (0)

TRUE * 5

## [1] 5 # As for R TRUE is 1, so 1*5 = 5

FALSE * 5

## [1] 0 # As for R FALSE is 0, so 0*5 = 0

K = TRUE
class(K)

## [1] "logical"

is.logical(K)

## [1] TRUE

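# Date - stored as a numeric value: the number of days since the origin date.

# In R the origin is 1 Jan 1970
# Date - default input format is yyyy-mm-dd (other layouts need the "format" argument)
# POSIXct - Date plus Time, stored as seconds since 1 Jan 1970 (UTC).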

date1 = as.Date("2012-06-28")
# as.Date()# Auto complete # How to enter
# ? as.Date # for help
date1

## [1] "2012-06-28"

class (date1)

## [1] "Date"

as.numeric(date1)

## [1] 15519

#POSIXct - Date and Time

date2 = as.POSIXct("2012-06-28 17:42")
date2

## [1] "2012-06-28 17:42:00 IST"

class(date2)

## [1] "POSIXct" "POSIXt"

as.numeric(date2)

## [1] 1340885520
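
Because dates are stored as numbers, ordinary arithmetic works on them. The sketch below is an illustrative addition built on the `date1` value used above; the second date (25 Dec 2012) and the explicit `tz = "UTC"` are assumptions made so the results do not depend on the local time zone.

```r
date1 = as.Date("2012-06-28")

as.numeric(as.Date("1970-01-01"))             # 0: the origin date itself
date1 + 30                                    # adding days gives "2012-07-28"
as.Date("2012-12-25") - date1                 # time difference of 180 days
format(date1, "%d/%m/%Y")                     # "28/06/2012"
as.Date("06/28/2012", format = "%m/%d/%Y")    # parsing an mm/dd/yyyy string

# With an explicit time zone the numeric value no longer depends on the machine
date2 = as.POSIXct("2012-06-28 17:42", tz = "UTC")
as.numeric(date2)                             # seconds since 1970-01-01 UTC
```

The value printed earlier for `date2` was computed in IST (UTC+5:30), so it is 19800 seconds smaller than the UTC value here.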

3.1. Vectors
17/12/2020
# Vector - R is called a vectorized language.
# Array - n-dimensional collection of elements of the same type
# Matrix - special case of an array (two-dimensional array). A matrix generally
# contains numeric values.
# R uses the vectorized form for calculation (e.g. when solving linear
# regression).

# A vector is a collection of elements of the same type.

# (i.e.) A vector cannot be of mixed type.
# R is a vectorized language. That means operations are applied to each
# element of the vector automatically,
# ..., without the need to loop through the vector.
# This is a powerful concept, and vectors play a crucial and significant
# role in R.

# Creating Vectors
# The most common way to create a Vector is using 'c' [combine]
x = c(1,2,3,4,5,6,7,8,9,10) # c - combine. Combine these values as a vector.
x

## [1] 1 2 3 4 5 6 7 8 9 10

# Vector Operations
x*3 # multiplies each element by 3; No loops necessary!

## [1] 3 6 9 12 15 18 21 24 27 30

x+2

## [1] 3 4 5 6 7 8 9 10 11 12

x-3

## [1] -2 -1 0 1 2 3 4 5 6 7

x/4

## [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50

x^2

## [1] 1 4 9 16 25 36 49 64 81 100

sqrt(x)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278

# colon (:) operation – Sequencing (values from one number to another)

# Creates sequence of Numbers in either direction!
1:10 # all number from 1 to 10

## [1] 1 2 3 4 5 6 7 8 9 10

10:1

## [1] 10 9 8 7 6 5 4 3 2 1

-2:3

## [1] -2 -1 0 1 2 3

5:-7

## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7

# More on Vector Operations ... Two vectors

# create two vectors of equal length
x = 1:10
y = -5:4
x + y # Add

## [1] -4 -2 0 2 4 6 8 10 12 14

x-y

## [1] 6 6 6 6 6 6 6 6 6 6

x*y

## [1] -5 -8 -9 -8 -5 0 7 16 27 40

x/y

## [1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5

x^y

## [1] 1.000000e+00 6.250000e-02 3.703704e-02 6.250000e-02 2.000000e-01

## [6] 1.000000e+00 7.000000e+00 6.400000e+01 7.290000e+02 1.000000e+04

# check the length of each vector

length(x)

## [1] 10

length(y)

## [1] 10

# Unequal length vectors
x+c(1,2) # 1 & 2 will be added repeatedly.

## [1] 2 4 4 6 6 8 8 10 10 12

x+c(1,2,3) # If the longer vector's length is not a multiple of the shorter one's, a warning is given!

## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length

## [1] 2 4 6 5 7 9 8 10 12 11
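
The behaviour above is called recycling: the shorter vector is repeated until it matches the longer one. The sketch below makes the repetition explicit with `rep`; it is an illustrative addition, and the name `recycled` is hypothetical.

```r
x = 1:10

# Spell out what R does internally with the shorter vector c(1, 2)
recycled = rep(c(1, 2), length.out = length(x))
recycled                                # 1 2 1 2 1 2 1 2 1 2

identical(x + c(1, 2), x + recycled)    # TRUE: both give the same result
```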

# Comparisons also work on vectors!

x <= 5

## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

x<y

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# Vector Comparison - "any" and "all"

x = 10:1
y = -4:5
any(x<y)

## [1] TRUE

all(x<y)

## [1] FALSE

# The "nchar" function also acts on each element of vector.

q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
"Basketball","Tennis","Cricket","Soccer")
q

## [1] "Hockey" "Football" "Baseball" "Curlin" "Rugby"

## [6] "Lacrosse" "Basketball" "Tennis" "Cricket" "Soccer"

nchar(q) # no. of characters

## [1] 6 8 8 6 5 8 10 6 7 6

nchar(y) # no. of characters in each number (the minus sign counts too)

## [1] 2 2 2 2 1 1 1 1 1 1

# Subscripting: accessing individual elements of a vector is done using
# square brackets []
x[1]

## [1] 10

x[1:2]

## [1] 10 9

x[c(1:5,9)]

## [1] 10 9 8 7 6 2
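
Square brackets also accept logical conditions and negative positions. The sketch below is an illustrative addition that combines the comparison operators shown earlier with subscripting, using the same `x = 10:1`.

```r
x = 10:1

x[x > 5]            # keep only the elements greater than 5: 10 9 8 7 6
which(x > 5)        # the positions of those elements: 1 2 3 4 5
x[-1]               # a negative index drops that position (here the first)
x[c(TRUE, FALSE)]   # a logical index is recycled: every other element
```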

# Give Names to Vector!

c(One = "a", Two = "y", Last = "r") # Name-Value pair

## One Two Last

## "a" "y" "r"

# You can Name the vector after creating vector as well!

w = 1:3
names(w) = c("a","b","c")
w

## a b c
## 1 2 3

3.2. Factors in R
17/12/2020
# Factor Vectors - categorical data; ordered factors represent ordinal data [Ordered Categorical]
# Factors are an important concept in R, esp. when building models
# Nominal - unordered (sachin, rahul), Ordinal - ordered (supervisor,
# GM, AM, AGM)
# Nominal - character (or unordered factor), Ordinal - ordered factor
q = c("Hockey","Lacrosse","Hockey","Water Polo","Hockey","Lacrosse")
q2 = c(q,"Hockey","Lacrosse","Hockey","Water Polo","Hockey","Lacrosse")
q2

## [1] "Hockey" "Lacrosse" "Hockey" "Water Polo" "Hockey"

## [6] "Lacrosse" "Hockey" "Lacrosse" "Hockey" "Water Polo"
## [11] "Hockey" "Lacrosse"

class(q2)

## [1] "character"

as.numeric(q2)

## Warning: NAs introduced by coercion

## [1] NA NA NA NA NA NA NA NA NA NA NA NA

class(q2)

## [1] "character"

# Converting "q2" to factor!

q2_F = as.factor(q2)
q2_F # notice the "Levels" info in the output!

## [1] Hockey Lacrosse Hockey Water Polo Hockey Lacrosse
## [7] Hockey Lacrosse Hockey Water Polo Hockey Lacrosse
## Levels: Hockey Lacrosse Water Polo

# 3 Levels - the three distinct names in "q2" (Hockey, Lacrosse, Water Polo)
# The "levels" of a factor are the unique values of that factor variable.
# Technically, R assigns a unique integer to each distinct name. See
# below.
as.numeric(q2_F) # In the output --> notice 1 = "Hockey"

## [1] 1 2 1 3 1 2 1 2 1 3 1 2

# Integers are allotted to the levels in alphabetical order.

# Ordered Levels and Un-ordered Levels

# Factors can drastically reduce the size of the variable...
# ... because each unique value is stored only once and the data are held as integer codes!
factor(x=c("High School","College","Masters","Doctorate"),
levels = c("High School","College","Masters","Doctorate"),
ordered = TRUE)

## [1] High School College Masters Doctorate

## Levels: High School < College < Masters < Doctorate
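
Once the levels are ordered, comparisons between them become meaningful. The sketch below is an illustrative addition: the vector `edu` and its four observations are made-up example data, not taken from the original notes.

```r
# Hypothetical example data: education level of four people
edu = factor(c("College", "High School", "Masters", "College"),
             levels = c("High School", "College", "Masters", "Doctorate"),
             ordered = TRUE)

levels(edu)        # all four levels, including the unused "Doctorate"
as.integer(edu)    # the integer codes follow the level order: 2 1 3 2
edu < "Masters"    # comparisons respect the ordering
table(edu)         # counts per level (an unused level shows a count of 0)
max(edu)           # the highest level present in the data
```

For an unordered factor the same `<` comparison would return `NA` with a warning, which is one practical reason to set `ordered = TRUE` for ordinal data.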

3.3. Missing data

17/12/2020
# Missing data plays a crucial role in computing and statistics
# R has two types of missing data - NA and NULL
# While they look similar, they behave differently and hence need
# attention!

# NA - Missing data - Missing Value

z = c(1,2,NA,8,3,NA,3)
z

## [1] 1 2 NA 8 3 NA 3

# "is.na" tests each element of a vector for missingness

is.na(z)

## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE

#Another example
z_char = c("Hockey", NA ,"Cricket")
z_char

## [1] "Hockey" NA "Cricket"

is.na(z_char)

## [1] FALSE TRUE FALSE

# NULL - Absence of anything. It is not exactly missingness, but
# nothingness
# Eg: Having Brain but thinking Nothing! - Makes Sense!!!
# Functions can sometimes return NULL and their arguments can be NULL.
# An important difference is that NULL is not a value and cannot exist within an
# atomic vector...
# ...If used inside a vector, it simply disappears! Let's see...
z= c(1,NULL,3)
z

## [1] 1 3

x = c(1,NA,3)
x

## [1] 1 NA 3

# Notice that the "NULL" did not get stored in "z"; in fact, "z" has a
# length of only 2!
length(z)

## [1] 2 # NULL will not be counted in length

length(x)

## [1] 3 # NA will be counted in length

# Assigning NULL and checking!

d = NULL
is.null(d)

## [1] TRUE
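
NA values propagate through most calculations, so they usually have to be handled explicitly. The sketch below is an illustrative addition that reuses the first `z` vector from this section and base R's `na.rm` argument.

```r
z = c(1, 2, NA, 8, 3, NA, 3)

mean(z)                 # NA: a single missing value makes the result missing
mean(z, na.rm = TRUE)   # 3.4: NA values are removed before averaging
sum(is.na(z))           # 2: count of missing values
z[!is.na(z)]            # keep only the observed values: 1 2 8 3 3
NA > 1                  # NA: comparisons with NA are also missing
```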

4.1. Data Structure Intro

Vaibhav Kumar

17/12/2020
# Data Structures in R
# Data come in many types and structures, which can pose a problem for
# some analysis environments, but R handles them with ease.

## VECTOR
# The most common data structure is the one-dimensional vector
# Vector forms the basis of everything in R.
# A vector is collection of elements of same type.
# (ie) A vector cannot be of mixed type.
# R is a Vectorized Language. That means operations are applied to each
# element of the vector automatically,
# ..., without the need to loop through the vector.

# This is a powerful concept, and vectors play a crucial and significant
# role in R.

## DATA FRAME
# Data Frames (DF) - One of the most useful features of R & an often-cited reason for R's
# ease of use.
# In a data frame, each column is actually a vector, and all columns have the same
# length.
# Each column can hold a different type of data.
# Within each column, every element must be of the same type, as in
# vectors.

## MATRICES
# A matrix (plural matrices) is a rectangular array or table of numbers,
# symbols, or expressions...
# ..., arranged in rows and columns (i.e.) a 2-Dimensional Array
# Similar to data.frame(RxC) and also similar to Vector
# Matrix - Element by element operations are possible.

## ARRAYS
# Arrays - An array is essentially a multidimensional vector.
# It must all be of the same type and
# ...individual elements are accessed using Square Brackets.
# First element is Row(R) Index, Second Element is Column(C) Index and
# the remaining elements are for Outer Dimensions (OD).

## LIST
# Lists - Stores any number of items of any type.
# List can contain all numerics or characters or...
#...a mix of the two or data.frames or recursively other lists.

# Sometimes data requires more complex storage than simple vectors.

# Data Structures - Apart from Vectors, we have Data Frames, Matrices, Lists
# and Arrays.
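
The rule that a vector cannot be of mixed type, while a list can, is easy to check directly. The sketch below is an illustrative addition; the names `v1`, `v2` and `l1` are hypothetical.

```r
v1 = c(1, "a", TRUE)     # mixing types in a vector: everything becomes character
v1                       # "1" "a" "TRUE"
class(v1)                # "character"

v2 = c(1, TRUE, FALSE)   # numeric with logical: TRUE/FALSE become 1/0
v2                       # 1 1 0
class(v2)                # "numeric"

l1 = list(1, "a", TRUE)  # a list keeps each element's own type
sapply(l1, class)        # "numeric" "character" "logical"
```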

4.2. Data structure – Data Frame

# Data Frames (DF) - One of the most useful features of R & an often-cited reason for R's
# ease of use.
# In a data frame, each column is actually a vector, and all columns have the same
# length.
# Each column can hold a different type of data.
# Within each column, every element must be of the same type, as in vectors

# Creating a Dataframe from vectors

x = 10:1
y = -4:5
q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
"Basketball","Tennis","Cricket","Soccer")
# to combine these 3 vectors, we will use data frame
theDF = data.frame(x,y,q) # this creates a 10x3 data.frame with x, y
# and q as variable names
theDF

## x y q
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

str(theDF) # This gives the structure of the data frame, such as data types,
# levels etc.

## 'data.frame': 10 obs. of 3 variables:

## $ x: int 10 9 8 7 6 5 4 3 2 1
## $ y: int -4 -3 -2 -1 0 1 2 3 4 5
## $ q: chr "Hockey" "Football" "Baseball" "Curlin" ...

q = as.factor(q) # to convert q into a factor and then assign it to q.

# Assigning names to column variables

theDF = data.frame (First=x, Second =y, Sport = q)
theDF

## First Second Sport

## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

# Checking the dimensions of the DF.

nrow(theDF)

## [1] 10

ncol(theDF)

## [1] 3

dim(theDF)

## [1] 10 3

names (theDF)

## [1] "First" "Second" "Sport"

names(theDF)[3] # whenever we require specific row or column, we use[].

## [1] "Sport"

rownames(theDF)

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

# Head and Tail

head(theDF)# First 6 rows with all columns

## First Second Sport

## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse

head(theDF, n=7) #first 7 rows with all columns

## First Second Sport

## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball

tail(theDF)# last six rows with all columns

## First Second Sport

## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

class(theDF)

## [1] "data.frame"

# Accessing Individual Column using $

theDF$Sport # gives the third column named Sport

## [1] Hockey Football Baseball Curlin Rugby Lacrosse
## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

# Accessing Specific row and column

theDF[3,2] # 3rd row and 2nd Column

## [1] -2

theDF[3,2:3] # 3rd Row and column 2 thru 3

## Second Sport
## 3 -2 Baseball

theDF[c(3,5), 2]# Row 3&5 from Column 2;

## [1] -2 0

# Since only one column was selected, it is returned as a vector, hence
# no column names in the output.

# Rows 3&5 and Columns 2 through 3

theDF[c(3,5), 2:3]

## Second Sport
## 3 -2 Baseball
## 5 0 Rugby

theDF[ ,3] # Access all Rows for column 3

## [1] Hockey Football Baseball Curlin Rugby Lacrosse

## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

theDF[ , 2:3] # Access all Rows for column 2 & 3

## Second Sport
## 1 -4 Hockey
## 2 -3 Football
## 3 -2 Baseball
## 4 -1 Curlin
## 5 0 Rugby
## 6 1 Lacrosse
## 7 2 Basketball
## 8 3 Tennis
## 9 4 Cricket
## 10 5 Soccer

theDF[2,]# Access all columns for Row 2

## First Second Sport

## 2 9 -3 Football

theDF[2:4,]

## First Second Sport

## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin

theDF[ , c("First", "Sport")]# access using Column Names

## First Sport
## 1 10 Hockey
## 2 9 Football
## 3 8 Baseball
## 4 7 Curlin
## 5 6 Rugby
## 6 5 Lacrosse
## 7 4 Basketball
## 8 3 Tennis
## 9 2 Cricket
## 10 1 Soccer

theDF[ ,"Sport"]# Access specific Column

## [1] Hockey Football Baseball Curlin Rugby Lacrosse

## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

class(theDF[ ,"Sport"])

## [1] "factor"

theDF["Sport"]# This returns the one column data.frame

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF["Sport"]) # Data.Frame

## [1] "data.frame"

theDF[["Sport"]]#To access Specific column using Double Square Brackets

## [1] Hockey Football Baseball Curlin Rugby Lacrosse

## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

class(theDF[["Sport"]]) # Factor

## [1] "factor"

theDF[ ,"Sport", drop = FALSE] # Use "drop = FALSE" to get a data.frame with
# single square brackets.

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF[ ,"Sport", drop = FALSE]) # data.frame

## [1] "data.frame"

theDF[ ,3, drop = FALSE]

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF[ ,3, drop = FALSE]) # data.frame

## [1] "data.frame"
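
Row and column positions can also be replaced by conditions. The sketch below is an illustrative addition that rebuilds the same data frame so it can run on its own; the names `positive` and `Total` are hypothetical.

```r
# Rebuild the data frame from this section so the sketch is self-contained
x = 10:1
y = -4:5
q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
      "Basketball","Tennis","Cricket","Soccer")
theDF = data.frame(First = x, Second = y, Sport = as.factor(q))

positive = theDF[theDF$Second > 0, ]       # keep rows where Second is positive
positive
nrow(positive)                             # 5 rows satisfy the condition

theDF[theDF$Sport == "Rugby", "First"]     # value of First for the Rugby row: 6

theDF$Total = theDF$First + theDF$Second   # add a computed column (hypothetical name)
head(theDF, n = 3)
```

The logical vector before the comma selects rows; leaving the position after the comma empty keeps all columns.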

# To see how factor is stored in data.frame

newFactor = factor(c("Pennsylvania","New York","New Jersey","New York",
"Tennessee","Massachusetts","Pennsylvania","New York"))
newFactor

## [1] Pennsylvania New York New Jersey New York Tennessee

## [6] Massachusetts Pennsylvania New York

## Levels: Massachusetts New Jersey New York Pennsylvania Tennessee

# model.matrix(~newFactor -1)
# ? model.matrix() # still to be understood
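
The commented-out `model.matrix` call hints at how a factor is turned into numeric columns for modelling. The sketch below is a minimal illustration, assuming the same `newFactor` values as above; the name `mm` is hypothetical.

```r
newFactor = factor(c("Pennsylvania","New York","New Jersey","New York",
                     "Tennessee","Massachusetts","Pennsylvania","New York"))

# "- 1" drops the intercept, so every level gets its own 0/1 indicator column
mm = model.matrix(~ newFactor - 1)

dim(mm)        # 8 rows (observations) by 5 columns (levels)
colSums(mm)    # how many times each level occurs
rowSums(mm)    # each row has exactly one 1
```

Each row contains a single 1 in the column of its level and 0 elsewhere, which is the numeric form most modelling functions work with.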

4.3 Data Structure Matrices
Vaibhav Kumar

17/12/2020
# A matrix (plural matrices) is a rectangular array or table of numbers,
# symbols, or expressions...
# ..., arranged in rows and columns (i.e.) a 2-Dimensional Array

# Similar to data.frame(RxC) and also similar to Vector

# Matrix - Element by element operations are possible

A = matrix(1:10, nrow=5) # Create a 5x2 matrix. nrow means no. of rows
B = matrix(21:30, nrow=5) # Create another 5x2 matrix
C = matrix(21:40, nrow=2) # Create a 2x10 matrix
A

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

B

## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30

C

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 21 23 25 27 29 31 33 35 37 39
## [2,] 22 24 26 28 30 32 34 36 38 40

nrow(A)

## [1] 5

ncol(A)

## [1] 2

dim(A)

## [1] 5 2

# Add Them
A+B

## [,1] [,2]
## [1,] 22 32
## [2,] 24 34
## [3,] 26 36
## [4,] 28 38
## [5,] 30 40

# Multiply Them (element-by-element multiplication!)
A

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

B

## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30

A*B # A = 5x2 and B = 5x2

## [,1] [,2]
## [1,] 21 156
## [2,] 44 189
## [3,] 69 224
## [4,] 96 261
## [5,] 125 300

#See if the elements are equal

A == B

## [,1] [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE

# Matrix Multiplication (multiplying a 5x2 matrix by a 5x2 matrix is not
# possible: the columns of the first must equal the rows of the second).

# Matrix Multiplication (MM): A is 5x2. B is 5x2. B-transpose is 2x5

A %*% t(B) # %*% is used for matrix multiplication

## [,1] [,2] [,3] [,4] [,5]

## [1,] 177 184 191 198 205
## [2,] 224 233 242 251 260
## [3,] 271 282 293 304 315

## [4,] 318 331 344 357 370
## [5,] 365 380 395 410 425

# Naming the Columns and Rows

colnames(A)

## NULL

rownames(A)

## NULL

colnames(A)= c("Left","Right")
rownames(A)= c("1st","2nd","3rd","4th","5th")
colnames(B)

## NULL

rownames(B)

## NULL

colnames(B)= c("First","Second")
rownames(B)= c("One","Two","Three","Four","Five")
colnames(C)

## NULL

rownames(C)

## NULL

colnames(C) = LETTERS [1:10]

rownames(C) = c("Top", "Bottom")

# Matrix Multiplication. A is 5x2 and C is 2x10

dim(A)

## [1] 5 2

dim(C)

## [1] 2 10

t(A)

## 1st 2nd 3rd 4th 5th

## Left 1 2 3 4 5
## Right 6 7 8 9 10

A %*% C

## A B C D E F G H I J
## 1st 153 167 181 195 209 223 237 251 265 279
## 2nd 196 214 232 250 268 286 304 322 340 358
## 3rd 239 261 283 305 327 349 371 393 415 437
## 4th 282 308 334 360 386 412 438 464 490 516
## 5th 325 355 385 415 445 475 505 535 565 595
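
The dimension rule behind `%*%` can be checked before multiplying. The sketch below is an illustrative addition that recreates A, B and C without names; the use of `tryCatch` and the name `msg` are assumptions made for the example.

```r
A = matrix(1:10, nrow = 5)    # 5x2
B = matrix(21:30, nrow = 5)   # 5x2
C = matrix(21:40, nrow = 2)   # 2x10

ncol(A) == nrow(C)            # TRUE: A %*% C is allowed
dim(A %*% C)                  # the result is 5 x 10
ncol(A) == nrow(B)            # FALSE: A %*% B is not conformable

# tryCatch captures the error instead of stopping the script
msg = tryCatch(A %*% B, error = function(e) "non-conformable")
msg

dim(t(A) %*% B)               # transposing A gives 2x5, so the result is 2 x 2
```

In general an r1 x c1 matrix times an r2 x c2 matrix needs c1 equal to r2, and the result is r1 x c2.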

4.4 Data Structure Array
Vaibhav Kumar

17/12/2020
# Arrays - An array is essentially a multidimensional vector.
# It must all be of the same type and
# ...individual elements are accessed using Square Brackets.
# First element is Row(R) Index, Second Element is Column(C) Index and
# the remaining elements are for Outer Dimensions (OD).

theArray = array(1:12, dim=c(2,3,2))# Total Elements = R x C x OD

theArray

## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12

theArray[1, ,] # Row 1, all columns, all outer
# dimensions; the result is a C x OD matrix (here 3 x 2)

## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
## [3,] 5 11

theArray[1, ,1] # Accessing all elements from Row 1, all columns, first
# outer dimension

## [1] 1 3 5

theArray[, ,1]# Accessing all rows, all columns, first outer dimension

## [,1] [,2] [,3]

## [1,] 1 3 5
## [2,] 2 4 6

# Array whose outer dimension has four levels (still three dimensions: R x C x OD)

theArray_4D = array(1:32, dim=c(2,4,4))
theArray_4D

## , , 1
##

## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 9 11 13 15
## [2,] 10 12 14 16
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 17 19 21 23
## [2,] 18 20 22 24
##
## , , 4
##
## [,1] [,2] [,3] [,4]
## [1,] 25 27 29 31
## [2,] 26 28 30 32

theArray_4D[1, ,] # Row 1, all columns, all
# outer dimensions; the result is a C x OD matrix (here 4 x 4)

## [,1] [,2] [,3] [,4]

## [1,] 1 9 17 25
## [2,] 3 11 19 27
## [3,] 5 13 21 29
## [4,] 7 15 23 31

theArray_4D[1, ,1]

## [1] 1 3 5 7

theArray[, ,1]

## [,1] [,2] [,3]

## [1,] 1 3 5
## [2,] 2 4 6
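
Single elements and per-dimension summaries follow the same R, C, OD index order. The sketch below is an illustrative addition using the first array from this section and base R's `apply`.

```r
theArray = array(1:12, dim = c(2, 3, 2))

dim(theArray)              # 2 3 2: rows, columns, outer dimension
theArray[2, 3, 2]          # row 2, column 3, second slice: 12
apply(theArray, 3, sum)    # total of each outer slice: 21 57
apply(theArray, 1, sum)    # total of each row across all columns and slices: 36 42
```

The second argument of `apply` names the dimension to keep: 1 for rows, 2 for columns, 3 for the outer dimension.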

4.5. Data Structures List

Vaibhav Kumar

17/12/2020
# Lists - Stores any number of items of any type.
# List can contain all numerics or characters or...
#...a mix of the two or data.frames or recursively other lists.

# Lists are created with the "list" function.
# Each argument in "list" becomes an element of the list.

list(1,2,3)# creates a three element list

## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

list(c(1,2,3)) # creates a single-element list (the element is a three-element vector)

## [[1]]
## [1] 1 2 3

list3 = list(c(1,2,3), 3:7)# create two element list

# first is three elements vector, next is five element vector.
list3

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7

# The same can be written with enclosing parentheses, which assign and print in one step

(list3 = list(c(1,2,3), 3:7))

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7

# Two Element list

# First element is data.frame and next is 10 element vector

list(theDF, 1:10)# theDF is already created in previous exercise!

## [[1]]
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10

# Three element list
list5 = list(theDF, 1:10, list3)
list5

#Naming List (similar to column name in data.frame)

names(list5)= c("data.frame", "vector","list")
names(list5)

## [1] "data.frame" "vector" "list"

list5

## $data.frame
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## $vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $list
## $list[[1]]
## [1] 1 2 3
##

## $list[[2]]
## [1] 3 4 5 6 7

#Naming using "Name-Value" pair

list6 = list(TheDataFrame = theDF, TheVector = 1:10, TheList = list3)
names(list6)

## [1] "TheDataFrame" "TheVector" "TheList"

list6

## $TheDataFrame
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## $TheVector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $TheList
## $TheList[[1]]
## [1] 1 2 3
##
## $TheList[[2]]
## [1] 3 4 5 6 7

# Creating an empty list

(emptylist = vector(mode="list", length =4))

## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
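
# A short sketch of why an empty list is useful: it can be pre-allocated and
# then filled slot by slot. The object name "results" and the values stored in
# it are hypothetical, chosen only for illustration.

```r
# Pre-allocate a list with 4 empty (NULL) slots
results <- vector(mode = "list", length = 4)

# Fill each slot by position; slots may hold objects of different types
results[[1]] <- 1:3
results[[2]] <- "two"
results[[3]] <- c(a = 1.5, b = 2.5)
results[[4]] <- list(TRUE, "nested")

# Name the slots afterwards, as with names(list5) above
names(results) <- c("ints", "text", "named", "nested")

length(results)
sapply(results, class)
```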

# Accessing individual element of a list - Double Square Brackets

# specify either element number or name
list5[[1]]

## First Second Sport

## 1 10 -4 Hockey

## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

list5[["data.frame"]]

## First Second Sport

## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

list5[[1]]$Sport

## [1] Hockey Football Baseball Curlin Rugby Lacrosse

## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

list5[[1]][,"Second"]

## [1] -4 -3 -2 -1 0 1 2 3 4 5

list5[[1]][,"Second", drop = FALSE]

## Second
## 1 -4
## 2 -3
## 3 -2
## 4 -1
## 5 0
## 6 1
## 7 2
## 8 3
## 9 4
## 10 5
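
# The same single versus double bracket distinction applies to lists
# themselves. A minimal sketch, assuming a small hypothetical list "demo":
# single brackets return a (shorter) list, double brackets and $ return the
# element itself.

```r
# Hypothetical two-element list
demo <- list(numbers = 1:10, chars = c("a", "b", "c"))

single <- demo["numbers"]    # still a list, of length 1
double <- demo[["numbers"]]  # the integer vector stored in that slot
dollar <- demo$numbers       # same as double brackets with a name

class(single)
class(double)
```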

# LENGTH OF LIST
length(list5)

## [1] 3

names(list5)

## [1] "data.frame" "vector" "list"

list5

4.6. Reading data in R

Vaibhav Kumar

18/12/2020
# It is time to load data into R.
# The most common way to get data is to read comma-separated values (CSV).

# Reading CSVs
#theUrl = "https://www.jaredlander.com/data/RetailFood.csv"
# visit https://www.jaredlander.com/data/ for other Datasets
#RetailFood = read.table(file=theUrl, header=TRUE, sep =",") # values are separated by "," and header=TRUE reads the first row as column names
#head(RetailFood) # to read first 6 rows with all columns

# We can also use read.csv instead of read.table. It works on the file's contents, not on its extension.
# It might be tempting to use read.csv, but all it does is call read.table with some arguments preset (such as header = TRUE and sep = ",").

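# Sometimes CSVs (or tab-delimited files) are poorly built,
# where the cell separator has been used inside a cell.
# Such cells need to be quoted so that read.table can parse them.
# Note that read.csv2 (or read.delim2) is meant for files that use "," as the
# decimal mark: read.csv2 expects ";" as the separator and read.delim2 expects tabs.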
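
# A minimal, self-contained sketch of the CSV reading described above. The
# file is a temporary one written on the spot and its contents are
# hypothetical; one cell contains a comma and is therefore quoted.

```r
# Write a small hypothetical CSV to a temporary file
csvPath <- tempfile(fileext = ".csv")
writeLines(c('Name,City,Sales',
             'Anna,"Portland, OR",120',
             'Ben,Boston,95'), csvPath)

# read.table with the separator and header given explicitly
viaTable <- read.table(file = csvPath, header = TRUE, sep = ",",
                       quote = "\"", stringsAsFactors = FALSE)

# read.csv presets header = TRUE and sep = ","
viaCsv <- read.csv(csvPath, stringsAsFactors = FALSE)

head(viaCsv)
```

# The quoted cell is kept as one value, so both calls return 2 rows and 3 columns.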

# Reading Excel Data - Not worth the Effort.

# Unfortunately, it is difficult to read Excel data into R - it requires additional packages to be installed.
# Convert into CSV and read.

# Reading Text Files

#Garden = read.table("C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\data.txt",header=TRUE,sep="")
#head(Garden)
# We cannot use a single “\” in the path, as the backslash is the escape character in R strings. Instead of “\” we can use “/” or “\\”.

#R Binary Files
# save the RetailFood data.frame to Disk
#save(RetailFood, file="C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\RetailFood.rdata")
# remove RetailFood from memory
#rm(RetailFood)
# Check if it still exists
#head(RetailFood)
# read it from the rdata file
#load("C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\RetailFood.rdata")
#head(RetailFood)
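
# The save / rm / load cycle above can be tried without the RetailFood file.
# A minimal sketch, assuming a small hypothetical data.frame "scores" and a
# temporary file instead of the path used above:

```r
# Hypothetical data.frame standing in for RetailFood
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))
original <- scores

# Save it to a temporary .rdata file
rdataPath <- tempfile(fileext = ".rdata")
save(scores, file = rdataPath)

# Remove it from memory and check that it is gone
rm(scores)
goneAfterRm <- !exists("scores", inherits = FALSE)

# load() restores the object under its original name
# and returns the names of the objects it restored
loaded <- load(rdataPath)
head(scores)
```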

# Read data from anywhere in the Disk/Computer

# myData = read.csv(file.choose()) # No working directory setup is needed.
# but if we use file.choose, there are issues with header.

4.7 Built in Data in R

Vaibhav Kumar

18/12/2020
# R and its add-on packages (which need to be installed) contain various data sets
# Built-in datasets in R
# for example
# data()# List of built-in Datasets in R. Open in different tab.

# Loading
# data(mtcars)
# Print the first 6 rows
# head(mtcars, 6)

5.1 Basic statistics – Mean, Median
Vaibhav Kumar

18/12/2020
# Basic Statistics - Mean, Variances,Correlations and T-tests

# Generate a random sample of 20 numbers between 1 and 100

x = sample(x=1:100, size = 20, replace = TRUE) # replace = TRUE allows repeated values; use replace = FALSE for unique values
x # the output of "x" is a vector of data

## [1] 96 71 18 26 15 10 18 39 28 68 58 13 25 38 57 95 60 22 89 93

# Simple Arithmetic Mean

mean(x)

## [1] 46.95

# Calculate Mean when Missing Data is found

y = x # copy x to y
y = sample(x=1:100, size = 20, replace = FALSE) # y is overwritten with a fresh sample of unique values; no element is set to NA here
y

## [1] 84 57 86 18 12 48 23 93 69 20 29 70 7 66 82 76 65 99 67 56

y = sample(x=1:100, size = 20, replace = FALSE)
y # print the new sample

## [1] 49 63 30 73 61 53 72 38 94 70 41 33 60 58 75 47 95 11 39 46

mean(y) # would return NA if y contained missing values (NA); this y has none, so the mean is returned

## [1] 55.4

# Remove missing value(s)and calculate mean

mean(y, na.rm=TRUE) # Now, it will give the mean value

## [1] 55.4
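
# Since the y above contains no missing values, the effect of na.rm is not
# visible in that output. A minimal sketch with a hypothetical vector "z"
# that does contain NA:

```r
# Hypothetical vector with two missing values
z <- c(12, 7, NA, 30, NA, 21)

sum(is.na(z))          # number of missing values

mean(z)                # NA, because any NA propagates through the sum

mean(z, na.rm = TRUE)  # mean of the four observed values: 12, 7, 30, 21
```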

# Weighted Mean
Grades = c(65,90,54,78)
Weights = c(1/8, 1/8, 1/4, 1/2)
mean(Grades)# Simple Arithmetic mean

## [1] 71.75

weighted.mean(x = Grades, w = Weights)# Weighted Mean

## [1] 71.875
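
# The weighted mean is sum(value * weight) / sum(weight). A short sketch that
# recomputes the result above by hand; the names "manual" and "builtin" are
# only for illustration.

```r
Grades  <- c(65, 90, 54, 78)
Weights <- c(1/8, 1/8, 1/4, 1/2)

# 65/8 + 90/8 + 54/4 + 78/2 = 8.125 + 11.25 + 13.5 + 39, divided by the weight total (1)
manual  <- sum(Grades * Weights) / sum(Weights)
builtin <- weighted.mean(x = Grades, w = Weights)

manual
builtin
```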

#Variance
var(x)

## [1] 909.4184

#Calculating Variance using formula!
sum((x-mean(x))^2)/ (length(x)-1)

## [1] 909.4184

# Standard Deviation
sqrt(var(x)) #square root of variance

## [1] 30.15657

sd(x)

## [1] 30.15657

sd(y)

## [1] 21.10226

sd(y, na.rm=TRUE) # to remove NA and calculate standard deviation

## [1] 21.10226

# Other Commonly Used Functions

min(x)

## [1] 10

max(x)

## [1] 96

median(x)

## [1] 38.5

min(y)

## [1] 11

min(y, na.rm=TRUE)

## [1] 11

# Summary Statistics
summary(x) # it provides min, max, median, mean, 1st qu. and 3rd qu.

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 10.00 21.00 38.50 46.95 68.75 96.00

summary(y) # min, quartiles and max are the values a box plot displays

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 11.0 40.5 55.5 55.4 70.5 95.0

# Quantiles
quantile(x, probs = c(0.25, 0.75)) # Calculate 25th and 75th Quantile

## 25% 75%
## 21.00 68.75

quantile(x, probs = c(0.1,0.25,0.5, 0.75,0.99)) # to calculate quantiles at the specified probabilities

## 10% 25% 50% 75% 99%

## 14.80 21.00 38.50 68.75 95.81

quantile(y, probs = c(0.25, 0.75)) # Calculate 25th and 75th Quantile

## 25% 75%
## 40.5 70.5

quantile(y, probs = c(0.25, 0.75), na.rm = TRUE)

## 25% 75%
## 40.5 70.5

# Correlation and Covariance

#install.packages("ggplot2")
library(ggplot2)# require(ggplot2)
head(economics)# Built-in dataset in ggplot2 package

## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018

cor(economics$pce, economics$psavert) # pce - Personal Consumption Expenditure; psavert - Personal Savings Rate

## [1] -0.7928546
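
# Pearson correlation is the covariance divided by the product of the two
# standard deviations. A minimal sketch that checks this against cor(); it
# uses the built-in mtcars data (an assumption made here so that no extra
# package is needed) rather than the economics data.

```r
# Two numeric columns from the built-in mtcars data
a <- mtcars$mpg
b <- mtcars$wt

manualCor  <- cov(a, b) / (sd(a) * sd(b))
builtinCor <- cor(a, b)

manualCor
builtinCor
```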

# To compare correlation for Multiple variables

cor(economics[, c(2,4:6)]) #correlation between 2,4,5,6 columns

## pce psavert uempmed unemploy

## pce 1.0000000 -0.7928546 0.7269616 0.6145176
## psavert -0.7928546 1.0000000 -0.3251377 -0.3093769
## uempmed 0.7269616 -0.3251377 1.0000000 0.8693097
## unemploy 0.6145176 -0.3093769 0.8693097 1.0000000

# Display Correlation in Different Format!

# Let's install the required packages and load them into this R environment.

# Load the "reshape2" package

#install.packages("reshape2")
require(reshape2)

## Loading required package: reshape2

# Also load the Scales package for some extra plotting features
#install.packages("scales")
library(scales)

econCor = cor(economics [ , c(2,4:6)])

# use "melt()" to change into long format
#?melt() # Help on melt function
econMelt = melt(econCor, varnames = c("x" ,"y"), value.name =
"Correlation")
# Order it according to correlation
econMelt = econMelt[order(econMelt$Correlation),]
# Display the melted data
econMelt

## x y Correlation
## 2 psavert pce -0.7928546
## 5 pce psavert -0.7928546
## 7 uempmed psavert -0.3251377
## 10 psavert uempmed -0.3251377
## 8 unemploy psavert -0.3093769
## 14 psavert unemploy -0.3093769
## 4 unemploy pce 0.6145176
## 13 pce unemploy 0.6145176
## 3 uempmed pce 0.7269616
## 9 pce uempmed 0.7269616
## 12 unemploy uempmed 0.8693097
## 15 uempmed unemploy 0.8693097
## 1 pce pce 1.0000000
## 6 psavert psavert 1.0000000
## 11 uempmed uempmed 1.0000000
## 16 unemploy unemploy 1.0000000

# Let's Visualize Correlation

## Plot it with ggplot
# Initialize the plot with x and y on the respective axes
ggplot(econMelt, aes(x=x, y=y)) +
# draw tiles filled according to the correlation
geom_tile(aes(fill = Correlation)) +
# three-colour gradient over the full correlation range
scale_fill_gradient2(low = muted("red"), mid = "white", high = "steelblue",
guide = guide_colorbar(ticks=FALSE, barheight=10), limits = c(-1,1)) +
theme_minimal() +
labs(x = NULL, y = NULL)

5.2. Correlation Heatmap
Vaibhav Kumar

18/12/2020
# Correlation

# Prepare the Data

mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)

## mpg disp hp drat wt qsec

## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22

# Compute the correlation matrix - cor()

cormat <- round(cor(mydata),2)
head(cormat)

## mpg disp hp drat wt qsec

## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43

## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00

# Create the correlation heatmap with ggplot2

# The package reshape2 is required to melt the correlation matrix.
library(reshape2)
melted_cormat <- melt(cormat)
head(melted_cormat)

## Var1 Var2 value

## 1 mpg mpg 1.00
## 2 disp mpg -0.85
## 3 hp mpg -0.78
## 4 drat mpg 0.68
## 5 wt mpg -0.87
## 6 qsec mpg 0.42

#The function geom_tile() [ggplot2 package] is used to visualize the correlation matrix:
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()

# Does not look great. Let's enhance the viz!

#Get the lower and upper triangles of the correlation matrix

## A correlation matrix has redundant information. We'll use the functions below to set half of it to NA.

# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}

upper_tri <- get_upper_tri(cormat)

upper_tri

## mpg disp hp drat wt qsec

## mpg 1 -0.85 -0.78 0.68 -0.87 0.42
## disp NA 1.00 0.79 -0.71 0.89 -0.43
## hp NA NA 1.00 -0.45 0.66 -0.71
## drat NA NA NA 1.00 -0.71 0.09
## wt NA NA NA NA 1.00 -0.17
## qsec NA NA NA NA NA 1.00

# Finished correlation matrix heatmap

## Melt the correlation data and drop the rows with NA values
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()

# Negative correlations are in blue and positive correlations in red.
# The function scale_fill_gradient2 is used with the argument limit = c(-1,1) as correlation coefficients range from -1 to 1.
# coord_fixed(): this function ensures that one unit on the x-axis is the same length as one unit on the y-axis.

# Reorder the correlation matrix

# This section describes how to reorder the correlation matrix according to the correlation coefficient.
# This is useful to identify the hidden pattern in the matrix.
# hclust for hierarchical clustering order is used in the example below.

reorder_cormat <- function(cormat){

# Use correlation between variables as distance
dd <- as.dist((1-cormat)/2)
hc <- hclust(dd)
cormat <- cormat[hc$order, hc$order]
return(cormat) # return the reordered matrix explicitly
}
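
# A short sketch of what the distance (1 - r) / 2 does: it maps a correlation
# of 1 to distance 0 and a correlation of -1 to distance 1, so strongly
# positively correlated variables end up next to each other. The column
# subset of mtcars and the names "cm", "ord" and "reordered" are assumptions
# for illustration only.

```r
# Correlations from -1 to 1 and the distances they are mapped to
r <- c(-1, -0.5, 0, 0.5, 1)
d <- (1 - r) / 2
d

# Reordering only permutes rows and columns; no value is changed
cm <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])
ord <- hclust(as.dist((1 - cm) / 2))$order
reordered <- cm[ord, ord]
rownames(reordered)
```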

# Reorder the correlation matrix

cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",

midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+ # minimal theme
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
# Print the heatmap
print(ggheatmap)

#Add correlation coefficients on the heatmap

## Use geom_text() to add the correlation coefficients on the graph

## Use a blank theme (remove axis labels, panel grids and background, and axis ticks)
## Use guides() to change the position of the legend title

ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))

5.3. Hypothesis Testing
Vaibhav Kumar

18/12/2020
# T-tests
# Dataset: Tips dependents on...
data(tips, package = "reshape2")
head(tips)

## total_bill tip sex smoker day time size

## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4

str(tips) # to find out no. of levels

## 'data.frame': 244 obs. of 7 variables:

## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...

write.csv(tips, "C:\\Users\\Vaibhav\\Documents\\R\\Basic statistics\\tips.csv", row.names = FALSE) # to save tips as a CSV file (which Excel can open) on the computer

# Gender
unique(tips$sex) # levels

## [1] Female Male

## Levels: Female Male

#Day of the week

unique(tips$day) # levels

## [1] Sun Sat Thur Fri

## Levels: Fri Sat Sun Thur

#One Sample t-test - ONE GROUP [Two Tail. Ho:Mean = 2.5]

t.test(tips$tip, alternative = "two.sided", mu=2.5)

##
## One Sample t-test
##
## data: tips$tip
## t = 5.6253, df = 243, p-value = 5.08e-08
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 2.823799 3.172758
## sample estimates:
## mean of x
## 2.998279

# Result - the p-value is less than 0.05, so we reject the null hypothesis: the mean tip is not equal to 2.5.
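
# The t statistic reported by t.test is (sample mean - mu) / (sd / sqrt(n))
# with n - 1 degrees of freedom. A minimal sketch that recomputes it by hand;
# it uses the built-in mtcars$mpg column and a hypothetical mu of 20 (both
# assumptions, chosen so that no extra package is needed).

```r
v   <- mtcars$mpg   # sample
mu0 <- 20           # hypothesised mean (hypothetical value)
n   <- length(v)

# t statistic and two-sided p-value by hand
tManual <- (mean(v) - mu0) / (sd(v) / sqrt(n))
pManual <- 2 * pt(-abs(tManual), df = n - 1)

# The same test with t.test
res <- t.test(v, alternative = "two.sided", mu = mu0)

c(manual = tManual, t.test = unname(res$statistic))
c(manual = pManual, t.test = res$p.value)
```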

#One Sample t-test - Upper Tail. Ho:Mean LE 2.5

t.test(tips$tip, alternative = "greater", mu=2.5)

##
## One Sample t-test
##
## data: tips$tip
## t = 5.6253, df = 243, p-value = 2.54e-08
## alternative hypothesis: true mean is greater than 2.5
## 95 percent confidence interval:
## 2.852023 Inf
## sample estimates:
## mean of x
## 2.998279

# Result - the p-value is less than 0.05, so we reject the null hypothesis (mean <= 2.5): the mean tip is greater than 2.5.

# Two Sample T-test - TWO GROUP

t.test(tip ~ sex, data = tips, var.equal = TRUE) # Male and Female are independent groups. Assuming the variance for both is equal.

##
## Two Sample t-test
##
## data: tip by sex
## t = -1.3879, df = 242, p-value = 0.1665
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6197558 0.1074167
## sample estimates:
## mean in group Female mean in group Male
## 2.833448 3.089618

# Result - the p-value is greater than 0.05, so we fail to reject the null hypothesis: there is no significant evidence that the average tips given by males and females differ.

#Paired Two-Sample T-Test

# Dataset: Heights of Father and Son (Package:UsingR)

# used when samples are dependent

#install.packages("UsingR")
require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

## Loading required package: HistData

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

##
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':

##
## format.pval, units

##
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':

##
## cancer

head(father.son)

## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221

write.csv(father.son, "C:\\Users\\Vaibhav\\Documents\\R\\Basic statistics\\father_son.csv", row.names = FALSE)
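
# The section above loads the paired data but does not show the test itself.
# A minimal sketch of a paired t-test, assuming two small hypothetical vectors
# "before" and "after" measured on the same subjects. A paired test is the
# same as a one-sample t-test on the pairwise differences.

```r
# Hypothetical paired measurements (same six subjects, two occasions)
before <- c(72, 68, 75, 80, 66, 71)
after  <- c(70, 69, 71, 77, 65, 68)

# Paired two-sample t-test
paired <- t.test(before, after, paired = TRUE)

# Equivalent one-sample t-test on the differences
onDiff <- t.test(before - after, mu = 0)

paired$statistic
onDiff$statistic

# For the father.son data, the analogous call (assumed, not run here) would be:
# t.test(father.son$fheight, father.son$sheight, paired = TRUE)
```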

#ANOVA – Analysis of variance

#ANOVA - Comparing Multiple Groups

# Tip by the Day of the Week
str(tips)

## 'data.frame': 244 obs. of 7 variables:

tipAnova = aov(tip ~ day, tips) # to test whether the mean tip is the same on all days or not.
summary(tipAnova)

## Df Sum Sq Mean Sq F value Pr(>F)

## day 3 9.5 3.175 1.672 0.174
## Residuals 240 455.7 1.899

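# Day DF is c - 1 and residuals DF is (no. of samples) - c. Here c = 4 and the number of samples is 244, so the DFs are 3 and 240, and F value = 3.175/1.899 = 1.672.
# Pr(>F) = 0.174 is greater than 0.05, so we fail to reject the hypothesis that the mean tip is the same on all days.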
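
# The degrees of freedom and the F value can be recomputed by hand. A minimal
# sketch using the built-in PlantGrowth data (an assumption, chosen so that no
# extra package is needed): F is the between-group mean square divided by the
# within-group (residual) mean square.

```r
g <- PlantGrowth$group    # factor with the group of each observation
w <- PlantGrowth$weight   # numeric response

k <- nlevels(g)           # number of groups (c in the notes above)
n <- length(w)            # number of samples

groupMeans <- tapply(w, g, mean)
groupSizes <- tapply(w, g, length)

# Between-group and within-group sums of squares
ssBetween <- sum(groupSizes * (groupMeans - mean(w))^2)
ssWithin  <- sum((w - groupMeans[as.character(g)])^2)

# F = mean square between / mean square within
fManual <- (ssBetween / (k - 1)) / (ssWithin / (n - k))

# The same F value from aov
fAov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]

c(df_group = k - 1, df_residuals = n - k)
c(manual = fManual, aov = fAov)
```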

6. Learning from Assignment

• Understand R coding basics.
• Benefits of R over other languages.
• Understand the use of vectors in R.
• Learn to read data from external sources.
• Understand different types of data structure (Data Frame, Matrices, Array & List).
• Understand how to use the various data sets available in R.
• Understand null and alternative hypotheses.
• Understand how to use T-test and ANOVA in R.
