R Programming

The document discusses various R data types and structures including vectors, factors, matrices and lists. It covers creating and manipulating vectors, performing operations on vectors, and sequencing values using colon operations. Functions, data types such as numeric, integer, character, logical and date are also explained.

Uploaded by

comedynights
Table of Contents

1.1. R intro & Mathematical Operation
1.2 - Function
2.1. Data Type
3.1. Vectors
3.2. Factors in R
3.3. Missing data
4.1. Data Structure Intro
4.2. Data structure – Data Frame
4.3 Data Structure Matrices
4.4 Data Structure Array
4.5. Data Structures List
4.6. Reading data in R
4.7 Built in Data in R
5.1 Basic statistics – Mean, Median
5.2. Correlation Heatmap
5.3. Hypothesis Testing
6. Learning from Assignment

1.1.R intro & Mathematical Operation
16/12/2020
# Useful Shortcuts
# To Clear the R-Console - Ctrl + L
# To Execute a particular line of Code - Ctrl + Enter

# Intro and Mathematical operation


a = 5 # here a is a variable and value assigned to a is 5
# 1. Command Line Interface
a

## [1] 5 # execution

b = 10
b

## [1] 10

# 2. Objects do not need to be declared explicitly; many other languages require a declaration first.
a = 5
class(a) # To know the data type of a.

## [1] "numeric"

a = "Hello"
class(a)

## [1] "character"

a = TRUE
class(a)

## [1] "logical"

a = FALSE
class (a)

## [1] "logical"

# Object Assignments and Simple Calculations


x = 10
y = 15
x+y

## [1] 25

x-y

## [1] -5

x*y

## [1] 150

x/y

## [1] 0.6666667

sqrt(x)

## [1] 3.162278

x^y

## [1] 1e+15

exp(x) # exponential

## [1] 22026.47

log(x, base=exp(1))

## [1] 2.302585

log10(x)

## [1] 1

factorial(x)

## [1] 3628800

cos(x)

## [1] -0.8390715

abs(x) # for absolute value of x

## [1] 10

1.2 - Function
17/12/2020

getwd() # Get Working Directory

## [1] "C:/Users/Vaibhav/Documents/R/1. R intro 15 Dec 2020"

# Functions in R
# to create a function with function name divider
divider = function(x,y) {
result = x/y
print(result)
}
divider(50,25) # x is assigned 50 and y is assigned 25

## [1] 2

divider (100,25) # only need to pass specific values of x and y to execute the function

## [1] 4

#function for Multiplication


multiply = function(a,b){
result = a * b
print (result)
}
multiply(23,25)

## [1] 575

multiply (19,20)

## [1] 380
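The two functions above print their result. A common variation, sketched here with a hypothetical name `divide_safe` that is not part of the original notes, returns the value instead, so it can be stored and reused in later calculations:

```r
# Hypothetical helper (not in the original notes): returns the quotient
# instead of printing it, and stops early on division by zero.
divide_safe = function(x, y) {
  if (y == 0) {
    stop("y must not be zero")
  }
  x / y   # the last evaluated expression is the function's return value
}

q1 = divide_safe(50, 25)   # store the result in a variable
q1 * 10                    # reuse it in a later calculation
```

Returning the value keeps the function usable inside larger expressions; printing can still be done by the caller when needed.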

# Variable names are CASE SENSITIVE (A and a are different variables)


A=10
a=24

# CONCATENATION AND ARRAYS (append and join values)


f <- c(1,2,3,4,5) # "<-" is the traditional assignment operator; "=" also works and is used in these notes.
# c - combine. Combine these values as a vector.

f = c(1,2,3,4,5)
f
## [1] 1 2 3 4 5

f+4 # 4 will be added to all values

## [1] 5 6 7 8 9

d = f / 4 # All values will be divided by 4


d

## [1] 0.25 0.50 0.75 1.00 1.25

f+d

## [1] 1.25 2.50 3.75 5.00 6.25

f = c(1,2,3,4,5)

# Listing and Deleting Objects (Variables)


ls() # what all objects we have created

## [1] "a" "A" "d" "divider" "f" "multiply"

rm (a) # to remove particular object


ls()

## [1] "A" "d" "divider" "f" "multiply" # "a" is removed.

rm (list = ls()) # to remove all variables (after executing this, the environment will be empty)
ls()

## character(0)

2.1. Data Type


17/12/2020

Last Topic revision


# Line-by-line execution of commands - R is an interpreter, not a compiler.
# Variables do not need to be declared explicitly.

#A = 10
#Variable /Object -- > A (Case Sensitive)
#Value = 10
#Read from right to left.
# <- or = # Assignment.
# Simple Mathematical Operations.
# Remove the objects or variables created.

Current Topic
# 4 DATA TYPES (measurement scales): Nominal, Ordinal, Interval and Ratio (NOIR)
# "Two brains": we think in NOIR, while R (the system) thinks in Numeric, Character, Logical and Date, stored in vectors.
# We have to map our understanding onto R's data types.
# DATA TYPES
x = 10
class(x)

## [1] "numeric"

# Numeric - Integer and Decimal. In R: Integer (whole number) and Numeric (float / decimal)
i = 5L # for an integer we need to add the suffix L explicitly
class(i)

## [1] "integer"

is.integer(i)

## [1] TRUE

is.numeric(x)

## [1] TRUE
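As a small illustration of the integer/numeric distinction above (a sketch using only base R; the variable names simply mirror the ones in these notes), the two types hold the same value but are stored differently:

```r
i = 5L                 # integer
x = 5                  # numeric (stored as a double)
typeof(i)              # "integer"
typeof(x)              # "double"
is.numeric(i)          # TRUE: integers also count as numeric
identical(i, x)        # FALSE: same value, different storage type
i == x                 # TRUE: comparison is by value
class(i / 2L)          # "numeric": division always gives a decimal result
class(i * 2L)          # "integer": multiplying two integers stays integer
```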

# Character - Categorical Variable - Words/String (Nominal), Classification (Gender - Male, Female)
s = "R_Studio"
class(s)

## [1] "character"

# Levels of Classification - Factor --- Involves levels.(Ordinal)


# Eg: Edu Quali - X, XII, Graduation, Post Graduation (4 Levels)

# Logical - TRUE (1) and FALSE (0)


TRUE * 5

## [1] 5 # As for R TRUE is 1, so 1*5 = 5

FALSE * 5

## [1] 0 # As for R FALSE is 0, so 0*5 = 0

K = TRUE
class(K)

## [1] "logical"

is.logical(K)

## [1] TRUE

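# Date - stored internally as a numeric value counted from a starting date.

# In R the starting date is 1 Jan 1970; a Date is the number of days since then.
# Date - as.Date() expects yyyy-mm-dd by default; other layouts such as mm/dd/yyyy need a format argument.
# POSIXct - Date plus Time, stored as the number of seconds since 1 Jan 1970.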

date1 = as.Date("2012-06-28")
# as.Date()# Auto complete # How to enter
# ? as.Date # for help
date1

## [1] "2012-06-28"

class (date1)

## [1] "Date"

as.numeric(date1)

## [1] 15519

#POSIXct - Date and Time


date2 = as.POSIXct("2012-06-28 17:42")
date2

## [1] "2012-06-28 17:42:00 IST"

class(date2)

## [1] "POSIXct" "POSIXt"

as.numeric(date2)

## [1] 1340885520
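The numeric values above are counts from 1 Jan 1970: days for Date, seconds for POSIXct. The sketch below, which assumes a day/month/year input string chosen purely for illustration, shows how a non-default layout is parsed and how date arithmetic follows from the numeric storage:

```r
# Dates written day/month/year need an explicit format string
d1 = as.Date("28/06/2012", format = "%d/%m/%Y")
d1                                       # same day as as.Date("2012-06-28")
as.numeric(d1)                           # 15519 days since 1970-01-01
as.Date(15519, origin = "1970-01-01")    # convert the count back to a Date
d1 + 30                                  # adding a number moves the date by days
as.numeric(as.Date("2012-12-31") - d1)   # 186 days left in 2012
format(d1, "%d %B %Y")                   # printed form depends on the locale
```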

3.1. Vectors
17/12/2020
# Vector - R is called a vectorized language.
# Array - n-dimensional collection of elements of the same type
# Matrix - a special case of an array (two-dimensional array). A matrix generally contains numeric values.
# R uses the vectorized form for calculations (e.g. when solving linear regression).

# A vector is a collection of elements of the same type.
# (i.e.) A vector cannot be of mixed type.
# R is a vectorized language. That means operations are applied to each element of the vector automatically,
# ..., without the need to loop through the vector.
# This is a powerful concept, and vectors play a crucial and significant role in R.

# Creating Vectors
# The most common way to create a Vector is using 'c' [combine]
x = c(1,2,3,4,5,6,7,8,9,10) # c - combine. Combine these values as a vector.
x

## [1] 1 2 3 4 5 6 7 8 9 10

# Vector Operations
x*3 # multiplies each element by 3; No loops necessary!

## [1] 3 6 9 12 15 18 21 24 27 30

x+2

## [1] 3 4 5 6 7 8 9 10 11 12

x-3

## [1] -2 -1 0 1 2 3 4 5 6 7

x/4

## [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50

x^2

## [1] 1 4 9 16 25 36 49 64 81 100

sqrt(x)

## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278

# colon (:) operation – Sequencing (values from one number to another)


# Creates sequence of Numbers in either direction!
1:10 # all number from 1 to 10

## [1] 1 2 3 4 5 6 7 8 9 10

10:1

## [1] 10 9 8 7 6 5 4 3 2 1

-2:3

## [1] -2 -1 0 1 2 3

5:-7

## [1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7

# More on Vector Operations ... Two vectors


# create two vectors of equal length
x = 1:10
y = -5:4
x + y # Add

## [1] -4 -2 0 2 4 6 8 10 12 14

x-y

## [1] 6 6 6 6 6 6 6 6 6 6

x*y

## [1] -5 -8 -9 -8 -5 0 7 16 27 40

x/y

## [1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5

x^y

## [1] 1.000000e+00 6.250000e-02 3.703704e-02 6.250000e-02 2.000000e-01


## [6] 1.000000e+00 7.000000e+00 6.400000e+01 7.290000e+02 1.000000e+04

# check the length of each vector


length(x)

## [1] 10

length(y)

## [1] 10

# Unequal length vectors
x+c(1,2) # 1 & 2 will be added repeatedly.

## [1] 2 4 4 6 6 8 8 10 10 12

x+c(1,2,3) # If the longer vector's length is not a multiple of the shorter vector's length, a warning is given!

## Warning in x + c(1, 2, 3): longer object length is not a multiple of shorter
## object length

## [1] 2 4 6 5 7 9 8 10 12 11
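The behaviour shown above is usually called recycling: the shorter vector is repeated until it matches the longer one. A minimal sketch that spells the repetition out with `rep()` (the names `short` and `recycled` are illustrative only):

```r
x = 1:10
short = c(1, 2)

# Recycling spelled out: repeat the shorter vector to the length of x
recycled = rep(short, length.out = length(x))
recycled                              # 1 2 1 2 1 2 1 2 1 2
identical(x + short, x + recycled)    # TRUE: both give the same sums

# When the lengths do not divide evenly, the last cycle is cut short,
# which is why R warns in that case
rep(c(1, 2, 3), length.out = 10)      # 1 2 3 1 2 3 1 2 3 1
```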

# Comparison also work on vector!


x <= 5

## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

x<y

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# Vector Comparison - "any" and "all"


x = 10:1
y = -4:5
any(x<y)

## [1] TRUE

all(x<y)

## [1] FALSE

# The "nchar" function also acts on each element of vector.


q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
"Basketball","Tennis","Cricket","Soccer")
q

## [1] "Hockey" "Football" "Baseball" "Curlin" "Rugby"


## [6] "Lacrosse" "Basketball" "Tennis" "Cricket" "Soccer"

nchar(q) # no. of characters

## [1] 6 8 8 6 5 8 10 6 7 6

nchar(y) # no. of characters in each number, including the minus sign

## [1] 2 2 2 2 1 1 1 1 1 1

# Subscripting: accessing "individual elements" of a vector is done using square brackets []
x[1]

## [1] 10

x[1:2]

## [1] 10 9

x[c(1:5,9)]

## [1] 10 9 8 7 6 2

# Give Names to Vector!


c(One = "a", Two = "y", Last = "r") # Name-Value pair

## One Two Last


## "a" "y" "r"

# You can Name the vector after creating vector as well!


w = 1:3
names(w) = c("a","b","c")
w

## a b c
## 1 2 3
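Once a vector has names, its elements can be looked up by name as well as by position. A short sketch reusing the vector `w` from above:

```r
w = 1:3
names(w) = c("a", "b", "c")

w["b"]            # element named "b"; the name is kept in the output
w[c("a", "c")]    # several elements by name
w[["b"]]          # double brackets drop the name and return just the value 2
unname(w)         # the plain values without names: 1 2 3
```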

3.2. Factors in R
17/12/2020
# Factor vectors - categorical data; ordered factors represent ordinal data [Ordered Categorical]
# Factors are an important concept in R, esp. when building models
# Nominal - unordered (sachin, rahul), Ordinal - ordered (supervisor, GM, AM, AGM)
# Nominal - character or unordered factor, Ordinal - ordered factor
q = c("Hockey","Lacrosse","Hockey","Water Polo","Hockey","Lacrosse")
q2 = c(q,"Hockey","Lacrosse","Hockey","Water Polo","Hockey","Lacrosse")
q2

## [1] "Hockey" "Lacrosse" "Hockey" "Water Polo" "Hockey"


## [6] "Lacrosse" "Hockey" "Lacrosse" "Hockey" "Water Polo"
## [11] "Hockey" "Lacrosse"

class(q2)

## [1] "character"

as.numeric(q2)

## Warning: NAs introduced by coercion

## [1] NA NA NA NA NA NA NA NA NA NA NA NA

class(q2)

## [1] "character"

# Converting "q2" to factor!


q2_F = as.factor(q2)
q2_F # notice the "Levels" info in the output!

## [1] Hockey Lacrosse Hockey Water Polo Hockey Lacrosse
## [7] Hockey Lacrosse Hockey Water Polo Hockey Lacrosse
## Levels: Hockey Lacrosse Water Polo

# 3 Levels - the distinct names in "q2": Hockey, Lacrosse and Water Polo
# The "levels" of a factor are the unique values of that factor variable.
# Technically R assigns a "unique integer" to each distinct name, see below
as.numeric(q2_F) # In the O/P --> notice "1" = "Hockey"

## [1] 1 2 1 3 1 2 1 2 1 3 1 2

# numbers are allotted to the levels in alphabetical order.

# Ordered Levels and Un-ordered Levels


# Factors can drastically reduce the size of the variable...
# ... because they are storing only unique values!
factor(x=c("High School","College","Masters","Doctorate"),
       levels = c("High School","College","Masters","Doctorate"),
       ordered = TRUE)

## [1] High School College Masters Doctorate
## Levels: High School < College < Masters < Doctorate
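What the ordering buys in practice is that comparisons and functions such as `max()` become meaningful. A minimal sketch, assuming an illustrative vector `edu` that is not part of the original notes:

```r
edu = factor(c("College", "High School", "Doctorate", "Masters"),
             levels = c("High School", "College", "Masters", "Doctorate"),
             ordered = TRUE)

edu < "Masters"     # TRUE TRUE FALSE FALSE: compares by level order
max(edu)            # Doctorate, the highest level present
sort(edu)           # sorted by level order, not alphabetically
as.integer(edu)     # 2 1 4 3: position of each value in the levels
```

The same comparison on an unordered factor would only return NA with a warning, because R has no order to compare against.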

3.3. Missing data


17/12/2020
# Missing data plays a crucial role in computing and Statistics
# R has two types of missing data - NA and NULL
# While they are similar, they behave differently and hence need attention!

# NA - Missing data - Missing Value


z = c(1,2,NA,8,3,NA,3)
z

## [1] 1 2 NA 8 3 NA 3

# "is.na" tests each element of a vector for missingness


is.na(z)

## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE

#Another example
z_char = c("Hockey", NA ,"Cricket")
z_char

## [1] "Hockey" NA "Cricket"

is.na(z_char)

## [1] FALSE TRUE FALSE

# NULL - Absence of anything. It is not exactly missingness, but nothingness
# Eg: Having a brain but thinking nothing!
# Functions can sometimes return NULL and their arguments can be NULL.
# An important difference is that NULL is atomic (a single object of length 0) and cannot exist within a vector...
# ...If used inside a vector, it simply disappears! Let's see...
z= c(1,NULL,3)
z

## [1] 1 3

x = c(1,NA,3)
x

## [1] 1 NA 3

# Notice that the "NULL" did not get stored in "z"; in fact "z" has a length of only 2!
length(z)

## [1] 2 # NULL will not be counted in length

length(x)

## [1] 3 # NA will be counted in length

# Assigning NULL and checking!


d = NULL
is.null(d)

## [1] TRUE
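Because NA stays inside the vector, it also propagates through calculations. The sketch below, using the vector `z` defined earlier with its two missing values, shows the usual ways of working around that in base R:

```r
z = c(1, 2, NA, 8, 3, NA, 3)

mean(z)                  # NA: any missing value makes the result missing
mean(z, na.rm = TRUE)    # 3.4: na.rm = TRUE drops NAs before computing
sum(is.na(z))            # 2: how many values are missing
z[!is.na(z)]             # 1 2 8 3 3: keep only the observed values

length(NULL)             # 0: NULL holds nothing at all
is.null(NA)              # FALSE: NA is a value, NULL is the absence of one
```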

4.1. Data Structure Intro


Vaibhav Kumar

17/12/2020
# Data Structures in R
# Data come in many types and structures which can pose a problem for
some...
# ...analysis environments but R handles them with ease.

## VECTOR
# The most common data structure is the one-dimensional vector
# Vector forms the basis of everything in R.
# A vector is a collection of elements of the same type.
# (i.e.) A vector cannot be of mixed type.
# R is a vectorized language. That means operations are applied to each element of the vector automatically,
# ..., without the need to loop through the vector.
# This is a powerful concept, and vectors play a crucial and significant role in R.

## DATA FRAME
# Data Frames (DF) - one of the most useful features of R & an often-cited reason for R's ease of use.
# In a data frame, each column is actually a vector, and all columns have the same length.
# Each column can hold a different type of data.
# Also, within each column, each element must be of the same type, as in vectors.

## MATRICES
# A matrix (plural matrices) is a rectangular array or table of numbers, symbols, or expressions...
# ..., arranged in rows and columns (i.e.) a 2-dimensional array
# Similar to data.frame (RxC) and also similar to Vector
# Matrix - Element by element operations are possible.

## ARRAYS
# Arrays - An array is essentially a multidimensional vector.
# It must all be of the same type and
# ...individual elements are accessed using Square Brackets.
# First element is Row(R) Index, Second Element is Column(C) Index and
# the remaining elements are for Outer Dimensions (OD).

## LIST
# Lists - Stores any number of items of any type.
# List can contain all numerics or characters or...
#...a mix of the two or data.frames or recursively other lists.

# Sometimes data requires more complex storage than simple vectors.
# Data Structures - apart from Vectors, we have Data Frames, Matrices, Lists and Arrays.
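As a quick orientation before the detailed sections, the five structures described above can each be created in one line. This is only a sketch with illustrative object names; each structure is covered properly in its own section:

```r
v   = c(1.5, 2, 3)                                    # vector: one dimension, one type
df  = data.frame(id = 1:3, name = c("a", "b", "c"))   # data frame: columns of different types
m   = matrix(1:6, nrow = 2)                           # matrix: 2 rows x 3 columns, one type
arr = array(1:24, dim = c(2, 3, 4))                   # array: rows x columns x outer dimension
lst = list(v, df, m, "text")                          # list: any mix of objects

# First class label of each object
sapply(list(v, df, m, arr, lst), function(obj) class(obj)[1])
dim(arr)                                              # 2 3 4
```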

4.2. Data structure – Data Frame

# Data Frames (DF) - one of the most useful features of R & an often-cited reason for R's ease of use.
# In a data frame, each column is actually a vector, and all columns have the same length.
# Each column can hold a different type of data.
# Also, within each column, each element must be of the same type, as in vectors.

# Creating a Dataframe from vectors

x = 10:1
y = -4:5
q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
"Basketball","Tennis","Cricket","Soccer")
# to combine these 3 vectors, we will use data frame
theDF = data.frame(x,y,q) # this would create a 10x3 data.frame with x, y and q as variable names
theDF

## x y q
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

str(theDF) # This gives the structure of the data frame, such as data types, levels etc.

## 'data.frame': 10 obs. of 3 variables:


## $ x: int 10 9 8 7 6 5 4 3 2 1
## $ y: int -4 -3 -2 -1 0 1 2 3 4 5
## $ q: chr "Hockey" "Football" "Baseball" "Curlin" ...

q = as.factor(q) # to convert q into a factor and then assign it to q.

# Assigning names to column variables


theDF = data.frame (First=x, Second =y, Sport = q)
theDF

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

# Checking the dimensions of the DF.


nrow(theDF)

## [1] 10

ncol(theDF)

## [1] 3

dim(theDF)

## [1] 10 3

names (theDF)

## [1] "First" "Second" "Sport"

names(theDF)[3] # whenever we require specific row or column, we use[].

## [1] "Sport"

rownames(theDF)

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

# Head and Tail


head(theDF)# First 6 rows with all columns

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse

head(theDF, n=7) #first 7 rows with all columns

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball

tail(theDF)# last six rows with all columns

## First Second Sport


## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

class(theDF)

## [1] "data.frame"

# Accessing Individual Column using $


theDF$Sport # gives the third column named Sport

## [1] Hockey Football Baseball Curlin Rugby Lacrosse
## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

# Accessing Specific row and column


theDF[3,2] # 3rd row and 2nd Column

## [1] -2

theDF[3,2:3] # 3rd Row and column 2 thru 3

## Second Sport
## 3 -2 Baseball

theDF[c(3,5), 2]# Row 3&5 from Column 2;

## [1] -2 0

# since only one column was selected, it was returned as a vector and hence there are no column names in the output.

# Rows 3&5 and Columns 2 through 3


theDF[c(3,5), 2:3]

## Second Sport
## 3 -2 Baseball
## 5 0 Rugby

theDF[ ,3] # Access all Rows for column 3

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

theDF[ , 2:3] # Access all Rows for column 2 & 3

## Second Sport
## 1 -4 Hockey
## 2 -3 Football
## 3 -2 Baseball
## 4 -1 Curlin
## 5 0 Rugby
## 6 1 Lacrosse
## 7 2 Basketball
## 8 3 Tennis
## 9 4 Cricket
## 10 5 Soccer

theDF[2,]# Access all columns for Row 2

## First Second Sport


## 2 9 -3 Football

theDF[2:4,]

## First Second Sport


## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin

theDF[ , c("First", "Sport")]# access using Column Names

## First Sport
## 1 10 Hockey
## 2 9 Football
## 3 8 Baseball
## 4 7 Curlin
## 5 6 Rugby
## 6 5 Lacrosse
## 7 4 Basketball
## 8 3 Tennis
## 9 2 Cricket
## 10 1 Soccer

theDF[ ,"Sport"]# Access specific Column

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

class(theDF[ ,"Sport"])

## [1] "factor"

theDF["Sport"]# This returns the one column data.frame

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF["Sport"]) # Data.Frame

## [1] "data.frame"

theDF[["Sport"]]#To access Specific column using Double Square Brackets

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

class(theDF[["Sport"]]) # Factor

## [1] "factor"

theDF[ ,"Sport", drop = FALSE] # Use "drop = FALSE" to get a data.frame with single square brackets.

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF[ ,"Sport", drop = FALSE]) # data.frame

## [1] "data.frame"

theDF[ ,3, drop = FALSE]

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF[ ,3, drop = FALSE]) # data.frame

## [1] "data.frame"

# To see how factor is stored in data.frame


newFactor = factor(c("Pennsylvania","New York","New Jersey","New York",
                     "Tennessee","Massachusetts","Pennsylvania","New York"))
newFactor

## [1] Pennsylvania New York New Jersey New York Tennessee

## [6] Massachusetts Pennsylvania New York


## Levels: Massachusetts New Jersey New York Pennsylvania Tennessee

# model.matrix(~newFactor -1)
# ? model.matrix # To be understood
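The commented-out `model.matrix` call above is left unexplained in the notes. As a hedged sketch of what it does: with the formula `~ newFactor - 1` (no intercept), it should produce one indicator (0/1) column per factor level, which is how factors are typically expanded when building models:

```r
newFactor = factor(c("Pennsylvania", "New York", "New Jersey", "New York",
                     "Tennessee", "Massachusetts", "Pennsylvania", "New York"))

# One indicator column per level; "- 1" removes the intercept column
mm = model.matrix(~ newFactor - 1)

dim(mm)        # 8 rows (observations) x 5 columns (levels)
colnames(mm)   # column names are "newFactor" followed by the level name
rowSums(mm)    # every row contains exactly one 1
colSums(mm)    # counts per level, matching table(newFactor)
```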

4.3 Data Structure Matrices
Vaibhav Kumar

17/12/2020
# A matrix (plural matrices) is a rectangular array or table of numbers, symbols, or expressions...
# ..., arranged in rows and columns (i.e.) a 2-dimensional array

# Similar to data.frame (RxC) and also similar to Vector
# Matrix - Element by element operations are possible

A = matrix(1:10, nrow=5) # Create a 5x2 matrix. nrow means no. of rows
B = matrix(21:30, nrow=5) # Create another 5x2 matrix
C = matrix(21:40, nrow=2) # Create a 2x10 matrix
A
B
C

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30

## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 21 23 25 27 29 31 33 35 37 39
## [2,] 22 24 26 28 30 32 34 36 38 40

nrow(A)

## [1] 5

ncol(A)

## [1] 2

dim(A)

## [1] 5 2

# Add Them
A+B

## [,1] [,2]
## [1,] 22 32
## [2,] 24 34
## [3,] 26 36
## [4,] 28 38
## [5,] 30 40

# Multiply them element by element (this is not matrix multiplication!)

A

## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10

B

## [,1] [,2]
## [1,] 21 26
## [2,] 22 27
## [3,] 23 28
## [4,] 24 29
## [5,] 25 30

A*B # A = 5x2 and B = 5x2

## [,1] [,2]
## [1,] 21 156
## [2,] 44 189
## [3,] 69 224
## [4,] 96 261
## [5,] 125 300

#See if the elements are equal


A == B

## [,1] [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
## [5,] FALSE FALSE

# Matrix multiplication of a 5x2 matrix by a 5x2 matrix is not possible:
# the number of columns of the first must equal the number of rows of the second.

# Matrix Multiplication (MM). A is 5x2. B is 5x2. B-transpose is 2x5


A %*% t(B) # %*% is used for matrix multiplication

## [,1] [,2] [,3] [,4] [,5]


## [1,] 177 184 191 198 205
## [2,] 224 233 242 251 260
## [3,] 271 282 293 304 315

## [4,] 318 331 344 357 370
## [5,] 365 380 395 410 425

# Naming the Columns and Rows


colnames(A)

## NULL

rownames(A)

## NULL

colnames(A)= c("Left","Right")
rownames(A)= c("1st","2nd","3rd","4th","5th")
colnames(B)

## NULL

rownames(B)

## NULL

colnames(B)= c("First","Second")
rownames(B)= c("One","Two","Three","Four","Five")
colnames(C)

## NULL

rownames(C)

## NULL

colnames(C) = LETTERS [1:10]


rownames(C) = c("Top", "Bottom")

# Matrix Multiplication. A is 5x2 and C is 2x10


dim(A)

## [1] 5 2

dim(C)

## [1] 2 10

t(A)

## 1st 2nd 3rd 4th 5th


## Left 1 2 3 4 5
## Right 6 7 8 9 10

A %*% C

## A B C D E F G H I J
## 1st 153 167 181 195 209 223 237 251 265 279
## 2nd 196 214 232 250 268 286 304 322 340 358
## 3rd 239 261 283 305 327 349 371 393 415 437
## 4th 282 308 334 360 386 412 438 464 490 516
## 5th 325 355 385 415 445 475 505 535 565 595
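The dimension rule behind the two products above can be checked directly before multiplying. A minimal sketch that rebuilds `A` and `B` without names and uses `tryCatch()` to capture the error from a non-conformable product instead of stopping the script:

```r
A = matrix(1:10, nrow = 5)    # 5x2
B = matrix(21:30, nrow = 5)   # 5x2

# %*% needs: columns of the left matrix == rows of the right matrix
ncol(A) == nrow(B)            # FALSE, so A %*% B is not defined
ncol(A) == nrow(t(B))         # TRUE, so A %*% t(B) works

dim(A %*% t(B))               # 5 5: (5x2) times (2x5)
dim(t(A) %*% B)               # 2 2: (2x5) times (5x2)

# Capture the error message from the invalid product
res = tryCatch(A %*% B, error = function(e) conditionMessage(e))
res                           # a message about non-conformable arguments
```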

4.4 Data Structure Array
Vaibhav Kumar

17/12/2020
# Arrays - An array is essentially a multidimensional vector.
# It must all be of the same type and
# ...individual elements are accessed using Square Brackets.
# First element is Row(R) Index, Second Element is Column(C) Index and
# the remaining elements are for Outer Dimensions (OD).

theArray = array(1:12, dim=c(2,3,2))# Total Elements = R x C x OD


theArray

## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12

theArray[1, ,] # Row 1, all columns, all outer dimensions; the result is a C x OD (3 x 2) matrix

## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
## [3,] 5 11

theArray[1, ,1] # Accessing all elements from Row 1, all columns, first outer dimension

## [1] 1 3 5

theArray[, ,1]# Accessing all rows, all columns, first outer dimension

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6

# Array with four levels in the outer dimension (OD); despite its name, theArray_4D is still three-dimensional


theArray_4D = array(1:32, dim=c(2,4,4))
theArray_4D

## , , 1
##

## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 9 11 13 15
## [2,] 10 12 14 16
##
## , , 3
##
## [,1] [,2] [,3] [,4]
## [1,] 17 19 21 23
## [2,] 18 20 22 24
##
## , , 4
##
## [,1] [,2] [,3] [,4]
## [1,] 25 27 29 31
## [2,] 26 28 30 32

theArray_4D[1, ,] # Row 1, all columns, all outer dimensions; the result is a C x OD (4 x 4) matrix

## [,1] [,2] [,3] [,4]


## [1,] 1 9 17 25
## [2,] 3 11 19 27
## [3,] 5 13 21 29
## [4,] 7 15 23 31

theArray_4D[1, ,1]

## [1] 1 3 5 7

theArray[, ,1]

## [,1] [,2] [,3]


## [1,] 1 3 5
## [2,] 2 4 6
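Both arrays above have three dimensions; `dim=c(2,4,4)` only makes the third dimension four levels deep. For comparison, here is a sketch of a genuinely four-dimensional array (the name `arr4` and its sizes are chosen only for illustration), which needs four indices inside the square brackets:

```r
# 2 rows x 3 columns x 4 levels x 2 levels = 48 elements
arr4 = array(1:48, dim = c(2, 3, 4, 2))

dim(arr4)                  # 2 3 4 2
length(dim(arr4))          # 4: the number of dimensions
arr4[1, 2, 3, 2]           # one element, selected with four indices: 39
dim(arr4[, , , 1])         # 2 3 4: fixing the last index leaves a 3-D array

# The array called theArray_4D in the notes has only three dimensions
length(dim(array(1:32, dim = c(2, 4, 4))))   # 3
```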

4.5. Data Structures List


Vaibhav Kumar

17/12/2020
# Lists - Stores any number of items of any type.
# List can contain all numerics or characters or...
#...a mix of the two or data.frames or recursively other lists.

# Lists are created with the "list" function.
# Each argument in "list" becomes an element of the list.

list(1,2,3)# creates a three element list

## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

list(c(1,2,3)) # creates a single-element list (the element is a vector with three elements)

## [[1]]
## [1] 1 2 3

list3 = list(c(1,2,3), 3:7)# create two element list


# first is three elements vector, next is five element vector.
list3

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7

# Wrapping the assignment in parentheses assigns and prints in one step


(list3 = list(c(1,2,3), 3:7))

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7

# Two Element list


# First element is data.frame and next is 10 element vector

x = 10:1
y = -4:5
q = c("Hockey","Football","Baseball","Curlin","Rugby","Lacrosse",
"Basketball","Tennis","Cricket","Soccer")
theDF = data.frame(x,y,q) # this would create a 10x3 data.frame with x, y and q as variable names
theDF

## x y q
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby

## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

str(theDF)# Very important - Str - Structure

## 'data.frame': 10 obs. of 3 variables:


## $ x: int 10 9 8 7 6 5 4 3 2 1
## $ y: int -4 -3 -2 -1 0 1 2 3 4 5
## $ q: chr "Hockey" "Football" "Baseball" "Curlin" ...

q = as.factor(q)

# Assigning Names
theDF = data.frame (First=x, Second =y, Sport = q)
theDF

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

# Checking the dimensions of the DF.


nrow(theDF)

## [1] 10

ncol(theDF)

## [1] 3

dim(theDF)

## [1] 10 3

names (theDF)

## [1] "First" "Second" "Sport"

names(theDF)[3]

## [1] "Sport"

rownames(theDF)

## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

P a g e | 25
# Head and Tail
head(theDF)# First 6 rows with all columns

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse

head(theDF, n=10)

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

tail(theDF)# last six rows with all columns

## First Second Sport


## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

class(theDF)

## [1] "data.frame"

# Accessing Individual Column using $


theDF$Sport # gives the third column named Sport

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ...
Tennis

# Accessing Specific row and column


theDF[3,2] # 3rd row and 2nd Column

## [1] -2

theDF[3,2:3] # 3rd Row and column 2 thru 3

## Second Sport
## 3 -2 Baseball

P a g e | 26
theDF[c(3,5), 2]# Row 3&5 from Column 2;

## [1] -2 0

# since only one column was selected, it was returned as vector and hence
no column names in output.

# Rows 3&5 and Columns 2 through 3


theDF[c(3,5), 2:3]

## Second Sport
## 3 -2 Baseball
## 5 0 Rugby

theDF[ ,3] # Access all Rows for column 3

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ...
Tennis

theDF[ , 2:3]

## Second Sport
## 1 -4 Hockey
## 2 -3 Football
## 3 -2 Baseball
## 4 -1 Curlin
## 5 0 Rugby
## 6 1 Lacrosse
## 7 2 Basketball
## 8 3 Tennis
## 9 4 Cricket
## 10 5 Soccer

theDF[2,]# Access all columns for Row 2

## First Second Sport


## 2 9 -3 Football

theDF[2:4,]

## First Second Sport


## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin

theDF[ , c("First", "Sport")]# access using Column Names

## First Sport
## 1 10 Hockey
## 2 9 Football
## 3 8 Baseball
## 4 7 Curlin
## 5 6 Rugby
## 6 5 Lacrosse
## 7 4 Basketball

P a g e | 27
## 8 3 Tennis
## 9 2 Cricket
## 10 1 Soccer

theDF[ ,"Sport"]# Access specific Column

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ...
Tennis

class(theDF[ ,"Sport"])

## [1] "factor"

theDF["Sport"]# This returns the one column data.frame

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF["Sport"]) # Data.Frame

## [1] "data.frame"

theDF[["Sport"]]#To access Specific column using Double Square Brackets

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ...
Tennis

class(theDF[["Sport"]]) # Factor

## [1] "factor"

theDF[ ,"Sport", drop = FALSE]# Use "Drop=FALSE" to get data.fame with


single sqaure bracket.

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis

P a g e | 28
## 9 Cricket
## 10 Soccer

class(theDF[ ,"Sport", drop = FALSE]) # data.frame

## [1] "data.frame"

theDF[ ,3, drop = FALSE]

## Sport
## 1 Hockey
## 2 Football
## 3 Baseball
## 4 Curlin
## 5 Rugby
## 6 Lacrosse
## 7 Basketball
## 8 Tennis
## 9 Cricket
## 10 Soccer

class(theDF[ ,3, drop = FALSE]) # data.frame

## [1] "data.frame"

# To see how factor is stored in data.frame


newFactor = factor(c("Pennsylvania","New York","New Jersey","New
York","Tennessee","Massachusetts","Pennsylvania","New York"))
newFactor

## [1] Pennsylvania New York New Jersey New York Tennessee

## [6] Massachusetts Pennsylvania New York


## Levels: Massachusetts New Jersey New York Pennsylvania Tennessee

# model.matrix(~newFactor -1)
# ? model.matrix()
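
# A minimal sketch of what the commented-out model.matrix() call would show, using a
# shortened, hypothetical copy of the states above (the names "states" and "indicators"
# are assumptions for illustration, not part of the exercise).

```r
# Hypothetical small factor (a shortened copy of the states above)
states <- factor(c("Pennsylvania", "New York", "New Jersey", "New York"))

# Internally a factor is an integer vector of level codes plus a levels attribute
as.integer(states)   # codes that index into levels(states)
levels(states)       # the unique values, sorted alphabetically

# model.matrix() expands the factor into one 0/1 indicator column per level;
# "- 1" drops the intercept so every level gets its own column
indicators <- model.matrix(~ states - 1)
indicators
```

# Each row of the result contains exactly one 1, in the column of that row's level.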
list(theDF, 1:10)# theDF is already created in previous exercise!

## [[1]]
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10

# Three element list
list5 = list(theDF, 1:10, list3)
list5

## [[1]]
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [[3]][[1]]
## [1] 1 2 3
##
## [[3]][[2]]
## [1] 3 4 5 6 7

#Naming List (similar to column name in data.frame)


names(list5)= c("data.frame", "vector","list")
names(list5)

## [1] "data.frame" "vector" "list"

list5

## $data.frame
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## $vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $list
## $list[[1]]
## [1] 1 2 3
##

## $list[[2]]
## [1] 3 4 5 6 7

#Naming using "Name-Value" pair


list6 = list(TheDataFrame = theDF, TheVector = 1:10, TheList = list3)
names(list6)

## [1] "TheDataFrame" "TheVector" "TheList"

list6

## $TheDataFrame
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## $TheVector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $TheList
## $TheList[[1]]
## [1] 1 2 3
##
## $TheList[[2]]
## [1] 3 4 5 6 7

# Creating an empty list


(emptylist = vector(mode="list", length =4))

## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL

# Accessing individual element of a list - Double Square Brackets


# specify either element number or name
list5[[1]]

## First Second Sport


## 1 10 -4 Hockey

## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

list5[["data.frame"]]

## First Second Sport


## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer

list5[[1]]$Sport

## [1] Hockey Football Baseball Curlin Rugby Lacrosse


## [7] Basketball Tennis Cricket Soccer
## 10 Levels: Baseball Basketball Cricket Curlin Football Hockey ... Tennis

list5[[1]][,"Second"]

## [1] -4 -3 -2 -1 0 1 2 3 4 5

list5[[1]][,"Second", drop = FALSE]

## Second
## 1 -4
## 2 -3
## 3 -2
## 4 -1
## 5 0
## 6 1
## 7 2
## 8 3
## 9 4
## 10 5
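
# The access examples above all use double square brackets. As a sketch of how single
# brackets differ on a list, here is a small hypothetical list (the name "demo" and its
# contents are assumptions for illustration only).

```r
# Hypothetical list for illustration
demo <- list(frame = data.frame(a = 1:3), nums = 1:10)

sub_list <- demo["frame"]    # single brackets: a list of length 1
element  <- demo[["frame"]]  # double brackets: the data.frame itself

class(sub_list)
class(element)

# Elements can also be added by name with double brackets
demo[["extra"]] <- c("x", "y")
length(demo)
```

# This mirrors the data.frame behaviour seen earlier, where theDF["Sport"] keeps the container and theDF[["Sport"]] returns the column.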

# LENGTH OF LIST
length(list5)

## [1] 3

names(list5)

## [1] "data.frame" "vector" "list"

list5

## $data.frame
## First Second Sport
## 1 10 -4 Hockey
## 2 9 -3 Football
## 3 8 -2 Baseball
## 4 7 -1 Curlin
## 5 6 0 Rugby
## 6 5 1 Lacrosse
## 7 4 2 Basketball
## 8 3 3 Tennis
## 9 2 4 Cricket
## 10 1 5 Soccer
##
## $vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $list
## $list[[1]]
## [1] 1 2 3
##
## $list[[2]]
## [1] 3 4 5 6 7

4.6. Reading data in R


Vaibhav Kumar

18/12/2020
# It's time to load data into R.
# The most common way to get data is to read comma-separated values (CSV) files.

# Reading CSVs
#theUrl = "https://www.jaredlander.com/data/RetailFood.csv"
# visit https://www.jaredlander.com/data/ for other Datasets
#RetailFood = read.table(file=theUrl, header=TRUE, sep =",") # sep="," because values are separated by ","; header=TRUE reads the first row as column names
#head(RetailFood) # to read first 6 rows with all columns

#We can also use read.csv instead of read.table. It works on any comma-separated text file, whatever its extension.
#It might be tempting to use read.csv, but all it does is call read.table with some arguments preset (header = TRUE, sep = ",").
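
# A small sketch comparing read.table and read.csv on the same file. The file is a
# temporary one written on the spot, and its contents ("item", "price") are made up for
# illustration; a ".txt" extension is used on purpose.

```r
# Write a small hypothetical data set to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("item,price", "apple,1.5", "bread,2.25"), tmp)

via_table <- read.table(file = tmp, header = TRUE, sep = ",")
via_csv   <- read.csv(file = tmp)   # header = TRUE and sep = "," are preset

all.equal(via_table, via_csv)       # both calls should give the same data.frame
unlink(tmp)                         # remove the temporary file
```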

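# Sometimes CSVs (or tab-delimited files) are poorly built,
# with the cell separator used inside a cell. Such cells must be quoted for read.table to parse them correctly.
# read.csv2 (or read.delim2) covers a different case: files that use "," as the decimal mark
# (and, for read.csv2, ";" as the separator).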

# Reading Excel Data - Not worth the Effort.


# Base R has no function for reading Excel files; an additional package (for example readxl) must be installed.
# The simplest alternative is to convert the sheet into CSV and read that.

# Reading Text Files


#Garden = read.table("C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\data.txt",header=TRUE,sep="") # sep="" means values are separated by whitespace
#head(Garden)
# We cannot use a single "\" in the path because it is the escape character in R strings.
# Instead of "\" we can use "/" or "\\".

#R Binary Files
# save the RetailFood data.frame to disk
#save(RetailFood, file="C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\RetailFood.rdata")
# remove RetailFood from memory
#rm(RetailFood)
# check whether it still exists (this now gives an "object not found" error)
#head(RetailFood)
# read it back from the rdata file
#load("C:\\Users\\Vaibhav\\Documents\\R\\Data Structure\\RetailFood.rdata")
#head(RetailFood)
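
# Because the calls above are commented out, here is a runnable sketch of the same
# save / rm / load cycle. It uses a hypothetical data.frame ("exam_scores") and a file in
# R's temporary directory instead of the RetailFood data and the Windows path.

```r
exam_scores <- data.frame(id = 1:3, score = c(90, 85, 77))  # hypothetical data
path <- file.path(tempdir(), "exam_scores.rdata")

save(exam_scores, file = path)           # write the object, name included, to disk
rm(exam_scores)                          # remove it from memory
exists("exam_scores", inherits = FALSE)  # check whether it still exists

load(path)                               # restores the object under its original name
head(exam_scores)
```

# Note that load() does not need an assignment: the object comes back with the name it was saved under.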

# Read data from anywhere in the Disk/Computer


# myData = read.csv(file.choose()) # No working directory setup is needed.
# but if we use file.choose, there are issues with header.

4.7 Built in Data in R


Vaibhav Kumar

18/12/2020
# R has various packages, which need to be installed, and many of them also contain data sets
# Built-in datasets in R
# for example
# data()# List of built-in Datasets in R. Open in different tab.

# Loading
# data(mtcars)
# Print the first 6 rows
# head(mtcars, 6)

5.1 Basic statistics – Mean, Median
Vaibhav Kumar

18/12/2020
# Basic Statistics - Mean, Variances,Correlations and T-tests

# Generate a random sample of 20 numbers between 1 and 100


x = sample(x=1:100, size = 20, replace = TRUE) # replace = TRUE allows repeated values; replace = FALSE gives only unique values
x # the output of "x" is a vector of data

## [1] 96 71 18 26 15 10 18 39 28 68 58 13 25 38 57 95 60 22 89 93

# Simple Arithmetic Mean


mean(x)

## [1] 46.95

# Calculate Mean when Missing Data is found


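y = x # copy x to y
y = sample(x=1:100, size = 20, replace = FALSE) # overwrites y with 20 unique values; the step that would set entries to NA is commented out, so y has no missing values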
y

## [1] 84 57 86 18 12 48 23 93 69 20 29 70 7 66 82 76 65 99 67 56

y = sample(x=1:100, size = 20, replace = FALSE)


y

## [1] 49 63 30 73 61 53 72 38 94 70 41 33 60 58 75 47 95 11 39 46

mean(y)# Would return NA if y contained a missing value (NA); this y has none, so the mean is returned

## [1] 55.4

# Remove missing value(s) and calculate mean


mean(y, na.rm=TRUE) # na.rm = TRUE drops NA values first; same result here because y has no NA

## [1] 55.4
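
# The y used above happens to contain no missing values. A minimal sketch with a
# hypothetical vector that does contain one (the name "v" and its values are assumptions
# for illustration):

```r
v <- c(10, 20, NA, 40)    # hypothetical vector with one missing value

mean(v)                   # NA: a single missing value makes the result missing
mean(v, na.rm = TRUE)     # mean of 10, 20 and 40 only
sum(is.na(v))             # number of missing values in v
```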

# Weighted Mean
Grades = c(65,90,54,78)
Weights = c(1/8, 1/8, 1/4, 1/2)
mean(Grades)# Simple Arithmetic mean

## [1] 71.75

weighted.mean(x = Grades, w = Weights)# Weighted Mean

## [1] 71.875
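
# As a check on weighted.mean(), the same value can be computed from the formula
# sum(w * x) / sum(w). The name "by_formula" is only for illustration.

```r
Grades  <- c(65, 90, 54, 78)
Weights <- c(1/8, 1/8, 1/4, 1/2)

# Weighted mean = sum(w * x) / sum(w)
by_formula <- sum(Weights * Grades) / sum(Weights)
by_formula
all.equal(by_formula, weighted.mean(x = Grades, w = Weights))
```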

#Variance
var(x)

## [1] 909.4184

#Calculating Variance using formula!
sum((x-mean(x))^2)/ (length(x)-1)

## [1] 909.4184

# Standard Deviation
sqrt(var(x)) #square root of variance

## [1] 30.15657

sd(x)

## [1] 30.15657

sd(y)

## [1] 21.10226

sd(y, na.rm=TRUE) # to remove NA and calculate standard deviation

## [1] 21.10226

# Other Commonly Used Functions


min(x)

## [1] 10

max(x)

## [1] 96

median(x)

## [1] 38.5

min(y)

## [1] 11

min(y, na.rm=TRUE)

## [1] 11

# Summary Statistics
summary(x) # it provides min, max, median, mean, 1st qu. and 3rd qu.

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 10.00 21.00 38.50 46.95 68.75 96.00

summary(y) # the same values a box plot displays (min, quartiles, median, max), plus the mean

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 11.0 40.5 55.5 55.4 70.5 95.0

# Quantiles
quantile(x, probs = c(0.25, 0.75)) # Calculate 25th and 75th Quantile

## 25% 75%
## 21.00 68.75

quantile(x, probs = c(0.1,0.25,0.5, 0.75,0.99)) # quantiles at the specified probabilities

## 10% 25% 50% 75% 99%


## 14.80 21.00 38.50 68.75 95.81

quantile(y, probs = c(0.25, 0.75)) # Calculate 25th and 75th Quantile

## 25% 75%
## 40.5 70.5

quantile(y, probs = c(0.25, 0.75), na.rm = TRUE)

## 25% 75%
## 40.5 70.5
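
# A sketch of how the 25% value of x can be reproduced by hand. It assumes quantile()'s
# default method (type = 7, linear interpolation between sorted values) and re-enters the
# values of x printed earlier; the helper names are only for illustration.

```r
# Values of x as printed earlier
x <- c(96, 71, 18, 26, 15, 10, 18, 39, 28, 68,
       58, 13, 25, 38, 57, 95, 60, 22, 89, 93)

p  <- 0.25
xs <- sort(x)
h  <- (length(xs) - 1) * p + 1   # fractional position in the sorted data
lo <- floor(h)                   # index of the value just below that position
by_hand <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

by_hand
unname(quantile(x, probs = p))   # default method (type = 7)
```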

# Correlation and Covariance


#install.packages("ggplot2")
library(ggplot2)# require(ggplot2)
head(economics)# Built-in dataset in ggplot2 package

## # A tibble: 6 x 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
## 4 1967-10-01 512. 199311 12.9 4.9 3143
## 5 1967-11-01 517. 199498 12.8 4.7 3066
## 6 1967-12-01 525. 199657 11.8 4.8 3018

cor(economics$pce, economics$psavert) # pce - Personal Consumption Expenditure; psavert - Personal Savings Rate

## [1] -0.7928546
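
# The section title mentions covariance, so here is a sketch of how the two are related:
# correlation is covariance divided by the product of the standard deviations. The
# built-in mtcars data is used (an assumption, so that ggplot2 is not needed).

```r
a <- mtcars$mpg   # mtcars ships with R (datasets package)
b <- mtcars$wt

cov(a, b)                                   # covariance, in units of mpg * wt
by_formula <- cov(a, b) / (sd(a) * sd(b))   # rescale to the range -1..1
by_formula
all.equal(by_formula, cor(a, b))
```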

# To compare correlation for Multiple variables


cor(economics[, c(2,4:6)]) #correlation between 2,4,5,6 columns

## pce psavert uempmed unemploy


## pce 1.0000000 -0.7928546 0.7269616 0.6145176
## psavert -0.7928546 1.0000000 -0.3251377 -0.3093769
## uempmed 0.7269616 -0.3251377 1.0000000 0.8693097
## unemploy 0.6145176 -0.3093769 0.8693097 1.0000000

# Display Correlation in Different Format!

# Let's install the required packages and load them into this R session.

# Load the "reshape2" package


#install.packages("reshape2")
require(reshape2)

## Loading required package: reshape2

# Also load the Scales package for some extra plotting features
#install.packages("scales")
library(scales)

econCor = cor(economics [ , c(2,4:6)])


# use "melt()" to change into long format
#?melt() # Help on melt function
econMelt = melt(econCor, varnames = c("x" ,"y"), value.name =
"Correlation")
# Order it according to correlation
econMelt = econMelt[order(econMelt$Correlation),]
# Display the melted data
econMelt

## x y Correlation
## 2 psavert pce -0.7928546
## 5 pce psavert -0.7928546
## 7 uempmed psavert -0.3251377
## 10 psavert uempmed -0.3251377
## 8 unemploy psavert -0.3093769
## 14 psavert unemploy -0.3093769
## 4 unemploy pce 0.6145176
## 13 pce unemploy 0.6145176
## 3 uempmed pce 0.7269616
## 9 pce uempmed 0.7269616
## 12 unemploy uempmed 0.8693097
## 15 uempmed unemploy 0.8693097
## 1 pce pce 1.0000000
## 6 psavert psavert 1.0000000
## 11 uempmed uempmed 1.0000000
## 16 unemploy unemploy 1.0000000

# Let's Visualize Correlation


## Plot it with ggplot
# Initialize the plot with x and y on the respective axes
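ggplot(econMelt, aes(x=x, y=y)) +
  # draw tiles filled according to correlation
  geom_tile(aes(fill = Correlation)) +
  # three-color gradient from -1 to 1, with a tall color bar and no tick marks
  scale_fill_gradient2(low = muted("red"), mid = "white", high = "steelblue",
    guide = guide_colorbar(ticks=FALSE, barheight=10), limits=c(-1,1)) +
  theme_minimal() +
  labs(x = NULL, y = NULL)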

5.2. Correlation Heatmap
Vaibhav Kumar

18/12/2020
# Correlation

# Prepare the Data


mydata <- mtcars[, c(1,3,4,5,6,7)]
head(mydata)

## mpg disp hp drat wt qsec


## Mazda RX4 21.0 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 160 110 3.90 2.875 17.02
## Datsun 710 22.8 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 360 175 3.15 3.440 17.02
## Valiant 18.1 225 105 2.76 3.460 20.22

# Compute the correlation matrix - cor()


cormat <- round(cor(mydata),2)
head(cormat)

## mpg disp hp drat wt qsec


## mpg 1.00 -0.85 -0.78 0.68 -0.87 0.42
## disp -0.85 1.00 0.79 -0.71 0.89 -0.43

## hp -0.78 0.79 1.00 -0.45 0.66 -0.71
## drat 0.68 -0.71 -0.45 1.00 -0.71 0.09
## wt -0.87 0.89 0.66 -0.71 1.00 -0.17
## qsec 0.42 -0.43 -0.71 0.09 -0.17 1.00

# Create the correlation heatmap with ggplot2


# The reshape2 package is required to melt the correlation matrix.
library(reshape2)
melted_cormat <- melt(cormat)
head(melted_cormat)

## Var1 Var2 value


## 1 mpg mpg 1.00
## 2 disp mpg -0.85
## 3 hp mpg -0.78
## 4 drat mpg 0.68
## 5 wt mpg -0.87
## 6 qsec mpg 0.42
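
# If reshape2 is not available, a similar long format can be sketched in base R. This is
# an alternative for illustration, not the method used in this exercise; the names "cm"
# and "long" are assumptions.

```r
cm <- round(cor(mtcars[, c(1, 3, 4, 5, 6, 7)]), 2)

# as.table() followed by as.data.frame() gives one row per matrix cell
long <- as.data.frame(as.table(cm), responseName = "value")
head(long)
```

# The result has the same three columns (Var1, Var2, value) and one row for each of the 36 cells.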

# The function geom_tile() [ggplot2 package] is used to visualize the correlation matrix:
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile()

# Does not look great. Let's enhance the visualization!

#Get the lower and upper triangles of the correlation matrix


## A correlation matrix has redundant information. We'll use the functions below to set half of it to NA.

# Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}

upper_tri <- get_upper_tri(cormat)


upper_tri

## mpg disp hp drat wt qsec


## mpg 1 -0.85 -0.78 0.68 -0.87 0.42
## disp NA 1.00 0.79 -0.71 0.89 -0.43
## hp NA NA 1.00 -0.45 0.66 -0.71
## drat NA NA NA 1.00 -0.71 0.09
## wt NA NA NA NA 1.00 -0.17
## qsec NA NA NA NA NA 1.00

# Finished correlation matrix heatmap


## Melt the correlation data and drop the rows with NA values
# Melt the correlation matrix
library(reshape2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()

# Negative correlations are shown in blue and positive correlations in red.
# The function scale_fill_gradient2 is used with the argument limit = c(-1,1) because correlation coefficients range from -1 to 1.
# coord_fixed(): this function ensures that one unit on the x-axis is the same length as one unit on the y-axis.

# Reorder the correlation matrix

# This section describes how to reorder the correlation matrix according to the correlation coefficient.
# This is useful to identify the hidden pattern in the matrix.
# hclust for hierarchical clustering order is used in the example below.

reorder_cormat <- function(cormat){
  # Use correlation between variables as distance
  dd <- as.dist((1-cormat)/2)
  hc <- hclust(dd)
  # Apply the clustering order to both rows and columns
  cormat <- cormat[hc$order, hc$order]
  return(cormat)
}
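
# A step-by-step sketch of what the reordering does, using only base R. The names "cm"
# and "reordered" are assumptions for illustration; the correlation matrix is recomputed
# from mtcars so the example stands on its own.

```r
cm <- cor(mtcars[, c(1, 3, 4, 5, 6, 7)])

dd <- as.dist((1 - cm) / 2)   # correlation 1 -> distance 0, correlation -1 -> distance 1
hc <- hclust(dd)              # hierarchical clustering (complete linkage by default)
hc$order                      # a permutation of the original row/column positions

reordered <- cm[hc$order, hc$order]   # same values, rows and columns rearranged
rownames(reordered)
```

# Using the same permutation for rows and columns keeps the matrix symmetric with 1s on the diagonal, while placing strongly correlated variables next to each other.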

# Reorder the correlation matrix


cormat <- reorder_cormat(cormat)
upper_tri <- get_upper_tri(cormat)
# Melt the correlation matrix
melted_cormat <- melt(upper_tri, na.rm = TRUE)
# Create a ggheatmap
ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
    midpoint = 0, limit = c(-1,1), space = "Lab",
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
  theme(axis.text.x = element_text(angle = 45, vjust = 1,
    size = 12, hjust = 1))+
  coord_fixed()
# Print the heatmap
print(ggheatmap)

#Add correlation coefficients on the heatmap

## Use geom_text() to add the correlation coefficients on the graph


## Use a blank theme (remove axis labels, panel grids and background, and axis ticks)
## Use guides() to change the position of the legend title

ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))

5.3. Hypothesis Testing
Vaibhav Kumar

18/12/2020
# T-tests
# Dataset: Tips dependents on...
data(tips, package = "reshape2")
head(tips)

## total_bill tip sex smoker day time size


## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4

str(tips) # to find out no. of levels

## 'data.frame': 244 obs. of 7 variables:


## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...

write.csv(tips, "C:\\Users\\Vaibhav\\Documents\\R\\Basic statistics\\tips.csv", row.names = FALSE) # save tips as a CSV file on disk (it can be opened in Excel)

# Gender
unique(tips$sex) # levels

## [1] Female Male


## Levels: Female Male

#Day of the week


unique(tips$day) # levels

## [1] Sun Sat Thur Fri


## Levels: Fri Sat Sun Thur

#One Sample t-test - ONE GROUP [Two Tail. Ho:Mean = 2.5]


t.test(tips$tip, alternative = "two.sided", mu=2.5)

##
## One Sample t-test
##
## data: tips$tip
## t = 5.6253, df = 243, p-value = 5.08e-08
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
## 2.823799 3.172758
## sample estimates:
## mean of x
## 2.998279

# Result - p value is less than 0.05, so we reject the null hypothesis. The mean is not equal to 2.5.
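
# A sketch of where the t value and p-value of a one-sample test come from. The tips data
# needs reshape2, so the built-in mtcars$mpg is used instead, with a hypothetical null
# mean of 20; both choices are assumptions for illustration.

```r
values <- mtcars$mpg   # built-in data, used here only for illustration
mu0    <- 20           # hypothetical null-hypothesis mean

n      <- length(values)
t_stat <- (mean(values) - mu0) / (sd(values) / sqrt(n))   # distance from mu0 in standard errors
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)                # two-sided p-value

c(t = t_stat, df = n - 1, p = p_two)
t.test(values, alternative = "two.sided", mu = mu0)       # reports the same t, df and p-value
```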

#One Sample t-test - Upper Tail. Ho:Mean LE 2.5


t.test(tips$tip, alternative = "greater", mu=2.5)

##
## One Sample t-test
##
## data: tips$tip
## t = 5.6253, df = 243, p-value = 2.54e-08
## alternative hypothesis: true mean is greater than 2.5
## 95 percent confidence interval:
## 2.852023 Inf
## sample estimates:
## mean of x
## 2.998279

# Result - p value is less than 0.05, so we reject the null hypothesis (mean <= 2.5). The mean is greater than 2.5.
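
# In the two outputs above, the upper-tail p-value (2.54e-08) is half the two-sided one
# (5.08e-08). A sketch of that relationship on the built-in mtcars$mpg, with a
# hypothetical null mean of 19 chosen to lie below the sample mean:

```r
values <- mtcars$mpg   # built-in data, used here only for illustration
mu0    <- 19           # hypothetical null-hypothesis mean, below the sample mean

p_two   <- t.test(values, alternative = "two.sided", mu = mu0)$p.value
p_upper <- t.test(values, alternative = "greater",   mu = mu0)$p.value

# When the sample mean lies above mu0, the upper-tail p-value is half the two-sided one
all.equal(p_upper, p_two / 2)
```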

# Two Sample T-test - TWO GROUP

t.test(tip ~ sex, data = tips, var.equal = TRUE) # Male and Female are independent groups; the variances of both are assumed equal.

##
## Two Sample t-test
##
## data: tip by sex
## t = -1.3879, df = 242, p-value = 0.1665
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6197558 0.1074167
## sample estimates:
## mean in group Female mean in group Male
## 2.833448 3.089618

# Result - p value is greater than 0.05, so we fail to reject the null hypothesis. There is no significant evidence that the average tips given by males and females differ.

#Paired Two-Sample T-Test


# Dataset: Heights of Father and Son (Package:UsingR)

# used when samples are dependent


#install.packages("UsingR")
require(UsingR)

## Loading required package: UsingR

## Loading required package: MASS

## Loading required package: HistData

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

##
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':


##
## format.pval, units

##
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':


##
## cancer

head(father.son)

## fheight sheight
## 1 65.04851 59.77827
## 2 63.25094 63.21404
## 3 64.95532 63.34242
## 4 65.75250 62.79238
## 5 61.13723 64.28113
## 6 63.02254 64.24221

write.csv(father.son, "C:\\Users\\Vaibhav\\Documents\\R\\Basic statistics\\father_son.csv", row.names = FALSE)
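
# The paired test itself is not run above. A sketch of the technique follows, assuming the
# built-in sleep data (10 subjects, each measured under two drugs) as a stand-in so that
# UsingR is not needed; the call for father.son is shown as a comment only.

```r
# With UsingR loaded, the call for the data above would be:
# t.test(father.son$fheight, father.son$sheight, paired = TRUE)

# Same technique on the built-in sleep data
drug1 <- sleep$extra[sleep$group == "1"]
drug2 <- sleep$extra[sleep$group == "2"]

paired   <- t.test(drug1, drug2, paired = TRUE)
on_diffs <- t.test(drug1 - drug2, mu = 0)   # one-sample test on the pairwise differences

paired
all.equal(paired$p.value, on_diffs$p.value)
```

# A paired test is a one-sample t-test on the differences within each pair, which is why it suits dependent samples.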

#ANOVA – Analysis of variance

#ANOVA - Comparing Multiple Groups


# Tip by the Day of the Week
str(tips)

## 'data.frame': 244 obs. of 7 variables:


## $ total_bill: num 17 10.3 21 23.7 24.6 ...
## $ tip : num 1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
## $ sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoker : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ time : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
## $ size : int 2 3 3 2 4 4 2 4 2 2 ...

tipAnova = aov(tip ~ day, tips) # to test whether the mean tip is the same on all days
summary(tipAnova)

##              Df Sum Sq Mean Sq F value Pr(>F)
## day           3    9.5   3.175   1.672  0.174
## Residuals   240  455.7   1.899

# Df for day is c - 1 and residual Df is (number of samples) - c. With c = 4 days and 244 samples, that gives 3 and 240.
# F value = Mean Sq(day) / Mean Sq(Residuals) = 3.175/1.899 = 1.672.
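
# A sketch of the same F calculation done by hand from sums of squares. The built-in
# PlantGrowth data (30 plants in 3 groups) is assumed as a stand-in for tips, which needs
# reshape2; the variable names are only for illustration.

```r
y <- PlantGrowth$weight   # response
g <- PlantGrowth$group    # grouping factor

n <- length(y)            # number of samples
k <- nlevels(g)           # number of groups (c in the note above)

# Between-group sum of squares: group size times squared distance of group mean from grand mean
ss_between <- sum(tapply(y, g, length) * (tapply(y, g, mean) - mean(y))^2)
# Within-group (residual) sum of squares: ave() returns each value's group mean
ss_within  <- sum((y - ave(y, g))^2)

f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))
f_by_hand
summary(aov(weight ~ group, data = PlantGrowth))   # F value should match
```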

6. Learning from Assignment

• Understand R coding basics.
• Benefits of R over other languages.
• Understand the use of vectors in R.
• Learn to read data from external sources.
• Understand the different types of data structure (data frame, matrix, array and list).
• Understand how to use the various data sets available in R.
• Understand null and alternative hypotheses.
• Understand how to use the t-test and ANOVA in R.
