Project 4 - Cars Dataset
PREDICTING MODE OF TRANSPORT
ASAAJU BABATUNDE
PROJECT BACKGROUND
Objective
The objective of this project is to understand which mode of transport the employees of a particular
organisation prefer for commuting to the workplace; that is, to predict whether or not an employee will
use a car as the mode of transport.
This requires exploring the available dataset, "Cars-dataset", in R for exploratory analysis,
description, and prediction. The dataset includes each employee's mode of transport as well as
personal and professional details such as age, salary, and work experience.
Assumptions
Dataset file format is in csv format.
Response variable is a factor data type with 2 levels
Dataset has no missing data values
Expectations
Exploratory Data Analysis - EDA
Perform an EDA on the data
Illustrate the insights based on EDA
What is the most challenging aspect of this problem? What method will you use to deal with
this? Comment (3 marks)
Data Preparation
Prepare the data for analysis
Modeling
Create multiple models and explore how each model performs using appropriate model
performance metrics
o KNN
o Naive Bayes (is it applicable here? comment and if it is not applicable, how can
you build an NB model in this case?)
o Logistic Regression
Apply both bagging and boosting modelling procedures to create 2 models and compare their
accuracy with the best model from the step above.
Actionable Insights & Recommendations
Summarize your findings from the exercise in a concise yet actionable note
EXPLORATORY DATA ANALYSIS
SETUP WORKING DIRECTORY
Setting a working directory at the start of the R session makes importing and exporting data files and
code files easier. The working directory is simply the location/folder on the PC where I keep the
dataset related to the project.
The path on my computer is C:/Users/OLIVIA/Desktop/DSBA/Dataset
Source code
setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")
Source code
cars_data = read.csv("Cars-dataset.csv", header = T)
VARIABLE IDENTIFICATION
R Function   Purpose
dim          Displays the total number of observations and the dimensions
names        Lists all the variables
head         Views the top observations, e.g. to spot possible missing values
tail         Views the bottom observations, e.g. to spot possible missing values
str          Shows the data type and structure of the variables
summary      Gives an overview of measures of centrality, distribution and dispersion
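For completeness, a minimal sketch of the first three calls from the table (the head, tail and summary calls follow below):
dim(cars_data)    # number of observations and variables
names(cars_data)  # variable names
str(cars_data)    # data type and structure of each variable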
head(cars_data)
tail(cars_data)
summary(cars_data)
##       Age          Gender       Engineer          MBA
##  Min.   :18.00   Female:121   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:25.00   Male  :297   1st Qu.:0.2500   1st Qu.:0.0000
##  Median :27.00                Median :1.0000   Median :0.0000
##  Mean   :27.33                Mean   :0.7488   Mean   :0.2614
##  3rd Qu.:29.00                3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :43.00                Max.   :1.0000   Max.   :1.0000
##                                                NA's   :1
##     Work_Exp          Salary          Distance        License       Transport
##  Min.   : 0.000   Min.   : 6.500   Min.   : 3.20   Min.   :0.0000   0:383
##  1st Qu.: 3.000   1st Qu.: 9.625   1st Qu.: 8.60   1st Qu.:0.0000   1: 35
##  Median : 5.000   Median :13.000   Median :10.90   Median :0.0000
##  Mean   : 5.873   Mean   :15.418   Mean   :11.29   Mean   :0.2033
##  3rd Qu.: 8.000   3rd Qu.:14.900   3rd Qu.:13.57   3rd Qu.:0.0000
##  Max.   :24.000   Max.   :57.000   Max.   :23.40   Max.   :1.0000
##
colSums(is.na(cars_data))
Inferences:
1. The dataset has 418 observations of 9 variables (Age, Gender, Engineer, MBA,
Work_Exp, Salary, Distance, License and Transport).
2. Out of the 418 employees, 121 are female and 297 are male.
VISUALIZATION
table(cars_data$Transport)
##
## 0 1
## 383 35
prop.table(table(cars_data$Transport))
##
## 0 1
## 0.91626794 0.08373206
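A simple bar chart makes this imbalance visible. The report's original figure did not survive extraction; a minimal base-R sketch (colours and labels are my own choices):
# visualize the class imbalance in the target variable
barplot(table(cars_data$Transport),
        names.arg = c("Other (0)", "Car (1)"),
        col = c("steelblue", "tomato"),
        main = "Mode of Transport")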
Inference:
1. Employees choosing car as a means of transport: 35 (8.37%) - the minority class.
2. Employees choosing other means of transport (2-wheeler/public transport): 383 (91.63%) - the majority class.
UNIVARIATE ANALYSIS
(Histograms and boxplots of the individual variables are shown in the appendix.)
Inference:
1. Age data is right-skewed, with outliers.
2. Engineer and MBA hold only 0 and 1, i.e. categorical data.
3. Work_Exp and Salary data are right-skewed and contain outliers.
4. License holds only 0 and 1, i.e. categorical data.
5. Distance data is almost symmetric, with a few outliers.
BIVARIATE ANALYSIS
cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)
corrplot(cars_corr)
Inference: Age, Work_Exp, Salary and Transport are positively correlated.
DATA PREPARATION
cars_data[is.na(cars_data)] = 0  # replace the single NA (in MBA) with 0, the modal value
sapply(cars_data, function(x) sum(is.na(x)))
Inference: the single missing value in MBA has been imputed; no missing values remain.
OUTLIERS TREATMENT
Boxplot method was used.
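As an illustration, the values the boxplot method flags (points beyond 1.5 * IQR from the quartiles) can be listed with boxplot.stats(); a minimal sketch:
boxplot.stats(cars_data$Age)$out       # outlying ages
boxplot.stats(cars_data$Work_Exp)$out  # outlying work experience
boxplot.stats(cars_data$Salary)$out    # outlying salaries
boxplot.stats(cars_data$Distance)$out  # outlying distances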
Inference: no treatment of these outliers is required, because none of the values is impossible or an obvious typo (see appendix).
VARIABLE TRANSFORMATION
cars_data$Gender = as.factor(cars_data$Gender)
cars_data$Engineer = as.factor(cars_data$Engineer)
cars_data$MBA = as.factor(cars_data$MBA)
cars_data$License = as.factor(cars_data$License)
cars_data$Transport = as.factor(cars_data$Transport)
DATA SPLITTING
set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)
dim(train_data)
## [1] 292 9
dim(test_data)
## [1] 126 9
Inference:
The proportion of the target variable is spread well across the train and test
datasets: 292 observations (70%) in train and 126 observations (30%) in test.
MODEL BUILDING
logr_train = cars_balanced  # the SMOTE-balanced training set (see appendix)
logr_test = test_data
logr_model = glm(Transport~ ., data = logr_train, family = binomial)
See appendix
Inference: in the full model no independent variable is significant, which suggests multicollinearity; after dropping Age, Work_Exp and Salary (VIF > 5), the remaining predictors become significant (see appendix).
APPLYING KNN
See appendix
Inference: the KNN model has 97.9% accuracy on the train dataset and 95.2% on the test dataset.
APPLYING NAIVE BAYES
See appendix
Inference: the Naive Bayes model has 96.4% accuracy on the train dataset and 96.8% on the test dataset.
APPLYING BAGGING
See appendix
Inference:
Observation:
The percentage of employees using a car is 22% in the (SMOTE-balanced) train data.
The percentage of employees using a car is 8.7% in the test data.
APPLYING BOOSTING
*See appendix - the boosting model ran successfully in the R script (Markdown) but could not be
knitted into Word, so the code is commented out.
Inference:
ACTIONABLE INSIGHTS AND RECOMMENDATIONS
Boosting predicts the employees who use other modes of transport 100% correctly.
Observations
Age (older employees), Salary (higher pay) and Work Experience (longer service) are highly
influential factors in the choice of a car.
Appendix
# remove a stale package-install lock left behind by an interrupted install
unlink("C:/Users/OLIVIA/Documents/R/win-library/3.6/00LOCK", recursive = TRUE)
library(MASS)
library(dummies)
library(ggplot2)
library(caret)
library(Information)
library(caTools)
library(ROCR)
library(dplyr)
library(tidyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(factoextra)
library(e1071)
library(lattice)
library(mice)
library(xgboost)
library(class)
library(gbm)
library(ipred)
library(rpart)
library(DMwR)
library(rms)
library(forcats)
WORKING DIRECTORY SETUP
setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")
IMPORT DATA
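The import call itself did not survive the knit; per the main text it is:
cars_data = read.csv("Cars-dataset.csv", header = T)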
CHANGE COLUMN NAMES
names(cars_data)
new_vars = c("Age","Gender","Engineer","MBA","Work_Exp","Salary","Distance","License","Transport")
colnames(cars_data) = new_vars
EXPLORATORY DATA ANALYSIS - EDA
dim(cars_data)
## [1] 418   9
INFERENCE: DATASET HAS 418 OBSERVATIONS FOR 9 VARIABLES (AGE, GENDER, ENGINEER,
MBA, WORK_EXP, SALARY, DISTANCE, LICENSE AND TRANSPORT)
str(cars_data) #View class of each feature along with the internal structure
CONVERTING TARGET VARIABLE TO A 2-LEVEL FACTOR: 1 FOR CAR AND 0 FOR OTHERS
(2-WHEELER AND PUBLIC TRANSPORT)
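The conversion code was lost in knitting; a sketch, assuming the raw Transport column carries labels such as "2Wheeler", "Public Transport" and "Car":
# 1 = Car, 0 = any other mode of transport (assumed raw labels)
cars_data$Transport = as.factor(ifelse(cars_data$Transport == "Car", 1, 0))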
str(cars_data)
MORE PRELIMINARY ANALYSIS
head(cars_data)
tail(cars_data)
summary(cars_data)
colSums(is.na(cars_data))
## Inference:
# 1. Out of the 418 employees, 121 are female and 297 are male.
# 2. MBA has a missing value.
# 3. Engineer, MBA and License are numeric but hold only 1 and 0.
# 4. Outliers seem to be present in Age, Work_Exp, Salary and Distance.
#VISUALIZATION
table(cars_data$Transport)
##
## 0 1
## 383 35
prop.table(table(cars_data$Transport))
##
## 0 1
## 0.91626794 0.08373206
## Inference:
# 1. Employees choosing car as means of transport: 35 (8.37%) - minority sample data.
# 2. Employees choosing other means of transport (2-wheeler/public transport): 383 (91.63%) - majority sample data.
#Univariate Analysis - Numeric Variables
hist(cars_data$Engineer, col = "purple", main = "Engineer")
boxplot(cars_data$Age, horizontal = T, col = "red", xlab = "Age")
boxplot(cars_data$MBA, horizontal = T, col = "blue", xlab = "MBA")
hist(cars_data$Salary, col = "purple", main = "Salary")
boxplot(cars_data$Salary, horizontal = T, col = "purple", xlab = "Salary")
hist(cars_data$Distance, col = "orange", main = "Distance")
boxplot(cars_data$Distance, horizontal = T, col = "orange", xlab = "Distance")
INFERENCE:
# 1. Age data is skewed to the right, with outliers.
# 2. Engineer and MBA hold only 0 and 1, like categorical data.
# 3. Work_Exp and Salary data are right-skewed.
# 4. Work_Exp and Salary data have outliers.
# 5. License holds only 0 and 1, like categorical data.
# 6. Distance data is almost symmetric, with a few outliers.
#Univariate Analysis - Factor Variables (Gender and Transport)
par(mfrow = c(1,1))
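The bar charts themselves were lost in extraction; a plausible sketch of what produced them (colour choices are mine):
barplot(table(cars_data$Gender), col = c("pink", "lightblue"), main = "Gender")
barplot(table(cars_data$Transport), col = c("grey", "red"), main = "Transport")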
## Inference:
# 1. Male employees outnumber female employees.
# 2. The target variable "Transport" is now a 2-level factor variable.
BIVARIATE ANALYSIS
cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)
# Inference: Age, Work_Exp, Salary and Transport are positively correlated.
DATA PREPARATION
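The pattern table below comes from mice::md.pattern(); the call itself was lost in knitting:
md.pattern(cars_data)  # rows = missing-data patterns, last column = number of variables missing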
##     Age Gender Engineer Work_Exp Salary Distance License Transport MBA
## 417   1      1        1        1      1        1       1         1   1 0
## 1     1      1        1        1      1        1       1         1   0 1
##       0      0        0        0      0        0       0         0   1 1
INFERENCE: 1 NA IN COLUMN MBA
summary(cars_data)
INFERENCE: THE NA PRESENT IN MBA HAS NO SIGNIFICANT EFFECT ON THE DATASET
(MEAN WITH NA IS 0.2614 AND MEAN WITHOUT NA IS 0.2608)
OUTLIERS TREATMENT
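The per-column listings below were presumably produced with the boxplot method; a minimal sketch that prints the flagged values for each continuous column:
# boxplot outliers (beyond 1.5 * IQR from the quartiles) per continuous column
for (col in c("Age", "Work_Exp", "Salary", "Distance")) {
  cat("COLUMN", col, "\n")
  print(boxplot.stats(cars_data[[col]])$out)
}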
COLUMN AGE
##  [1] 18 38 38 40 36 40 37 39 40 38 36 39 38 40 39 38 42 40 37 43 40 38 37 37 39
## [26] 36 36 18
COLUMN WORK_EXP
##  [1] 19 20 22 16 20 18 21 20 20 16 17 21 18 20 21 19 22 22 19 24 20 19 19 19 21
## [26] 16 16 18 16
COLUMN SALARY
##  [1] 23.8 36.9 28.8 37.0 23.8 23.0 48.0 42.0 51.0 45.0 34.0 45.0 42.9 41.0 40.9
## [16] 30.9 41.9 43.0 33.0 36.0 33.0 38.0 46.0 45.0 48.0 35.0 51.0 51.0 55.0 45.0
## [31] 42.0 52.0 38.0 57.0 44.0 45.0 47.0 50.0 36.6 25.9 34.8 28.8 28.7 28.7 34.9
## [46] 23.8 29.9 34.9 24.9 23.9 28.8 23.8
COLUMN DISTANCE
INFERENCE: NO TREATMENT OF THESE OUTLIERS IS REQUIRED BECAUSE NONE OF THE
VALUES IS IMPOSSIBLE OR AN OBVIOUS TYPO.
str(cars_data)
DATA SPLITTING
set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)
dim(train_data)
## [1] 292 9
dim(test_data)
## [1] 126 9
# Inference: The proportion of the target variable is spread well across the train and test datasets.
## 1. Train dataset = 292 observations and 9 variables (70% of 418)
## 2. Test dataset = 126 observations and 9 variables (30% of 418)
prop.table(table(train_data$Transport))
##
## 1 2
## 0.91780822 0.08219178
prop.table(table(test_data$Transport))
##
## 1 2
## 0.91269841 0.08730159
# Inference: Choice of transportation ratio
## 1. Train data: 8.2% of the company's employees use a car and 91.8% use other means (2-wheeler & public transport)
## 2. Test data: 8.7% of the company's employees use a car and 91.3% use other means (2-wheeler & public transport)
table(train_data$Transport)
##
## 1 2
## 268 24
# Inference:
# 1. 24 employees use a car and 268 employees use other means in the train data.
# 2. The dataset is imbalanced: the class of interest (employees using a car) is the minority.
#Applying SMOTE to the Train Dataset
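The SMOTE call itself was lost in knitting. A sketch using DMwR::SMOTE (the exact perc.over/perc.under values are my assumption; the knitted output shows the balanced set ended up with 152 majority and 43 minority rows):
set.seed(1234)
# synthesize extra minority (car) rows and undersample the majority class
cars_balanced = SMOTE(Transport ~ ., data = train_data,
                      perc.over = 100, perc.under = 350)
table(cars_balanced$Transport)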
##
## 1 2
## 152 43
prop.table(table(cars_balanced$Transport))
##
## 1 2
## 0.7794872 0.2205128
# Inference:
## 1. The class ratio has improved: SMOTE reduced the majority class and synthesized additional minority data points.
## 2. The minority class has increased from 8.2% of the train data to 22%.
MODEL BUILDING
summary(logr_model)
##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.271e-05 -2.100e-08 -2.100e-08 -2.100e-08 4.785e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -413.302 330983.397 -0.001 0.999
## Age 5.314 16507.769 0.000 1.000
## Gender2 27.078 36568.137 0.001 0.999
## Engineer1 -17.220 69074.842 0.000 1.000
## MBA1 -31.577 26597.452 -0.001 0.999
## Work_Exp -1.957 14727.803 0.000 1.000
## Salary 3.068 3212.092 0.001 0.999
## Distance 12.966 6397.152 0.002 0.998
## License1 -8.235 23781.297 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0575e+02 on 194 degrees of freedom
## Residual deviance: 1.4294e-08 on 186 degrees of freedom
## AIC: 18
##
## Number of Fisher Scoring iterations: 25
# Inference: No independent variable shows significance on the target variable, which suggests
# multicollinearity.
#Check for multicollinearity
vif(logr_model)
# Inference: Age, Work_Exp and Salary are highly collinear; their VIF values are greater than 5.
REMOVE AGE, SALARY AND WORK EXPERIENCE
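The refit code did not survive the knit; a sketch consistent with the Call shown in the output below (the name logr_model1 is my assumption):
# drop the collinear predictors flagged by VIF and refit
logr_train1 = logr_train[, !(names(logr_train) %in% c("Age", "Work_Exp", "Salary"))]
logr_model1 = glm(Transport ~ ., data = logr_train1, family = binomial)
summary(logr_model1)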
##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68217 -0.20885 -0.05587 -0.00565 2.13556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.3447 3.5185 -5.214 1.85e-07 ***
## Gender2 1.5977 0.8156 1.959 0.050105 .
## Engineer1 2.9951 1.3867 2.160 0.030785 *
## MBA1 -1.8855 0.8913 -2.115 0.034398 *
## Distance 0.8679 0.1593 5.449 5.07e-08 ***
## License1 2.5794 0.7349 3.510 0.000448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 205.747 on 194 degrees of freedom
## Residual deviance: 64.093 on 189 degrees of freedom
## AIC: 76.093
##
## Number of Fisher Scoring iterations: 7
# Inference: With Age, Work_Exp and Salary removed, the remaining independent variables (Gender,
# Engineer, MBA, Distance and License) now show significant effects on the target variable.
PREDICTION - TRAIN DATASET
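The prediction code was lost in knitting; a sketch consistent with the confusion matrix and metrics below (a 0.5 cut-off is assumed; note that the values the report labels "false positive" equal 148/152, i.e. the true negative rate):
pred_train = predict(logr_model1, newdata = logr_train1, type = "response")
conf_train = table(logr_train1$Transport, pred_train > 0.5)
TR_tpr = conf_train[2, 2] / sum(conf_train[2, ])       # 37/43
TR_fpr = conf_train[1, 1] / sum(conf_train[1, ])       # 148/152 (actually the true negative rate)
TR_accuracy = sum(diag(conf_train)) / sum(conf_train)  # 185/195
The test-set version is analogous, with newdata = logr_test.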
##
## FALSE TRUE
## 1 148 4
## 2 6 37
TRAIN - TRUE POSITIVE
TR_tpr
## [1] 0.8604651
TRAIN - FALSE POSITIVE
TR_fpr
## [1] 0.9736842
TRAIN - ACCURACY
TR_accuracy
## [1] 0.9487179
PREDICTION - TEST DATASET
##
## FALSE TRUE
## 1 112 3
## 2 2 9
TEST - TRUE POSITIVE
TE_tpr
## [1] 0.8181818
TEST - FALSE POSITIVE
TE_fpr
## [1] 0.973913
TEST - ACCURACY
TE_accuracy
## [1] 0.9603175
INFERENCE: THE LOGISTIC REGRESSION MODEL HAS 95% ACCURACY ON THE TRAIN DATASET AND 96%
ON THE TEST DATASET, WHICH CAN BE CONSIDERED A GOOD FIT.
APPLYING KNN
# Converting all the factor variables to numeric and then scaling the data
Knn_data = cars_balanced
Knn_data$Gender = as.numeric(Knn_data$Gender)
Knn_data$Engineer = as.numeric(Knn_data$Engineer)
Knn_data$MBA = as.numeric(Knn_data$MBA)
Knn_data$License = as.numeric(Knn_data$License)
Knn_data_test = test_data
Knn_data_test$Gender = as.numeric(Knn_data_test$Gender)
Knn_data_test$Engineer = as.numeric(Knn_data_test$Engineer)
Knn_data_test$MBA = as.numeric(Knn_data_test$MBA)
Knn_data_test$License = as.numeric(Knn_data_test$License)
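The scaling step and the Knn_train/Knn_test/Train_Knn/Test_Knn objects do not appear in the knitted output; a sketch of what likely produced them:
# scale the numeric predictors; leave the Transport factor untouched
Knn_train = Knn_data
Knn_train[, names(Knn_train) != "Transport"] = scale(Knn_train[, names(Knn_train) != "Transport"])
Knn_test = Knn_data_test
Knn_test[, names(Knn_test) != "Transport"] = scale(Knn_test[, names(Knn_test) != "Transport"])
# class proportions printed below
Train_Knn = prop.table(table(Knn_train$Transport))
Test_Knn = prop.table(table(Knn_test$Transport))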
str(Knn_train)
Knn_train$Transport = as.factor(Knn_train$Transport)
Knn_test$Transport = as.factor(Knn_test$Transport)
Train_Knn
##
## 1 2
## 0.7794872 0.2205128
Test_Knn
##
## 1 2
## 0.91269841 0.08730159
CHECKING THE SUMMARY OF THE KNN MODEL
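The training call was lost in knitting; a sketch that matches the output below (caret's default bootstrap resampling, 25 reps; tuneLength = 3 tries k = 5, 7, 9; the name Knn_model is my assumption):
set.seed(1234)
Knn_model = train(Transport ~ ., data = Knn_train, method = "knn", tuneLength = 3)
Knn_model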
## k-Nearest Neighbors
##
## 195 samples
## 8 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 195, 195, 195, 195, 195, 195, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9657679 0.8942526
## 7 0.9681026 0.9011497
## 9 0.9693036 0.9045171
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
PREDICTION USING KNN MODEL FOR TRAIN DATASET
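A sketch of the prediction step, consistent with the table and metrics below (predKNN_fit and Knn_train_table are the names shown in the output):
predKNN_fit = predict(Knn_model, newdata = Knn_train)
Knn_train_table = table(Knn_train$Transport, predKNN_fit)
Knn_train_accuracy = sum(diag(Knn_train_table)) / sum(Knn_train_table)
The test-set version is analogous, with newdata = Knn_test.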
KNN TRAIN - TRUE POSITIVE
Knn_train_tpr
## [1] 0.9534884
KNN TRAIN - FALSE POSITIVE
Knn_train_fpr
## [1] 0.9868421
KNN TRAIN - ACCURACY
Knn_train_accuracy
## [1] 0.9794872
Knn_train_table
## predKNN_fit
## 1 2
## 1 150 2
## 2 2 41
PREDICTION USING KNN MODEL FOR TEST DATASET
KNN TEST - TRUE POSITIVE
Knn_test_tpr
## [1] 0.8181818
KNN TEST - FALSE POSITIVE
Knn_test_fpr
## [1] 0.9652174
KNN TEST - ACCURACY
Knn_test_accuracy
## [1] 0.952381
Knn_test_table
## predKNN_fit_test
## 1 2
## 1 111 4
## 2 2 9
APPLYING NAIVE BAYES
library(e1071)
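The knitted output never shows where nb_train and nb_test are defined; presumably, as with the other models, they are the SMOTE-balanced train set and the untouched test set:
nb_train = cars_balanced  # assumption: the balanced train set
nb_test = test_data       # assumption: the untouched test set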
NBmodel = naiveBayes(Transport ~., data = nb_train)
PREDICTION USING TRAIN DATASET
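A sketch of the prediction step, consistent with the table below (predNB and table_NB_t are the names shown in the output):
predNB = predict(NBmodel, newdata = nb_train)
table_NB_t = table(nb_train$Transport, predNB)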
NAIVE BAYES TRAIN - TRUE POSITIVE
NB_train_tpr
## [1] 0.9767442
NAIVE BAYES TRAIN - FALSE POSITIVE
NB_train_fpr
## [1] 0.9605263
NAIVE BAYES TRAIN - ACCURACY
NB_train_accuracy
## [1] 0.9641026
table_NB_t
## predNB
## 1 2
## 1 146 6
## 2 1 42
PREDICTION USING TEST DATASET
NAIVE BAYES TEST - TRUE POSITIVE
NB_test_tpr
## [1] 0.8181818
NAIVE BAYES TEST - FALSE POSITIVE
NB_test_fpr
## [1] 0.9826087
NAIVE BAYES TEST - ACCURACY
NB_test_accuracy
## [1] 0.968254
table_NB
## predNB
## 1 2
## 1 113 2
## 2 2 9
APPLYING BAGGING
bag_train = cars_balanced
bag_test = test_data
str(bag_train)
str(bag_test)
CONVERTING TRAIN AND TEST DATASETS TO NUMERIC
bag_train$Transport = as.numeric(bag_train$Transport)
bag_train$Gender = as.numeric(bag_train$Gender)
bag_train$Engineer = as.numeric(bag_train$Engineer)
bag_train$MBA = as.numeric(bag_train$MBA)
bag_train$License = as.numeric(bag_train$License)
bag_test$Transport = as.numeric(bag_test$Transport)
bag_test$Gender = as.numeric(bag_test$Gender)
bag_test$Engineer = as.numeric(bag_test$Engineer)
bag_test$MBA = as.numeric(bag_test$MBA)
bag_test$License = as.numeric(bag_test$License)
BAGGING MODEL
library(ipred)
library(rpart)
BAModel = bagging(as.numeric(Transport) ~ ., data = bag_train,
                  control = rpart.control(maxdepth = 7, minsplit = 8))
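The prediction code was lost in knitting. The two single-column tables below plausibly come from checking whether the rounded predictions match the actual class; a sketch under that assumption:
# regression-style bagging returns numeric predictions (near 1 or 2)
bag_pred_train = predict(BAModel, newdata = bag_train)
table(bag_train$Transport, round(bag_pred_train) == bag_train$Transport)
bag_pred_test = predict(BAModel, newdata = bag_test)
table(bag_test$Transport, round(bag_pred_test) == bag_test$Transport)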
##
## TRUE
## 1 152
## 2 43
##
## TRUE
## 1 115
## 2 11
INFERENCE: THE PERCENTAGE OF EMPLOYEES USING A CAR IS 22% IN THE TRAIN DATA AND 8.7% IN THE TEST DATA.
BOOSTING
BS_train = cars_balanced
BS_test = test_data
CONVERTING TRAIN AND TEST DATASETS TO NUMERIC
BS_train$Gender = as.numeric(BS_train$Gender)
BS_train$Engineer = as.numeric(BS_train$Engineer)
BS_train$MBA = as.numeric(BS_train$MBA)
BS_train$License = as.numeric(BS_train$License)
BS_test$Gender = as.numeric(BS_test$Gender)
BS_test$Engineer = as.numeric(BS_test$Engineer)
BS_test$MBA = as.numeric(BS_test$MBA)
BS_test$License = as.numeric(BS_test$License)
# Boosting Model (commented out: it ran in the R script but failed to knit to Word)
# XGBModel = xgboost(data = features_train, label = lable_train, eta = 0.7,
#                    max_depth = 5, min_child_weight = 3, nrounds = 50, nfold = 5,
#                    objective = "binary:logistic", verbose = 0,
#                    early_stopping_rounds = 10)