
MACHINE LEARNING

PREDICTING MODE OF TRANSPORT

ASAAJU BABATUNDE
PROJECT BACKGROUND

Objective
The objective of this project is to understand which mode of transport the employees of a particular
organisation prefer when commuting to the workplace; that is, to predict whether or not an employee
will use a car as the mode of transport.

This requires exploring the available dataset, "Cars-dataset", in R for exploratory analysis,
description, and prediction. The dataset includes information about each employee's mode of
transport as well as personal and professional details such as age, salary, work experience, and so on.

Assumptions
• The dataset file is in CSV format.
• The response variable is a factor data type with 2 levels.
• The dataset has no missing data values.

Expectations
• Exploratory Data Analysis (EDA)
• Perform an EDA on the data
• Illustrate the insights based on EDA
• What is the most challenging aspect of this problem? What method will you use to deal with this? Comment (3 marks)
• Data Preparation
• Prepare the data for analysis
• Modeling
• Create multiple models and explore how each model performs using appropriate model performance metrics
o KNN
o Naive Bayes (is it applicable here? Comment, and if it is not applicable, how can you build an NB model in this case?)
o Logistic Regression
• Apply both bagging and boosting modeling procedures to create 2 models and compare their accuracy with the best model from the step above.
• Actionable Insights & Recommendations
• Summarize your findings from the exercise in a concise yet actionable note

EXPLORATORY DATA ANALYSIS

A typical data exploration activity consists of the following steps:

• Environment Set-up and Data Import
• Variable Identification
• Visualization
o Univariate Analysis
o Bi-Variate Analysis

ENVIRONMENT SET UP AND DATA IMPORT

Install (if necessary) and load all required library packages:

library(car)
library(MASS)
library(dummies)
library(ggplot2)
library(caret)
library(Information)
library(caTools)
library(ROCR)
library(dplyr)
library(tidyr)
library(corrplot)
library(GGally)
library(factoextra)
library(e1071)
library(lattice)
library(mice)
library(xgboost)
library(class)
library(gbm)
library(ipred)
library(rpart)
library(DMwR)
library(rms)

SETUP WORKING DIRECTORY

Setting a working directory at the start of the R session makes importing and exporting data files and
code files easier. Basically, the working directory is the location/folder on the PC where I keep the
dataset related to the project.
The path on my computer is C:/Users/OLIVIA/Desktop/DSBA/Dataset

Source code
setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")

IMPORT AND READ THE DATASET


The given dataset is in .csv format. Hence, the command 'read.csv' is used for importing the file.

Source code
cars_data = read.csv("Cars-dataset.csv", header = T)

VARIABLE IDENTIFICATION

R Function   Purpose
dim          Displays the total number of observations and the dimensions
names        Lists all the variables
head         Views the top observations, checking for possible missing values
tail         Views the bottom observations, checking for possible missing values
str          Shows the data type and structure of the variables present
summary      Gives an overview of measures of centrality, distribution and dispersion

head(cars_data)

## Age Gender Engineer MBA Work_Exp Salary Distance License Transport


## 1 28 Male 1 0 5 14.4 5.1 0 0
## 2 24 Male 1 0 6 10.6 6.1 0 0
## 3 27 Female 1 0 9 15.5 6.1 0 0
## 4 25 Male 0 0 1 7.6 6.3 0 0
## 5 25 Female 0 0 3 9.6 6.7 0 0
## 6 21 Male 0 0 3 9.5 7.1 0 0

tail(cars_data)

## Age Gender Engineer MBA Work_Exp Salary Distance License Transport


## 413 29 Female 1 0 6 14.9 17.0 0 0
## 414 29 Male 1 1 8 13.9 17.1 0 0
## 415 25 Male 1 0 3 9.9 17.2 0 0
## 416 27 Female 0 0 4 13.9 17.3 0 0
## 417 26 Male 1 1 2 9.9 17.7 0 0
## 418 23 Male 0 0 3 9.9 17.9 0 0

summary(cars_data)

## Age Gender Engineer MBA
## Min. :18.00 Female:121 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :297 1st Qu.:0.2500 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.33 Mean :0.7488 Mean :0.2614
## 3rd Qu.:29.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## NA's :1
## Work_Exp Salary Distance License Transport
## Min. : 0.000 Min. : 6.500 Min. : 3.20 Min. :0.0000 0:383
## 1st Qu.: 3.000 1st Qu.: 9.625 1st Qu.: 8.60 1st Qu.:0.0000 1: 35
## Median : 5.000 Median :13.000 Median :10.90 Median :0.0000
## Mean : 5.873 Mean :15.418 Mean :11.29 Mean :0.2033
## 3rd Qu.: 8.000 3rd Qu.:14.900 3rd Qu.:13.57 3rd Qu.:0.0000
## Max. :24.000 Max. :57.000 Max. :23.40 Max. :1.0000
##

colSums(is.na(cars_data))

##      Age   Gender Engineer      MBA Work_Exp   Salary Distance  License
##        0        0        0        1        0        0        0        0
## Transport
##         0

Inferences:

1. The dataset has 418 observations of 9 variables (Age, Gender, Engineer, MBA,
Work_Exp, Salary, Distance, License and Transport).

2. Data types: 5 integer variables, 2 numeric variables and 2 factor variables.

3. The target variable, "Transport", is a 3-level factor variable.

4. Out of the 418 employees, 121 are female and 297 are male.

5. MBA has one missing value.

6. Engineer, MBA and License are numeric but take only the values 1 and 0, i.e. they are
ordinal or categorical in character.

7. Outliers seem to be present in Age, Work_Exp, Salary and Distance.

Challenge: The target variable has 3 levels.

Solution: Conversion to two levels (car and others).


Source Code:
cars_data$Transport = ifelse(cars_data$Transport == "Car", 1, 0)
cars_data$Transport = factor(cars_data$Transport, levels = c(0,1))

VISUALIZATION

Creating a frequency table of the target variable:

table(cars_data$Transport)

##
## 0 1
## 383 35

prop.table(table(cars_data$Transport))

##
## 0 1
## 0.91626794 0.08373206

Inference:

1. Employees choosing car as the means of transport: 35 (8.37%) – the minority class.

2. Employees using other modes of transport (2-wheeler or public transport): 383 (91.63%) – the majority class.

UNIVARIATE ANALYSIS

See appendix for the graphical representations.

Inference:

1. Age data is skewed to the right, with outliers.

2. Engineer and MBA take only the values 0 and 1, like categorical data.

3. Work_Exp and Salary data are right skewed.

4. Work_Exp and Salary data have outliers.

5. License takes only the values 0 and 1, like categorical data.

6. Distance data is almost symmetric, with a few outliers.

7. Male employees outnumber female employees.


8. The target variable "Transport" is now a 2-level factor variable.

BIVARIATE ANALYSIS

cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)

corrplot(cars_corr, type = "upper", method = "number")

See appendix for the graphical representation.

Inference:

1. Age, Work Experience, Salary and Transport are positively correlated.

DATA PREPARATION

MISSING VALUE TREATMENT


Option 1: Impute using the mice package
Option 2: Replace NA with 0 (adopted)

cars_data[is.na(cars_data)] = 0
sapply(cars_data, function(x) sum(is.na(x)))
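
Option 1 was not adopted; for reference, a minimal sketch of mice-based imputation (assuming the package-default predictive mean matching method, "pmm") could look like this:

Source code
# Sketch only, not run here: impute the single NA in MBA instead of zero-filling
imputed = mice(cars_data, m = 5, method = "pmm", seed = 1234, printFlag = FALSE)
cars_imputed = complete(imputed, 1) # keep the first of the five completed datasets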

Inference:

1. The NA present in MBA has no significant effect when replaced with 0 (mean with the NA is 0.2614 and mean after replacement is 0.2608).

OUTLIERS TREATMENT
Boxplot method was used.

See appendix for display

Inference:

1. No treatment is required because none of the values is impossible or a typo.

VARIABLE TRANSFORMATION
cars_data$Gender = as.factor(cars_data$Gender)
cars_data$Engineer = as.factor(cars_data$Engineer)
cars_data$MBA = as.factor(cars_data$MBA)
cars_data$License = as.factor(cars_data$License)
cars_data$Transport = as.factor(cars_data$Transport)

DATA SPLITTING

set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)

dim(train_data)

## [1] 292 9

dim(test_data)

## [1] 126 9

Inference:

The proportion of the target variable is spread well across the train and test datasets.

1. Train data – 292 observations and 9 variables

2. Test data – 126 observations and 9 variables

3. 8.2% of the employees in the train data use car while 91.8% use other modes

4. 24 employees use car and 268 employees use other modes (train data)

5. The dataset is imbalanced; that is, the area of interest (employees using car) is in the minority.

DATA BALANCING -SMOTE

cars_balanced = SMOTE(Transport ~ ., perc.over = 81, data = train_data, k = 3, perc.under = 800)
table(cars_balanced$Transport)
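
The parameter choice can be checked by hand: with 24 minority (car) cases in the train data, perc.over = 81 creates roughly floor(0.81 × 24) = 19 synthetic minority cases (24 + 19 = 43 in total), and perc.under = 800 then samples 8 × 19 = 152 majority cases, giving the 152/43 (about 78%/22%) split shown in the appendix.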

Inference:

1. The minority class share increased to 22% of the training data.

MODEL BUILDING

APPLYING LOGISTIC REGRESSION

logr_train = cars_balanced
logr_test = test_data

logr_model = glm(Transport ~ ., data = logr_train, family = binomial)

See appendix

Inference:

Using all independent variables:

1. No independent variable shows significance on the target variable, which suggests multicollinearity.

2. Age, Work_Exp and Salary exhibit strong multicollinearity; their VIF values are greater than 5.

3. After removing these correlated variables, the remaining predictors (Gender, Engineer, MBA, Distance and License) show significant effects on the target variable.

4. Prediction: The logistic regression model has 95% accuracy on the train dataset and 96% accuracy on the test dataset, which can be considered a good fit.

APPLYING KNN

Knn_fit = train(Transport~., data = Knn_train, method ="knn")


Knn_fit

See appendix

Inference:

Prediction: The KNN model has 97.9% accuracy on the train dataset and 95.2% on the test dataset.

APPLYING NAÏVE BAYES

NBmodel = naiveBayes(Transport ~ ., data = nb_train)

See appendix

Inference:

Naive Bayes is applicable here: e1071::naiveBayes handles the numeric predictors with Gaussian class-conditional densities and the factor predictors with conditional frequency tables.

Prediction: The Naïve Bayes model has 96.4% accuracy on the train dataset and 96.8% on the test dataset.

APPLYING BAGGING

BAModel = bagging(as.numeric(Transport) ~ ., data = bag_train, control = rpart.control(maxdepth = 7, minsplit = 8))

See appendix

Observation:

The percentage of employees using car is 22% on the train data and 8.7% on the test data, the same as the SMOTE output. (With the target recoded to the numeric values 1 and 2, every bagging prediction exceeds the 0.5 cutoff, so these tables merely reproduce the class proportions; see the cutoff note in the appendix.)

APPLYING BOOSTING –XGBOOST

#XGBModel = xgboost(data = features_train, label = lable_train, eta = 0.7, max_depth = 5, min_child_weight = 3, nrounds = 50, nfold = 5, objective = "binary:logistic", verbose = 0, early_stopping_rounds = 10)

*See appendix. The model ran successfully in the R script (Markdown) but could not be knitted into Word, so the call is shown commented out.

Inference:

Prediction: Boosting model has 99.5% accuracy on the train dataset.

ACTIONABLE INSIGHTS AND RECOMMENDATIONS

              Logistic Regression   KNN      Naïve Bayes   Boosting
Sensitivity   86%                   95%      97.6%         98%
Specificity   97.3%                 97.6%    96%           100%
Accuracy      94.8%                 97.9%    96.68%        99.5%

The boosting model correctly identifies 100% of the employees using other modes of transport (specificity).
Observations
• Age (older employees), Salary (higher pay), and Work Experience (longer service) are highly influential factors in the choice of car.

Appendix

unlink("C:/Users/OLIVIA/Documents/R/win-library/3.6/00LOCK", recursive = TRUE)

#Importing all relevant libraries


library(car)

## Warning: package 'car' was built under R version 3.6.3

## Loading required package: carData

library(MASS)
library(dummies)

## dummies-1.5.6 provided by Decision Patterns

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(caret)

## Warning: package 'caret' was built under R version 3.6.3

## Loading required package: lattice

library(Information)

## Warning: package 'Information' was built under R version 3.6.3

library(caTools)

## Warning: package 'caTools' was built under R version 3.6.3

library(ROCR)

## Warning: package 'ROCR' was built under R version 3.6.3

## Loading required package: gplots

## Warning: package 'gplots' was built under R version 3.6.3

##
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':


##
## lowess

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select

## The following object is masked from 'package:car':


##
## recode

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

library(tidyr)
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.6.3

## corrplot 0.84 loaded

library(ggplot2)
library(GGally)

## Warning: package 'GGally' was built under R version 3.6.3

## Registered S3 method overwritten by 'GGally':


## method from
## +.gg ggplot2

##
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':


##
## nasa

library(factoextra)

## Warning: package 'factoextra' was built under R version 3.6.3

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(e1071)

## Warning: package 'e1071' was built under R version 3.6.3

library(lattice)
library(mice)

## Warning: package 'mice' was built under R version 3.6.3

##
## Attaching package: 'mice'

## The following objects are masked from 'package:base':


##
## cbind, rbind

library(xgboost)

## Warning: package 'xgboost' was built under R version 3.6.3

##
## Attaching package: 'xgboost'

## The following object is masked from 'package:dplyr':


##
## slice

library(class)
library(gbm)

## Warning: package 'gbm' was built under R version 3.6.3

## Loaded gbm 2.1.5

library(ipred)

## Warning: package 'ipred' was built under R version 3.6.3

library(rpart)
library(DMwR)

## Warning: package 'DMwR' was built under R version 3.6.3

## Loading required package: grid

## Registered S3 method overwritten by 'quantmod':


## method from
## as.zoo.data.frame zoo

library(rms)

## Warning: package 'rms' was built under R version 3.6.3

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.6.3

## Loading required package: survival

##
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
##
## cluster

## Loading required package: Formula

##
## Attaching package: 'Hmisc'

## The following object is masked from 'package:e1071':


##
## impute

## The following objects are masked from 'package:dplyr':


##
## src, summarize

## The following objects are masked from 'package:base':


##
## format.pval, units

## Loading required package: SparseM

##
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':


##
## backsolve

##
## Attaching package: 'rms'

## The following objects are masked from 'package:car':


##
## Predict, vif

library(forcats)

## Warning: package 'forcats' was built under R version 3.6.3

WORKING DIRECTORY SETUP

setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")

IMPORT DATA

cars_data = read.csv("Cars-dataset.csv", header = T)

CHANGE COLUMN NAMES

names(cars_data)

## [1] "Age" "Gender" "Engineer" "MBA" "Work.Exp" "Salary"


## [7] "Distance" "license" "Transport"

new_vars = c("Age","Gender","Engineer","MBA","Work_Exp","Salary","Distance","License","Transport")
colnames(cars_data) = new_vars

EXPLORATORY DATA ANALYSIS - EDA

dim(cars_data) # Total number of rows and columns

## [1] 418 9

INFERENCE: DATASET HAS 418 OBSERVATIONS OF 9 VARIABLES (AGE, GENDER, ENGINEER,
MBA, WORK_EXP, SALARY, DISTANCE, LICENSE AND TRANSPORT)

str(cars_data) #View class of each feature along with the internal structure

## 'data.frame': 418 obs. of 9 variables:


## $ Age : int 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : int 1 1 1 0 0 0 1 0 1 1 ...
## $ MBA : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Work_Exp : int 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ License : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...

INFERENCE: TARGET VARIABLE IS A FACTOR WITH 3 LEVELS

CONVERTING TARGET VARIABLE TO A 2-LEVEL FACTOR: 1 FOR CAR AND 0 FOR OTHERS
(2WHEELER AND PUBLIC TRANSPORT)

cars_data$Transport = ifelse(cars_data$Transport == "Car", 1, 0)


cars_data$Transport = factor(cars_data$Transport, levels = c(0,1))

str(cars_data)

## 'data.frame': 418 obs. of 9 variables:


## $ Age : int 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : int 1 1 1 0 0 0 1 0 1 1 ...
## $ MBA : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Work_Exp : int 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ License : int 0 0 0 0 0 0 0 0 0 1 ...
## $ Transport: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

#Inference: Integer Variables: 5, Numeric Variables: 2 and Factor Variables: 2

MORE PRELIMINARY ANALYSIS

head(cars_data)

## Age Gender Engineer MBA Work_Exp Salary Distance License Transport


## 1 28 Male 1 0 5 14.4 5.1 0 0
## 2 24 Male 1 0 6 10.6 6.1 0 0
## 3 27 Female 1 0 9 15.5 6.1 0 0
## 4 25 Male 0 0 1 7.6 6.3 0 0
## 5 25 Female 0 0 3 9.6 6.7 0 0
## 6 21 Male 0 0 3 9.5 7.1 0 0

tail(cars_data)

## Age Gender Engineer MBA Work_Exp Salary Distance License Transport


## 413 29 Female 1 0 6 14.9 17.0 0 0
## 414 29 Male 1 1 8 13.9 17.1 0 0
## 415 25 Male 1 0 3 9.9 17.2 0 0
## 416 27 Female 0 0 4 13.9 17.3 0 0
## 417 26 Male 1 1 2 9.9 17.7 0 0
## 418 23 Male 0 0 3 9.9 17.9 0 0

summary(cars_data)

## Age Gender Engineer MBA


## Min. :18.00 Female:121 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :297 1st Qu.:0.2500 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.33 Mean :0.7488 Mean :0.2614
## 3rd Qu.:29.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## NA's :1
## Work_Exp Salary Distance License Transport
## Min. : 0.000 Min. : 6.500 Min. : 3.20 Min. :0.0000 0:383
## 1st Qu.: 3.000 1st Qu.: 9.625 1st Qu.: 8.60 1st Qu.:0.0000 1: 35
## Median : 5.000 Median :13.000 Median :10.90 Median :0.0000
## Mean : 5.873 Mean :15.418 Mean :11.29 Mean :0.2033
## 3rd Qu.: 8.000 3rd Qu.:14.900 3rd Qu.:13.57 3rd Qu.:0.0000
## Max. :24.000 Max. :57.000 Max. :23.40 Max. :1.0000
##

colSums(is.na(cars_data))

##      Age   Gender Engineer      MBA Work_Exp   Salary Distance  License
##        0        0        0        1        0        0        0        0
## Transport
##         0

##Inference:
##1. Out of the 418 employees, 121 are female and 297 are male.
##2. MBA has one missing value.
##3. Engineer, MBA and License are numeric but take only the values 1 and 0.
##4. Outliers seem to be present in Age, Work_Exp, Salary and Distance.
#VISUALIZATION

table(cars_data$Transport)

##
## 0 1
## 383 35

prop.table(table(cars_data$Transport))

##
## 0 1
## 0.91626794 0.08373206

##Inference:
##1. Employees choosing car as the means of transport: 35 (8.37%) – minority sample.
##2. Employees choosing other means of transport (2-wheeler/public transport): 383 (91.63%) – majority sample.
#Univariate Analysis - Numeric Variables

hist(cars_data$Age, col ="red", main = "Age")

hist(cars_data$Engineer, col ="purple", main = "Engineer")

hist(cars_data$MBA, col ="blue", main = "MBA")

boxplot(cars_data$Age, horizontal = T, col = "red", xlab = "Age")

boxplot(cars_data$Engineer, horizontal = T, col ="purple", xlab = "Engineer")

boxplot(cars_data$MBA, horizontal = T, col = "blue", xlab = "MBA")

hist(cars_data$Work_Exp, col ="red", main = "Work_Exp")

hist(cars_data$Salary, col ="purple", main = "Salary")

boxplot(cars_data$Work_Exp, horizontal = T, col = "red", xlab = "Work_Exp")

boxplot(cars_data$Salary, horizontal = T, col ="purple", xlab = "Salary")

hist(cars_data$License, col ="blue", main = "License")

hist(cars_data$Distance, col = "orange", main = "Distance")

boxplot(cars_data$License, horizontal = T, col = "blue", xlab = "License")

boxplot(cars_data$Distance, horizontal = T, col = "orange", xlab = "Distance")

INFERENCE:

#1. Age data is skewed to the right, with outliers.
#2. Engineer and MBA take only the values 0 and 1, like categorical data.
#3. Work_Exp and Salary data are right skewed.
#4. Work_Exp and Salary data have outliers.
#5. License takes only the values 0 and 1, like categorical data.
#6. Distance data is almost symmetric, with a few outliers.
#Univariate Analysis - Factor Variables (Gender and Transport)

par(mfrow = c(1,1))

plot(cars_data$Gender, col = "red", main = "Gender")

plot(cars_data$Transport, col = "blue", main = "Transport")

##Inference:
##1. Male employees outnumber female employees.
##2. The target variable "Transport" is now a 2-level factor variable.

BIVARIATE ANALYSIS

cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)

corrplot(cars_corr, type = "upper", method = "number")

#Inference: Age, Work Experience, Salary and Transport are positively correlated.

DATA PREPARATION

#Missing value treatment


md.pattern(cars_data)

## Age Gender Engineer Work_Exp Salary Distance License Transport MBA
## 417 1 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 1 1 1 1 0 1
## 0 0 0 0 0 0 0 0 1 1

INFERENCE: 1 NA IN COLUMN MBA

#Option 2: Replace the NA in MBA with 0


cars_data[is.na(cars_data)] = 0
sapply(cars_data, function(x) sum(is.na(x)))

##      Age   Gender Engineer      MBA Work_Exp   Salary Distance  License
##        0        0        0        0        0        0        0        0
## Transport
##         0

summary(cars_data)

## Age Gender Engineer MBA


## Min. :18.00 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 1st Qu.:1.000 1st Qu.:0.2500 1st Qu.:0.0000
## Median :27.00 Median :2.000 Median :1.0000 Median :0.0000
## Mean :27.33 Mean :1.711 Mean :0.7488 Mean :0.2608
## 3rd Qu.:29.00 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :2.000 Max. :1.0000 Max. :1.0000
## Work_Exp Salary Distance License
## Min. : 0.000 Min. : 6.500 Min. : 3.20 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.: 9.625 1st Qu.: 8.60 1st Qu.:0.0000
## Median : 5.000 Median :13.000 Median :10.90 Median :0.0000
## Mean : 5.873 Mean :15.418 Mean :11.29 Mean :0.2033
## 3rd Qu.: 8.000 3rd Qu.:14.900 3rd Qu.:13.57 3rd Qu.:0.0000
## Max. :24.000 Max. :57.000 Max. :23.40 Max. :1.0000
## Transport
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.084
## 3rd Qu.:1.000
## Max. :2.000

INFERENCE: THE NA PRESENT IN MBA HAS NO SIGNIFICANT EFFECT ON THE DATASET
(MEAN WITH NA IS 0.2614 AND MEAN WITHOUT NA IS 0.2608)

OUTLIERS TREATMENT

COLUMN AGE

boxplot(cars_data$Age, plot = F)$out

## [1] 18 38 38 40 36 40 37 39 40 38 36 39 38 40 39 38 42 40 37 43 40 38 37 37 39
## [26] 36 36 18

COLUMN WORK_EXP

boxplot(cars_data$Work_Exp, plot =F)$out

## [1] 19 20 22 16 20 18 21 20 20 16 17 21 18 20 21 19 22 22 19 24 20 19 19 19 21
## [26] 16 16 18 16

COLUMN SALARY

boxplot(cars_data$Salary, plot = F)$out

## [1] 23.8 36.9 28.8 37.0 23.8 23.0 48.0 42.0 51.0 45.0 34.0 45.0 42.9 41.0 40.9
## [16] 30.9 41.9 43.0 33.0 36.0 33.0 38.0 46.0 45.0 48.0 35.0 51.0 51.0 55.0 45.0
## [31] 42.0 52.0 38.0 57.0 44.0 45.0 47.0 50.0 36.6 25.9 34.8 28.8 28.7 28.7 34.9
## [46] 23.8 29.9 34.9 24.9 23.9 28.8 23.8

COLUMN DISTANCE

boxplot(cars_data$Distance, plot = F)$out

## [1] 21.3 21.4 21.5 21.5 22.8 23.4

INFERENCE: NO TREATMENT OF THESE OUTLIERS IS REQUIRED BECAUSE NONE OF THE
VALUES IS IMPOSSIBLE OR A TYPO.

#Creating factor variables


cars_data$Gender = as.factor(cars_data$Gender)
cars_data$Engineer = as.factor(cars_data$Engineer)
cars_data$MBA = as.factor(cars_data$MBA)
cars_data$License = as.factor(cars_data$License)
cars_data$Transport = as.factor(cars_data$Transport)

str(cars_data)

## 'data.frame': 418 obs. of 9 variables:


## $ Age : int 28 24 27 25 25 21 23 23 24 28 ...
## $ Gender : Factor w/ 2 levels "1","2": 2 2 1 2 1 2 2 2 2 2 ...
## $ Engineer : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ Work_Exp : int 5 6 9 1 3 3 3 0 4 6 ...
## $ Salary : num 14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
## $ Distance : num 5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
## $ License : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
## $ Transport: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

DATA SPLITTING

set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)
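
#Note: caTools::sample.split stratifies on the supplied label, so the car share in the train and test subsets (8.2% and 8.7% below) stays close to the overall 8.37%.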

dim(train_data)

## [1] 292 9

dim(test_data)

## [1] 126 9

#Inference: The proportion of the target variable is spread well across the train and test datasets.
##1. Train dataset = 292 observations and 9 variables (70% of 418)
##2. Test dataset = 126 observations and 9 variables (30% of 418)
prop.table(table(train_data$Transport))

##
## 1 2
## 0.91780822 0.08219178

prop.table(table(test_data$Transport))

##
## 1 2
## 0.91269841 0.08730159

#Inference: Choice of transportation ratio
##1. Train data: 8.2% of the company employees use car and 91.8% use other means (2-wheeler & public transport)
##2. Test data: 8.7% of the company employees use car and 91.3% use other means (2-wheeler & public transport)
table(train_data$Transport)

##
## 1 2
## 268 24

#Inference:
#1. 24 employees use car and 268 employees use other means on the train data.
#2. The dataset is imbalanced; the area of interest (employees using car) is in the minority.
#Applying SMOTE to the Train Dataset

cars_balanced = SMOTE(Transport ~ ., perc.over = 81, data = train_data, k = 3, perc.under = 800)
table(cars_balanced$Transport)

##
## 1 2
## 152 43

prop.table(table(cars_balanced$Transport))

##
## 1 2
## 0.7794872 0.2205128

#Inference:
##1. The class ratio has improved: the SMOTE algorithm reduced the majority and oversampled the minority data points.
##2. The minority share has increased from 8.2% to 22%.

MODEL BUILDING

#Applying Logistic Regression


logr_train = cars_balanced
logr_test = test_data

logr_model = glm(Transport~ ., data = logr_train, family = binomial)

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logr_model)

##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.271e-05 -2.100e-08 -2.100e-08 -2.100e-08 4.785e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -413.302 330983.397 -0.001 0.999
## Age 5.314 16507.769 0.000 1.000
## Gender2 27.078 36568.137 0.001 0.999
## Engineer1 -17.220 69074.842 0.000 1.000
## MBA1 -31.577 26597.452 -0.001 0.999
## Work_Exp -1.957 14727.803 0.000 1.000
## Salary 3.068 3212.092 0.001 0.999
## Distance 12.966 6397.152 0.002 0.998
## License1 -8.235 23781.297 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0575e+02 on 194 degrees of freedom
## Residual deviance: 1.4294e-08 on 186 degrees of freedom
## AIC: 18
##
## Number of Fisher Scoring iterations: 25

#Inference: No independent variable shows significance on the target variable, which suggests multicollinearity.
#Check for multicollinearity

vif(logr_model)

##       Age   Gender2 Engineer1      MBA1  Work_Exp    Salary  Distance  License1
## 29.267325  4.962076  1.770887  3.243572 63.070635 17.095538  4.365911  2.312930
#Inference: Age, Work_Exp and Salary exhibit strong multicollinearity, because their VIF values are greater than 5.
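
#Note: The VIF of predictor j is 1/(1 - R²_j), where R²_j is the R-squared from regressing predictor j on the remaining predictors; VIF > 5 (i.e. R²_j > 0.8) is a common rule of thumb for problematic collinearity.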

REMOVE AGE, SALARY AND WORK EXPERIENCE

logr_train1 = logr_train[, -c(1,5,6)]


logr_model1 = glm(Transport~ ., data = logr_train1, family = binomial)
summary(logr_model1)

##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68217 -0.20885 -0.05587 -0.00565 2.13556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.3447 3.5185 -5.214 1.85e-07 ***
## Gender2 1.5977 0.8156 1.959 0.050105 .
## Engineer1 2.9951 1.3867 2.160 0.030785 *
## MBA1 -1.8855 0.8913 -2.115 0.034398 *
## Distance 0.8679 0.1593 5.449 5.07e-08 ***
## License1 2.5794 0.7349 3.510 0.000448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 205.747 on 194 degrees of freedom
## Residual deviance: 64.093 on 189 degrees of freedom
## AIC: 76.093
##
## Number of Fisher Scoring iterations: 7

#Inference: With the collinear variables removed, the remaining predictors (Gender, Engineer, MBA, Distance and License) now show significant effects on the target variable.

PREDICTION - TRAIN DATASET

logr_pred = predict(logr_model1, newdata = logr_train, type = "response")


logr_TR = table(logr_train$Transport, logr_pred > 0.5)
logr_TR

##
## FALSE TRUE
## 1 148 4
## 2 6 37

TR_tpr = logr_TR[2,2]/(logr_TR[2,1] + logr_TR[2,2])   # sensitivity (true positive rate)
TR_fpr = logr_TR[1,1]/(logr_TR[1,1] + logr_TR[1,2])   # note: despite the name, this is the true negative rate (specificity)
TR_accuracy = sum(diag(logr_TR))/sum(logr_TR)         # overall accuracy

TRAIN - TRUE POSITIVE RATE (SENSITIVITY)

TR_tpr

## [1] 0.8604651

TRAIN - TRUE NEGATIVE RATE (SPECIFICITY)

TR_fpr

## [1] 0.9736842

TRAIN - ACCURACY

TR_accuracy

## [1] 0.9487179

#Prediction -Test Dataset


logr_pred1 = predict(logr_model1, newdata = logr_test, type = "response")
logr_TE = table(logr_test$Transport, logr_pred1 > 0.5)
logr_TE

##
## FALSE TRUE
## 1 112 3
## 2 2 9

TE_tpr = logr_TE[2,2]/(logr_TE[2,1] + logr_TE[2,2])   # sensitivity (true positive rate)
TE_fpr = logr_TE[1,1]/(logr_TE[1,1] + logr_TE[1,2])   # true negative rate (specificity)
TE_accuracy = sum(diag(logr_TE))/sum(logr_TE)         # overall accuracy

TEST - TRUE POSITIVE RATE (SENSITIVITY)

TE_tpr

## [1] 0.8181818

TEST - TRUE NEGATIVE RATE (SPECIFICITY)

TE_fpr

## [1] 0.973913

TEST - ACCURACY

TE_accuracy

## [1] 0.9603175

INFERENCE: THE LOGISTIC REGRESSION MODEL HAS 95% ACCURACY ON THE TRAIN DATASET
AND 96% ON THE TEST DATASET, WHICH CAN BE CONSIDERED A GOOD FIT.

#Applying KNN Model

#Converting all the factor variables to numeric and then scaling the data
Knn_data = cars_balanced
Knn_data$Gender = as.numeric(Knn_data$Gender)
Knn_data$Engineer = as.numeric(Knn_data$Engineer)
Knn_data$MBA = as.numeric(Knn_data$MBA)
Knn_data$License = as.numeric(Knn_data$License)

Knn_data_test = test_data
Knn_data_test$Gender = as.numeric(Knn_data_test$Gender)
Knn_data_test$Engineer = as.numeric(Knn_data_test$Engineer)
Knn_data_test$MBA = as.numeric(Knn_data_test$MBA)
Knn_data_test$License = as.numeric(Knn_data_test$License)

Knn_train = as.data.frame(cbind(Transport = Knn_data[,9], scale(Knn_data[,-9])))
Knn_test = as.data.frame(cbind(Transport = Knn_data_test[,9], scale(Knn_data_test[,-9])))

str(Knn_train)

## 'data.frame': 195 obs. of 9 variables:


## $ Transport: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : num 0.2183 0.0207 -0.5719 1.0084 -2.1522 ...
## $ Gender : num 0.594 0.594 0.594 0.594 0.594 ...
## $ Engineer : num 0.442 0.442 0.442 0.442 0.442 ...
## $ MBA : num -0.601 -0.601 -0.601 1.654 -0.601 ...
## $ Work_Exp : num 0.0139 0.1773 -0.9663 0.9942 -1.293 ...
## $ Salary : num -0.382 -0.465 -0.85 1.288 -0.986 ...
## $ Distance : num -0.0367 -0.1848 -0.2835 -0.4316 0.0126 ...
## $ License : num -0.681 -0.681 1.461 1.461 -0.681 ...

Knn_train$Transport = as.factor(Knn_train$Transport)
Knn_test$Transport = as.factor(Knn_test$Transport)

#Check the proportion of the data in the Train and Test


Train_Knn = prop.table(table(Knn_train$Transport))
Test_Knn = prop.table(table(Knn_test$Transport))

Train_Knn

##
## 1 2
## 0.7794872 0.2205128

Test_Knn

##
## 1 2
## 0.91269841 0.08730159

CHECKING THE SUMMARY OF THE KNN MODEL

Knn_fit = train(Transport~., data = Knn_train, method ="knn")


Knn_fit

## k-Nearest Neighbors
##
## 195 samples
## 8 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 195, 195, 195, 195, 195, 195, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9657679 0.8942526
## 7 0.9681026 0.9011497
## 9 0.9693036 0.9045171
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
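
#Note: train() tried only its default grid (k = 5, 7, 9) with bootstrap resampling. A wider, cross-validated search could be requested explicitly; a minimal sketch (the grid below is illustrative, not part of the original run):

#ctrl = trainControl(method = "cv", number = 5)
#Knn_fit_cv = train(Transport ~ ., data = Knn_train, method = "knn", trControl = ctrl, tuneGrid = expand.grid(k = seq(3, 21, 2)))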

PREDICTION USING KNN MODEL FOR TRAIN DATASET

predKNN_fit = predict(Knn_fit, newdata = Knn_train [,-1], type = "raw")


Knn_train_table = table(Knn_train$Transport, predKNN_fit)
Knn_train_tpr = Knn_train_table[2,2]/(Knn_train_table[2,1] + Knn_train_table[2,2])   # sensitivity
Knn_train_fpr = Knn_train_table[1,1]/(Knn_train_table[1,1] + Knn_train_table[1,2])   # true negative rate (specificity)
Knn_train_accuracy = sum(diag(Knn_train_table))/sum(Knn_train_table)

KNN TRAIN - TRUE POSITIVE RATE (SENSITIVITY)

Knn_train_tpr

## [1] 0.9534884

KNN TRAIN - TRUE NEGATIVE RATE (SPECIFICITY)

Knn_train_fpr

## [1] 0.9868421

KNN TRAIN -ACCURACY

Knn_train_accuracy

## [1] 0.9794872

Knn_train_table

## predKNN_fit
## 1 2
## 1 150 2
## 2 2 41

PREDICTION USING KNN MODEL FOR TEST DATASET

predKNN_fit_test = predict(Knn_fit, newdata = Knn_test [,-1], type = "raw")


Knn_test_table = table(Knn_test$Transport, predKNN_fit_test)
Knn_test_tpr = Knn_test_table[2,2]/(Knn_test_table[2,1] + Knn_test_table[2,2])
Knn_test_fpr = Knn_test_table[1,1]/(Knn_test_table[1,1] + Knn_test_table[1,2])
Knn_test_accuracy = sum(diag(Knn_test_table))/sum(Knn_test_table)

KNN TEST - TRUE POSITIVE RATE (SENSITIVITY)

Knn_test_tpr

## [1] 0.8181818

KNN TEST - TRUE NEGATIVE RATE (SPECIFICITY)

Knn_test_fpr

## [1] 0.9652174

KNN TEST -ACCURACY

Knn_test_accuracy

## [1] 0.952381

Knn_test_table

## predKNN_fit_test
## 1 2
## 1 111 4
## 2 2 9

INFERENCE: THE KNN MODEL HAS 97.9% ACCURACY ON THE TRAIN DATASET AND 95.2% ON THE TEST DATASET.

#Applying Naive Bayes Model


nb_train = cars_balanced
nb_test = test_data

library(e1071)
NBmodel = naiveBayes(Transport ~., data = nb_train)

PREDICTION USING TRAIN DATASET

predNB = predict(NBmodel, newdata = nb_train)


table_NB_t = table(nb_train$Transport, predNB)
NB_train_tpr = table_NB_t[2,2]/(table_NB_t[2,1] + table_NB_t[2,2])
NB_train_fpr = table_NB_t[1,1]/(table_NB_t[1,1] + table_NB_t[1,2])
NB_train_accuracy = sum(diag(table_NB_t))/sum(table_NB_t)

NAIVE BAYES TRAIN - TRUE POSITIVE RATE (SENSITIVITY)

NB_train_tpr

## [1] 0.9767442

NAIVE BAYES TRAIN - TRUE NEGATIVE RATE (SPECIFICITY)

NB_train_fpr

## [1] 0.9605263

NAIVE BAYES TRAIN - ACCURACY

NB_train_accuracy

## [1] 0.9641026

table_NB_t

## predNB
## 1 2
## 1 146 6
## 2 1 42

PREDICTION USING TEST DATASET

predNB = predict(NBmodel, newdata = nb_test)


table_NB = table(nb_test$Transport, predNB)
NB_test_tpr = table_NB[2,2]/(table_NB[2,1] + table_NB[2,2])
NB_test_fpr = table_NB[1,1]/(table_NB[1,1] + table_NB[1,2])
NB_test_accuracy = sum(diag(table_NB))/sum(table_NB)

NAIVE BAYES TEST - TRUE POSITIVE RATE (SENSITIVITY)

NB_test_tpr

## [1] 0.8181818

NAIVE BAYES TEST - TRUE NEGATIVE RATE (SPECIFICITY)

NB_test_fpr

## [1] 0.9826087

NAIVE BAYES TEST - ACCURACY

NB_test_accuracy

## [1] 0.968254

table_NB

## predNB
## 1 2
## 1 113 2
## 2 2 9

#Inference: 96.8% accuracy on the test dataset


#BAGGING

bag_train = cars_balanced
bag_test = test_data

str(bag_train)

## 'data.frame': 195 obs. of 9 variables:


## $ Age : num 30 29 26 34 18 29 27 24 26 20 ...
## $ Gender : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 2 2 2 ...
## $ Engineer : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 1 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 2 1 2 ...
## $ Work_Exp : num 8 9 2 14 0 6 8 6 4 2 ...
## $ Salary : num 14.8 13.7 8.6 36.9 6.8 14.6 24.9 10.6 12.9 8.8 ...
## $ Distance : num 12 11.4 11 10.4 12.2 7.6 13 8.4 9.6 8.3 ...
## $ License : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1 ...
## $ Transport: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

str(bag_test)

## 'data.frame': 126 obs. of 9 variables:


## $ Age : int 25 26 24 28 23 28 21 30 24 23 ...
## $ Gender : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 2 ...
## $ Engineer : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ Work_Exp : int 3 4 6 10 4 5 3 8 4 0 ...
## $ Salary : num 9.6 12.6 12.7 19.7 11.6 14.8 9.8 14.7 12.7 7.7 ...
## $ Distance : num 6.7 7.5 8.7 9 10.7 10.8 11 11.4 11.7 11.7 ...
## $ License : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
## $ Transport: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

CONVERTING TRAIN AND TEST DATASETS TO NUMERIC

bag_train$Transport = as.numeric(bag_train$Transport)
bag_train$Gender = as.numeric(bag_train$Gender)
bag_train$Engineer = as.numeric(bag_train$Engineer)
bag_train$MBA = as.numeric(bag_train$MBA)
bag_train$License = as.numeric(bag_train$License)
bag_test$Transport = as.numeric(bag_test$Transport)
bag_test$Gender = as.numeric(bag_test$Gender)
bag_test$Engineer = as.numeric(bag_test$Engineer)
bag_test$MBA = as.numeric(bag_test$MBA)
bag_test$License = as.numeric(bag_test$License)

BAGGING MODEL

library(ipred)
library(rpart)
BAModel = bagging(as.numeric(Transport) ~ ., data = bag_train, control = rpart.control(maxdepth = 7, minsplit = 8))

#Prediction using Train Dataset


BApred_train = predict(BAModel, bag_train)
tabBA = table(bag_train$Transport, BApred_train > 0.5)
tabBA

##
## TRUE
## 1 152
## 2 43
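
#Note: Because Transport was recoded to the numeric values 1 and 2, every regression-style prediction exceeds 0.5, so the table above collapses into a single TRUE column that merely reproduces the class counts. With labels coded 1/2, the natural midpoint cutoff is 1.5; a sketch (not part of the original run):

#tabBA_cut = table(bag_train$Transport, BApred_train > 1.5)
#tabBA_cut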

INFERENCE: EVERY PREDICTION EXCEEDS THE 0.5 CUTOFF, SO THE TABLE REPRODUCES THE TRAIN CLASS PROPORTIONS: 22% OF EMPLOYEES USE CAR.

#Prediction using Test Dataset


BApred_test = predict(BAModel, test_data)
tabBA1 = table(test_data$Transport, BApred_test > 0.5)
tabBA1

##
## TRUE
## 1 115
## 2 11

INFERENCE: LIKEWISE, THE TEST TABLE REPRODUCES THE CLASS PROPORTIONS: 8.7% OF EMPLOYEES USE CAR.

BOOSTING

BS_train = cars_balanced
BS_test = test_data

CONVERTING TRAIN AND TEST DATASETS TO NUMERIC

BS_train$Gender = as.numeric(BS_train$Gender)
BS_train$Engineer = as.numeric(BS_train$Engineer)
BS_train$MBA = as.numeric(BS_train$MBA)
BS_train$License = as.numeric(BS_train$License)

BS_test$Gender = as.numeric(BS_test$Gender)
BS_test$Engineer = as.numeric(BS_test$Engineer)
BS_test$MBA = as.numeric(BS_test$MBA)
BS_test$License = as.numeric(BS_test$License)

#Setting up features for the Boosting Model


features_train = as.matrix(BS_train[,1:8])
lable_train = as.matrix(BS_train[,9])
features_test = as.matrix(BS_test[,1:8])

#Boosting Model
#XGBModel = xgboost(data = features_train, label = lable_train, eta = 0.7, max_depth = 5, min_child_weight = 3, nrounds = 50, nfold = 5, objective = "binary:logistic", verbose = 0, early_stopping_rounds = 10)

#Predict using Train dataset


#XGBpred_train = predict(XGBModel, features_train)
#tabXGB_train = table(BS_train$Transport, XGBpred_train > 0.5)
#tabXGB_train

#Boost_tpr = tabXGB_train[2,2]/(tabXGB_train[2,1] + tabXGB_train[2,2])


#Boost_fpr = tabXGB_train[1,1]/(tabXGB_train[1,1] + tabXGB_train[1,2])
#Boost_accuracy = sum(diag(tabXGB_train))/sum(tabXGB_train)

#Boosting -Train True Positive


#Boost_tpr

#Boosting -Train False Positive


#Boost_fpr

#Boosting -Train Accuracy


#Boost_accuracy
#Inference: 99.5% Accuracy
#Predict using Test dataset

#XGBpred_test = predict(XGBModel, features_test)


#tabXGB_test = table(BS_test$Transport, XGBpred_test > 0.5)
#tabXGB_test

#Boost_tpr1 = tabXGB_test[2,2]/(tabXGB_test[2,1] + tabXGB_test[2,2])


#Boost_fpr1 = tabXGB_test[1,1]/(tabXGB_test[1,1] + tabXGB_test[1,2])
#Boost_accuracy1 = sum(diag(tabXGB_test))/sum(tabXGB_test)

#Boosting -Test True Positive


#Boost_tpr1

#Boosting -Test False Positive


#Boost_fpr1

#Boosting -Test Accuracy


#Boost_accuracy1
