Project 4 - Cars Dataset
PREDICTING MODE OF TRANSPORT
ASAAJU BABATUNDE
PROJECT BACKGROUND
Objective
The objective of this project is to understand which mode of transport the employees of a particular
organisation prefer for commuting to the workplace; that is, to predict whether or not an employee will
use a car as the mode of transport.
This requires exploring the available dataset, "Cars-dataset", in R for exploratory analysis,
description, and prediction. The dataset includes each employee's mode of transport as well as
personal and professional details such as age, salary, and work experience.
Assumptions
Dataset file format is in csv format.
Response variable is a factor data type with 2 levels
Dataset has no missing data values
Expectations
Exploratory Data Analysis - EDA
Perform an EDA on the data
Illustrate the insights based on EDA
What is the most challenging aspect of this problem? What method will you use to deal with
this? Comment (3 marks)
Data Preparation
Prepare the data for analysis
Modeling
Create multiple models and explore how each model performs using appropriate model
performance metrics
o KNN
o Naive Bayes (is it applicable here? comment and if it is not applicable, how can
you build an NB model in this case?)
o Logistic Regression
Apply both bagging and boosting modelling procedures to create 2 models and compare their
accuracy with the best model from the step above.
Actionable Insights & Recommendations
Summarize your findings from the exercise in a concise yet actionable note
EXPLORATORY DATA ANALYSIS
SETUP WORKING DIRECTORY
Setting a working directory at the start of the R session makes importing and exporting data files and
code files easier. The working directory is simply the location/folder on the PC where I keep the
dataset related to the project.
The path on my computer is C:/Users/OLIVIA/Desktop/DSBA/Dataset
Source code
setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")
Source code
cars_data = read.csv("Cars-dataset.csv", header = T)
VARIABLE IDENTIFICATION
R Function   Purpose
dim          Displays the total number of observations and the dimensions
names        Lists all the variables
head         Views the top observations, e.g. to spot possible missing values
tail         Views the bottom observations, e.g. to spot possible missing values
str          Shows the data type and structure of the variables
summary      Gives an overview of measures of centrality, distribution and dispersion
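For completeness, a minimal sketch of the first three calls from the table (the head, tail and summary calls follow below):
dim(cars_data)    # number of observations and variables
names(cars_data)  # variable names
str(cars_data)    # data type and structure of each variable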
head(cars_data)
tail(cars_data)
summary(cars_data)
##       Age          Gender       Engineer          MBA
##  Min.   :18.00   Female:121   Min.   :0.0000   Min.   :0.0000
##  1st Qu.:25.00   Male  :297   1st Qu.:0.2500   1st Qu.:0.0000
##  Median :27.00                Median :1.0000   Median :0.0000
##  Mean   :27.33                Mean   :0.7488   Mean   :0.2614
##  3rd Qu.:29.00                3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :43.00                Max.   :1.0000   Max.   :1.0000
##                                                NA's   :1
##     Work_Exp          Salary          Distance        License       Transport
##  Min.   : 0.000   Min.   : 6.500   Min.   : 3.20   Min.   :0.0000   0:383
##  1st Qu.: 3.000   1st Qu.: 9.625   1st Qu.: 8.60   1st Qu.:0.0000   1: 35
##  Median : 5.000   Median :13.000   Median :10.90   Median :0.0000
##  Mean   : 5.873   Mean   :15.418   Mean   :11.29   Mean   :0.2033
##  3rd Qu.: 8.000   3rd Qu.:14.900   3rd Qu.:13.57   3rd Qu.:0.0000
##  Max.   :24.000   Max.   :57.000   Max.   :23.40   Max.   :1.0000
##
colSums(is.na(cars_data))
Inferences:
1. The dataset has 418 observations of 9 variables (Age, Gender, Engineer, MBA,
Work_Exp, Salary, Distance, License and Transport).
2. Out of the 418 employees, 121 are female and 297 are male.
VISUALIZATION
table(cars_data$Transport)
##
## 0 1
## 383 35
prop.table(table(cars_data$Transport))
##
## 0 1
## 0.91626794 0.08373206
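A simple bar chart makes this imbalance visible. The report's original figure did not survive extraction; a minimal base-R sketch (colours and labels are my own choices):
# visualize the class imbalance in the target variable
barplot(table(cars_data$Transport),
        names.arg = c("Other (0)", "Car (1)"),
        col = c("steelblue", "tomato"),
        main = "Mode of Transport")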
Inference:
1. Employees choosing car as a means of transport: 35 (8.37%) - the minority class.
2. Employees choosing other means of transport (2-wheeler/public transport): 383 (91.63%) - the majority class.
UNIVARIATE ANALYSIS
(Histograms and boxplots of the individual variables are shown in the appendix.)
Inference:
1. Age data is right-skewed, with outliers.
2. Engineer and MBA hold only 0 and 1, i.e. categorical data.
3. Work_Exp and Salary data are right-skewed and contain outliers.
4. License holds only 0 and 1, i.e. categorical data.
5. Distance data is almost symmetric, with a few outliers.
BIVARIATE ANALYSIS
cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)
corrplot(cars_corr)
Inference: Age, Work_Exp, Salary and Transport are positively correlated.
DATA PREPARATION
cars_data[is.na(cars_data)] = 0  # replace the single NA (in MBA) with 0, the modal value
sapply(cars_data, function(x) sum(is.na(x)))
Inference: the single missing value in MBA has been imputed; no missing values remain.
OUTLIERS TREATMENT
Boxplot method was used.
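As an illustration, the values the boxplot method flags (points beyond 1.5 * IQR from the quartiles) can be listed with boxplot.stats(); a minimal sketch:
boxplot.stats(cars_data$Age)$out       # outlying ages
boxplot.stats(cars_data$Work_Exp)$out  # outlying work experience
boxplot.stats(cars_data$Salary)$out    # outlying salaries
boxplot.stats(cars_data$Distance)$out  # outlying distances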
Inference: no treatment of these outliers is required, because none of the values is impossible or an obvious typo (see appendix).
VARIABLE TRANSFORMATION
cars_data$Gender = as.factor(cars_data$Gender)
cars_data$Engineer = as.factor(cars_data$Engineer)
cars_data$MBA = as.factor(cars_data$MBA)
cars_data$License = as.factor(cars_data$License)
cars_data$Transport = as.factor(cars_data$Transport)
DATA SPLITTING
set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)
dim(train_data)
## [1] 292 9
dim(test_data)
## [1] 126 9
Inference:
The proportion of the target variable is spread well across the train and test
datasets: 292 observations (70%) in train and 126 observations (30%) in test.
MODEL BUILDING
logr_train = cars_balanced  # the SMOTE-balanced training set (see appendix)
logr_test = test_data
logr_model = glm(Transport~ ., data = logr_train, family = binomial)
See appendix
Inference: in the full model no independent variable is significant, which suggests multicollinearity; after dropping Age, Work_Exp and Salary (VIF > 5), the remaining predictors become significant (see appendix).
APPLYING KNN
See appendix
Inference: the KNN model has 97.9% accuracy on the train dataset and 95.2% on the test dataset.
APPLYING NAIVE BAYES
See appendix
Inference: the Naive Bayes model has 96.4% accuracy on the train dataset and 96.8% on the test dataset.
APPLYING BAGGING
See appendix
Inference:
Observation:
The percentage of employees using a car is 22% in the (SMOTE-balanced) train data.
The percentage of employees using a car is 8.7% in the test data.
APPLYING BOOSTING
*See appendix - the boosting model ran successfully in the R script (Markdown) but could not be
knitted into Word, so the code is commented out.
Inference:
ACTIONABLE INSIGHTS AND RECOMMENDATIONS
Boosting predicts the employees who use other modes of transport 100% correctly.
Observations
Age (older employees), Salary (higher pay) and Work Experience (longer service) are highly
influential factors in the choice of a car.
Appendix
# remove a stale package-install lock left behind by an interrupted install
unlink("C:/Users/OLIVIA/Documents/R/win-library/3.6/00LOCK", recursive = TRUE)
library(MASS)
library(dummies)
library(ggplot2)
library(caret)
library(Information)
library(caTools)
library(ROCR)
library(dplyr)
library(tidyr)
library(corrplot)
library(ggplot2)
library(GGally)
library(factoextra)
library(e1071)
library(lattice)
library(mice)
library(xgboost)
library(class)
library(gbm)
library(ipred)
library(rpart)
library(DMwR)
library(rms)
library(forcats)
WORKING DIRECTORY SETUP
setwd("C:/Users/OLIVIA/Desktop/DSBA/Dataset/")
IMPORT DATA
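The import call itself did not survive the knit; per the main text it is:
cars_data = read.csv("Cars-dataset.csv", header = T)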
CHANGE COLUMN NAMES
names(cars_data)
new_vars = c("Age","Gender","Engineer","MBA","Work_Exp","Salary","Distance","License","Transport")
colnames(cars_data) = new_vars
EXPLORATORY DATA ANALYSIS - EDA
dim(cars_data)
## [1] 418   9
INFERENCE: DATASET HAS 418 OBSERVATIONS FOR 9 VARIABLES (AGE, GENDER, ENGINEER,
MBA, WORK_EXP, SALARY, DISTANCE, LICENSE AND TRANSPORT)
str(cars_data) #View class of each feature along with the internal structure
CONVERTING TARGET VARIABLE TO A 2-LEVEL FACTOR: 1 FOR CAR AND 0 FOR OTHERS
(2-WHEELER AND PUBLIC TRANSPORT)
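The conversion code was lost in knitting; a sketch, assuming the raw Transport column carries labels such as "2Wheeler", "Public Transport" and "Car":
# 1 = Car, 0 = any other mode of transport (assumed raw labels)
cars_data$Transport = as.factor(ifelse(cars_data$Transport == "Car", 1, 0))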
str(cars_data)
MORE PRELIMINARY ANALYSIS
head(cars_data)
tail(cars_data)
summary(cars_data)
colSums(is.na(cars_data))
## Inference:
# 1. Out of the 418 employees, 121 are female and 297 are male.
# 2. MBA has a missing value.
# 3. Engineer, MBA and License are numeric but hold only 1 and 0.
# 4. Outliers seem to be present in Age, Work_Exp, Salary and Distance.
#VISUALIZATION
table(cars_data$Transport)
##
## 0 1
## 383 35
prop.table(table(cars_data$Transport))
##
## 0 1
## 0.91626794 0.08373206
## Inference:
# 1. Employees choosing car as means of transport: 35 (8.37%) - minority sample data.
# 2. Employees choosing other means of transport (2-wheeler/public transport): 383 (91.63%) - majority sample data.
#Univariate Analysis - Numeric Variables
hist(cars_data$Engineer, col = "purple", main = "Engineer")
boxplot(cars_data$Age, horizontal = T, col = "red", xlab = "Age")
boxplot(cars_data$MBA, horizontal = T, col = "blue", xlab = "MBA")
hist(cars_data$Salary, col = "purple", main = "Salary")
boxplot(cars_data$Salary, horizontal = T, col = "purple", xlab = "Salary")
hist(cars_data$Distance, col = "orange", main = "Distance")
boxplot(cars_data$Distance, horizontal = T, col = "orange", xlab = "Distance")
INFERENCE:
# 1. Age data is skewed to the right, with outliers.
# 2. Engineer and MBA hold only 0 and 1, like categorical data.
# 3. Work_Exp and Salary data are right-skewed.
# 4. Work_Exp and Salary data have outliers.
# 5. License holds only 0 and 1, like categorical data.
# 6. Distance data is almost symmetric, with a few outliers.
#Univariate Analysis - Factor Variables (Gender and Transport)
par(mfrow = c(1,1))
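The bar charts themselves were lost in extraction; a plausible sketch of what produced them (colour choices are mine):
barplot(table(cars_data$Gender), col = c("pink", "lightblue"), main = "Gender")
barplot(table(cars_data$Transport), col = c("grey", "red"), main = "Transport")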
## Inference:
# 1. Male employees outnumber female employees.
# 2. The target variable "Transport" is now a 2-level factor variable.
BIVARIATE ANALYSIS
cars_data$Gender = as.numeric(cars_data$Gender)
cars_data$Transport = as.numeric(cars_data$Transport)
cars_corr = cor(cars_data)
# Inference: Age, Work_Exp, Salary and Transport are positively correlated.
DATA PREPARATION
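The pattern table below comes from mice::md.pattern(); the call itself was lost in knitting:
md.pattern(cars_data)  # rows = missing-data patterns, last column = number of variables missing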
##     Age Gender Engineer Work_Exp Salary Distance License Transport MBA
## 417   1      1        1        1      1        1       1         1   1 0
## 1     1      1        1        1      1        1       1         1   0 1
##       0      0        0        0      0        0       0         0   1 1
INFERENCE: 1 NA IN COLUMN MBA
summary(cars_data)
INFERENCE: THE NA PRESENT IN MBA HAS NO SIGNIFICANT EFFECT ON THE DATASET
(MEAN WITH NA IS 0.2614 AND MEAN WITHOUT NA IS 0.2608)
OUTLIERS TREATMENT
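The per-column listings below were presumably produced with the boxplot method; a minimal sketch that prints the flagged values for each continuous column:
# boxplot outliers (beyond 1.5 * IQR from the quartiles) per continuous column
for (col in c("Age", "Work_Exp", "Salary", "Distance")) {
  cat("COLUMN", col, "\n")
  print(boxplot.stats(cars_data[[col]])$out)
}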
COLUMN AGE
##  [1] 18 38 38 40 36 40 37 39 40 38 36 39 38 40 39 38 42 40 37 43 40 38 37 37 39
## [26] 36 36 18
COLUMN WORK_EXP
##  [1] 19 20 22 16 20 18 21 20 20 16 17 21 18 20 21 19 22 22 19 24 20 19 19 19 21
## [26] 16 16 18 16
COLUMN SALARY
##  [1] 23.8 36.9 28.8 37.0 23.8 23.0 48.0 42.0 51.0 45.0 34.0 45.0 42.9 41.0 40.9
## [16] 30.9 41.9 43.0 33.0 36.0 33.0 38.0 46.0 45.0 48.0 35.0 51.0 51.0 55.0 45.0
## [31] 42.0 52.0 38.0 57.0 44.0 45.0 47.0 50.0 36.6 25.9 34.8 28.8 28.7 28.7 34.9
## [46] 23.8 29.9 34.9 24.9 23.9 28.8 23.8
COLUMN DISTANCE
INFERENCE: NO TREATMENT OF THESE OUTLIERS IS REQUIRED BECAUSE NONE OF THE
VALUES IS IMPOSSIBLE OR AN OBVIOUS TYPO.
str(cars_data)
DATA SPLITTING
set.seed(1234)
sample = sample.split(cars_data$Transport, SplitRatio = 0.7)
train_data = subset(cars_data, sample ==T)
test_data = subset(cars_data, sample == F)
dim(train_data)
## [1] 292 9
dim(test_data)
## [1] 126 9
# Inference: The proportion of the target variable is spread well across the train and test datasets.
## 1. Train dataset = 292 observations and 9 variables (70% of 418)
## 2. Test dataset = 126 observations and 9 variables (30% of 418)
prop.table(table(train_data$Transport))
##
## 1 2
## 0.91780822 0.08219178
prop.table(table(test_data$Transport))
##
## 1 2
## 0.91269841 0.08730159
# Inference: Choice of transportation ratio
## 1. Train data: 8.2% of the company's employees use a car and 91.8% use other means (2-wheeler & public transport)
## 2. Test data: 8.7% of the company's employees use a car and 91.3% use other means (2-wheeler & public transport)
table(train_data$Transport)
##
## 1 2
## 268 24
# Inference:
# 1. 24 employees use a car and 268 employees use other means in the train data.
# 2. The dataset is imbalanced: the class of interest (employees using a car) is the minority.
#Applying SMOTE to the Train Dataset
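The SMOTE call itself was lost in knitting. A sketch using DMwR::SMOTE (the exact perc.over/perc.under values are my assumption; the knitted output shows the balanced set ended up with 152 majority and 43 minority rows):
set.seed(1234)
# synthesize extra minority (car) rows and undersample the majority class
cars_balanced = SMOTE(Transport ~ ., data = train_data,
                      perc.over = 100, perc.under = 350)
table(cars_balanced$Transport)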
##
## 1 2
## 152 43
prop.table(table(cars_balanced$Transport))
##
## 1 2
## 0.7794872 0.2205128
# Inference:
## 1. The class ratio has improved: SMOTE reduced the majority class and synthesized additional minority data points.
## 2. The minority class has increased from 8.2% of the train data to 22%.
MODEL BUILDING
summary(logr_model)
##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.271e-05 -2.100e-08 -2.100e-08 -2.100e-08 4.785e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -413.302 330983.397 -0.001 0.999
## Age 5.314 16507.769 0.000 1.000
## Gender2 27.078 36568.137 0.001 0.999
## Engineer1 -17.220 69074.842 0.000 1.000
## MBA1 -31.577 26597.452 -0.001 0.999
## Work_Exp -1.957 14727.803 0.000 1.000
## Salary 3.068 3212.092 0.001 0.999
## Distance 12.966 6397.152 0.002 0.998
## License1 -8.235 23781.297 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0575e+02 on 194 degrees of freedom
## Residual deviance: 1.4294e-08 on 186 degrees of freedom
## AIC: 18
##
## Number of Fisher Scoring iterations: 25
# Inference: No independent variable shows significance on the target variable, which suggests
# multicollinearity.
#Check for multicollinearity
vif(logr_model)
# Inference: Age, Work_Exp and Salary are highly collinear; their VIF values are greater than 5.
REMOVE AGE, SALARY AND WORK EXPERIENCE
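The refit code did not survive the knit; a sketch consistent with the Call shown in the output below (the name logr_model1 is my assumption):
# drop the collinear predictors flagged by VIF and refit
logr_train1 = logr_train[, !(names(logr_train) %in% c("Age", "Work_Exp", "Salary"))]
logr_model1 = glm(Transport ~ ., data = logr_train1, family = binomial)
summary(logr_model1)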
##
## Call:
## glm(formula = Transport ~ ., family = binomial, data = logr_train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.68217 -0.20885 -0.05587 -0.00565 2.13556
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.3447 3.5185 -5.214 1.85e-07 ***
## Gender2 1.5977 0.8156 1.959 0.050105 .
## Engineer1 2.9951 1.3867 2.160 0.030785 *
## MBA1 -1.8855 0.8913 -2.115 0.034398 *
## Distance 0.8679 0.1593 5.449 5.07e-08 ***
## License1 2.5794 0.7349 3.510 0.000448 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 205.747 on 194 degrees of freedom
## Residual deviance: 64.093 on 189 degrees of freedom
## AIC: 76.093
##
## Number of Fisher Scoring iterations: 7
# Inference: With Age, Work_Exp and Salary removed, the remaining independent variables (Gender,
# Engineer, MBA, Distance and License) now show significant effects on the target variable.
PREDICTION - TRAIN DATASET
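The prediction code was lost in knitting; a sketch consistent with the confusion matrix and metrics below (a 0.5 cut-off is assumed; note that the values the report labels "false positive" equal 148/152, i.e. the true negative rate):
pred_train = predict(logr_model1, newdata = logr_train1, type = "response")
conf_train = table(logr_train1$Transport, pred_train > 0.5)
TR_tpr = conf_train[2, 2] / sum(conf_train[2, ])       # 37/43
TR_fpr = conf_train[1, 1] / sum(conf_train[1, ])       # 148/152 (actually the true negative rate)
TR_accuracy = sum(diag(conf_train)) / sum(conf_train)  # 185/195
The test-set version is analogous, with newdata = logr_test.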
##
## FALSE TRUE
## 1 148 4
## 2 6 37
TRAIN - TRUE POSITIVE
TR_tpr
## [1] 0.8604651
TRAIN - FALSE POSITIVE
TR_fpr
## [1] 0.9736842
TRAIN - ACCURACY
TR_accuracy
## [1] 0.9487179
PREDICTION - TEST DATASET
##
## FALSE TRUE
## 1 112 3
## 2 2 9
TEST - TRUE POSITIVE
TE_tpr
## [1] 0.8181818
TEST - FALSE POSITIVE
TE_fpr
## [1] 0.973913
TEST - ACCURACY
TE_accuracy
## [1] 0.9603175
INFERENCE: THE LOGISTIC REGRESSION MODEL HAS 95% ACCURACY ON THE TRAIN DATASET AND 96%
ON THE TEST DATASET, WHICH CAN BE CONSIDERED A GOOD FIT.
APPLYING KNN
# Converting all the factor variables to numeric and then scaling the data
Knn_data = cars_balanced
Knn_data$Gender = as.numeric(Knn_data$Gender)
Knn_data$Engineer = as.numeric(Knn_data$Engineer)
Knn_data$MBA = as.numeric(Knn_data$MBA)
Knn_data$License = as.numeric(Knn_data$License)
Knn_data_test = test_data
Knn_data_test$Gender = as.numeric(Knn_data_test$Gender)
Knn_data_test$Engineer = as.numeric(Knn_data_test$Engineer)
Knn_data_test$MBA = as.numeric(Knn_data_test$MBA)
Knn_data_test$License = as.numeric(Knn_data_test$License)
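The scaling step and the Knn_train/Knn_test/Train_Knn/Test_Knn objects do not appear in the knitted output; a sketch of what likely produced them:
# scale the numeric predictors; leave the Transport factor untouched
Knn_train = Knn_data
Knn_train[, names(Knn_train) != "Transport"] = scale(Knn_train[, names(Knn_train) != "Transport"])
Knn_test = Knn_data_test
Knn_test[, names(Knn_test) != "Transport"] = scale(Knn_test[, names(Knn_test) != "Transport"])
# class proportions printed below
Train_Knn = prop.table(table(Knn_train$Transport))
Test_Knn = prop.table(table(Knn_test$Transport))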
str(Knn_train)
Knn_train$Transport = as.factor(Knn_train$Transport)
Knn_test$Transport = as.factor(Knn_test$Transport)
Train_Knn
##
## 1 2
## 0.7794872 0.2205128
Test_Knn
##
## 1 2
## 0.91269841 0.08730159
CHECKING THE SUMMARY OF THE KNN MODEL
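The training call was lost in knitting; a sketch that matches the output below (caret's default bootstrap resampling, 25 reps; tuneLength = 3 tries k = 5, 7, 9; the name Knn_model is my assumption):
set.seed(1234)
Knn_model = train(Transport ~ ., data = Knn_train, method = "knn", tuneLength = 3)
Knn_model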
## k-Nearest Neighbors
##
## 195 samples
## 8 predictor
## 2 classes: '1', '2'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 195, 195, 195, 195, 195, 195, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9657679 0.8942526
## 7 0.9681026 0.9011497
## 9 0.9693036 0.9045171
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
PREDICTION USING KNN MODEL FOR TRAIN DATASET
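A sketch of the prediction step, consistent with the table and metrics below (predKNN_fit and Knn_train_table are the names shown in the output):
predKNN_fit = predict(Knn_model, newdata = Knn_train)
Knn_train_table = table(Knn_train$Transport, predKNN_fit)
Knn_train_accuracy = sum(diag(Knn_train_table)) / sum(Knn_train_table)
The test-set version is analogous, with newdata = Knn_test.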
KNN TRAIN - TRUE POSITIVE
Knn_train_tpr
## [1] 0.9534884
KNN TRAIN - FALSE POSITIVE
Knn_train_fpr
## [1] 0.9868421
KNN TRAIN - ACCURACY
Knn_train_accuracy
## [1] 0.9794872
Knn_train_table
## predKNN_fit
## 1 2
## 1 150 2
## 2 2 41
PREDICTION USING KNN MODEL FOR TEST DATASET
KNN TEST - TRUE POSITIVE
Knn_test_tpr
## [1] 0.8181818
KNN TEST - FALSE POSITIVE
Knn_test_fpr
## [1] 0.9652174
KNN TEST - ACCURACY
Knn_test_accuracy
## [1] 0.952381
Knn_test_table
## predKNN_fit_test
## 1 2
## 1 111 4
## 2 2 9
APPLYING NAIVE BAYES
library(e1071)
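The knitted output never shows where nb_train and nb_test are defined; presumably, as with the other models, they are the SMOTE-balanced train set and the untouched test set:
nb_train = cars_balanced  # assumption: the balanced train set
nb_test = test_data       # assumption: the untouched test set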
NBmodel = naiveBayes(Transport ~., data = nb_train)
PREDICTION USING TRAIN DATASET
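A sketch of the prediction step, consistent with the table below (predNB and table_NB_t are the names shown in the output):
predNB = predict(NBmodel, newdata = nb_train)
table_NB_t = table(nb_train$Transport, predNB)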
NAIVE BAYES TRAIN - TRUE POSITIVE
NB_train_tpr
## [1] 0.9767442
NAIVE BAYES TRAIN - FALSE POSITIVE
NB_train_fpr
## [1] 0.9605263
NAIVE BAYES TRAIN - ACCURACY
NB_train_accuracy
## [1] 0.9641026
table_NB_t
## predNB
## 1 2
## 1 146 6
## 2 1 42
PREDICTION USING TEST DATASET
NAIVE BAYES TEST - TRUE POSITIVE
NB_test_tpr
## [1] 0.8181818
NAIVE BAYES TEST - FALSE POSITIVE
NB_test_fpr
## [1] 0.9826087
NAIVE BAYES TEST - ACCURACY
NB_test_accuracy
## [1] 0.968254
table_NB
## predNB
## 1 2
## 1 113 2
## 2 2 9
APPLYING BAGGING
bag_train = cars_balanced
bag_test = test_data
str(bag_train)
str(bag_test)
CONVERTING TRAIN AND TEST DATASETS TO NUMERIC
bag_train$Transport = as.numeric(bag_train$Transport)
bag_train$Gender = as.numeric(bag_train$Gender)
bag_train$Engineer = as.numeric(bag_train$Engineer)
bag_train$MBA = as.numeric(bag_train$MBA)
bag_train$License = as.numeric(bag_train$License)
bag_test$Transport = as.numeric(bag_test$Transport)
bag_test$Gender = as.numeric(bag_test$Gender)
bag_test$Engineer = as.numeric(bag_test$Engineer)
bag_test$MBA = as.numeric(bag_test$MBA)
bag_test$License = as.numeric(bag_test$License)
BAGGING MODEL
library(ipred)
library(rpart)
BAModel = bagging(as.numeric(Transport) ~ ., data = bag_train,
                  control = rpart.control(maxdepth = 7, minsplit = 8))
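The prediction code was lost in knitting. The two single-column tables below plausibly come from checking whether the rounded predictions match the actual class; a sketch under that assumption:
# regression-style bagging returns numeric predictions (near 1 or 2)
bag_pred_train = predict(BAModel, newdata = bag_train)
table(bag_train$Transport, round(bag_pred_train) == bag_train$Transport)
bag_pred_test = predict(BAModel, newdata = bag_test)
table(bag_test$Transport, round(bag_pred_test) == bag_test$Transport)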
##
## TRUE
## 1 152
## 2 43
##
## TRUE
## 1 115
## 2 11
INFERENCE: THE PERCENTAGE OF EMPLOYEES USING A CAR IS 22% IN THE TRAIN DATA AND 8.7% IN THE TEST DATA.
BOOSTING
BS_train = cars_balanced
BS_test = test_data
CONVERTING TRAIN AND TEST DATASETS TO NUMERIC
BS_train$Gender = as.numeric(BS_train$Gender)
BS_train$Engineer = as.numeric(BS_train$Engineer)
BS_train$MBA = as.numeric(BS_train$MBA)
BS_train$License = as.numeric(BS_train$License)
BS_test$Gender = as.numeric(BS_test$Gender)
BS_test$Engineer = as.numeric(BS_test$Engineer)
BS_test$MBA = as.numeric(BS_test$MBA)
BS_test$License = as.numeric(BS_test$License)
# Boosting Model (commented out: it ran in the R script but failed to knit to Word)
# XGBModel = xgboost(data = features_train, label = lable_train, eta = 0.7,
#                    max_depth = 5, min_child_weight = 3, nrounds = 50, nfold = 5,
#                    objective = "binary:logistic", verbose = 0,
#                    early_stopping_rounds = 10)