Machine Learning Project On Cars
Problem Statement
This project requires you to understand which mode of transport employees prefer for commuting to
their office. The attached data 'Cars.csv' includes employee information about their mode of transport
as well as their personal and professional details such as age, salary and work experience. We need to predict
whether or not an employee will use Car as a mode of transport. Also, which variables are
significant predictors of this decision?
1.EDA
2.Data Preparation
3.Modeling
Create multiple models and explore how each model performs using appropriate model
performance metrics
o KNN
o Naive Bayes (is it applicable here? comment and if it is not applicable, how can
you build an NB model in this case?)
o Logistic Regression
Apply both bagging and boosting modeling procedures to create 2 models and compare their
accuracy with that of the best model from the above step.
Summarize your findings from the exercise in a concise yet actionable note
Data Importing –
setwd("C:\\Users\\Bhumika\\Documents\\Analytics\\Project - 5")
library(readr)
Cars=read_csv("Cars_edited.csv")
View(Cars)
Cars$MBA = as.factor(Cars$MBA)
Cars$license = as.factor(Cars$license)
summary(Cars)
Cars[Cars=="Male"]<- 0
hist(Cars$Age)
hist(as.numeric(Cars$Engineer))
hist(as.numeric(Cars$MBA))
hist(Cars$Work_Exp)
hist(Cars$Salary)
hist(Cars$Distance)
hist(as.numeric(Cars$license))
Bivariate Analysis –
boxplot(Cars$Age ~Cars$Engineer, main = "Age vs Eng.")
The age distribution is similar for engineers and non-engineers, which suggests that the firm employs people of all qualifications at every level of work experience.
boxplot(Cars$Salary ~Cars$Engineer, main = "Salary vs Eng.")
boxplot(Cars$Work_Exp ~ Cars$Gender)
There is little difference in median work experience between the two genders (a boxplot shows medians, not means), so
work experience is similarly distributed for males and females.
Cars[Cars=="Public Transport"]<- 0
Cars[Cars=="Car"]<- 1
library(usdm) # vifcor() is provided by the usdm package
vifcor(Cars[-9])
Work_Exp
After excluding the collinear variables, the linear correlation coefficients ranges between:
min correlation ( Salary ~ MBA ): -0.007592236
max correlation ( Salary ~ Age ): 0.8607652
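The variance inflation factor behind this collinearity check can be illustrated in base R. The following is a minimal sketch on simulated data, not on the Cars data: each predictor is regressed on the others and VIF = 1 / (1 - R^2). The column names and the simulated link between age and work_exp are assumptions made for illustration only.

```r
# Sketch: variance inflation factors computed by hand (simulated data, not Cars.csv)
set.seed(1)
n <- 200
age      <- rnorm(n, mean = 30, sd = 4)
work_exp <- age - 22 + rnorm(n, sd = 0.5)   # assumed: experience tracks age closely
distance <- rnorm(n, mean = 11, sd = 3)     # assumed: unrelated to the other two
df <- data.frame(age, work_exp, distance)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the rest
vif_by_hand <- sapply(names(df), function(v) {
  fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_by_hand, 2))
```

In this toy setup the two strongly related columns receive large VIF values while the unrelated column stays close to 1, which is the pattern that leads a procedure such as vifcor to flag one of the related variables.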
Remove Work_Exp, which vifcor flags as collinear –
Cars <- Cars[-5]
names(Cars)
Outlier Treatment (capping at the 95th percentile) –
quantile(Cars$Age, c(0.95))
Cars$Age[which(Cars$Age>38)]<- 38
quantile(Cars$Age, c(0.95))
95%
38
quantile(Cars$Salary,c(0.95))
Cars$Salary[which(Cars$Salary>43)] <- 43
quantile(Cars$Salary,c(0.95))
95%
43
quantile(Cars$Distance,c(0.95))
Cars$Distance[which(Cars$Distance>17.89)] <- 17.89
quantile(Cars$Distance,c(0.95))
95%
17.89
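The capping steps above follow one pattern: find the 95th percentile, then set every larger value to that limit. A small helper can express this once. This is a sketch only; cap_upper is a hypothetical name, and it uses the exact quantile rather than the rounded thresholds typed in by hand above.

```r
# Sketch: cap (winsorize) a numeric vector at an upper quantile
cap_upper <- function(x, prob = 0.95) {
  limit <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, limit)              # values above the limit are set to the limit
}

x <- c(1:19, 100)             # toy vector with one extreme value
capped <- cap_upper(x)
print(max(capped))
```

Values at or below the limit are left unchanged, so the row count stays the same; the observations are capped, not removed.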
SMOTE -
# SMOTE
library(DMwR)
library(caret)
set.seed(42)
summary(Carsdata$Transport)
index=createDataPartition(y=Carsdata$Transport,p=0.7,list=FALSE)
traindata=Carsdata[index,]
table(traindata$Transport)
0 1
171 129
testdata=Carsdata[-index,]
table(testdata$Transport)
0 1
73 54
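The SMOTE call that produces Carsdata is not shown above. As a rough illustration of the idea only, the sketch below creates synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It is not the DMwR implementation; smote_sketch is a hypothetical helper, it handles numeric columns only, and the toy data are made up.

```r
# Sketch of the SMOTE idea for numeric features (not the DMwR implementation)
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  d <- as.matrix(dist(X_min))            # pairwise distances among minority rows
  diag(d) <- Inf                         # a row is not its own neighbour
  k <- min(k, nrow(X_min) - 1)
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (i in seq_len(n_new)) {
    a   <- sample(nrow(X_min), 1)        # random minority row
    nn  <- order(d[a, ])[seq_len(k)]     # its k nearest minority neighbours
    b   <- nn[sample(length(nn), 1)]     # pick one of those neighbours
    gap <- runif(1)                      # interpolation weight in [0, 1]
    synth[i, ] <- X_min[a, ] + gap * (X_min[b, ] - X_min[a, ])
  }
  colnames(synth) <- colnames(X_min)
  synth
}

set.seed(42)
minority <- cbind(Age = rnorm(20, 36, 3), Salary = rnorm(20, 37, 12))  # toy data
new_rows <- smote_sketch(minority, n_new = 30)
print(dim(new_rows))
```

Because every synthetic row lies on a line segment between two real minority rows, the new values stay inside the observed range of each column.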
Logistic Regression -
lgmodel <- glm(formula= Transport ~.,traindata, family=binomial)
lgmodel
Coefficients:
(Intercept)       Age    Gender  Engineer       MBA    Salary  Distance   license
  -92.10708   2.54987   7.07060   1.66439  -6.15736  -0.08231   1.00486   2.82411
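One way to read these coefficients is to exponentiate them, which turns each into an odds ratio. The sketch below copies the values from the output above; it does not refit the model. The printed model contains no standard errors, so statistical significance cannot be read from it; summary(lgmodel) would report them.

```r
# Coefficients copied from the glm output above
coefs <- c(Age = 2.54987, Gender = 7.07060, Engineer = 1.66439, MBA = -6.15736,
           Salary = -0.08231, Distance = 1.00486, license = 2.82411)

# exp(coefficient) = multiplicative change in the odds of Transport = 1
# for a one-unit increase in the predictor, other predictors held fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

An odds ratio above 1 means the predictor raises the odds of using a car, and a value below 1 means it lowers them.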
Naive Bayes -
library(e1071)
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
A-priori probabilities:
Y
0 1
0.57 0.43
Conditional probabilities:
Age
Y [,1] [,2]
0 26.56725 2.904334
1 35.79332 3.221259
Gender
Y [,1] [,2]
0 0.2339181 0.4245640
1 0.2103642 0.4060913
Engineer
Y [,1] [,2]
0 1.789474 0.4088798
1 1.872710 0.3313629
MBA
Y [,1] [,2]
0 1.304094 0.4613735
1 1.211885 0.4053939
Salary
Y [,1] [,2]
0 13.47310 5.488242
1 36.90872 12.250825
Distance
Y [,1] [,2]
0 10.61111 3.047689
1 15.71253 3.301019
license
Y [,1] [,2]
0 1.134503 0.3421939
1 1.775857 0.4124676
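For numeric predictors, naiveBayes in e1071 reports the per-class mean in column [,1] and the standard deviation in column [,2], and it scores new rows with a normal density. The sketch below applies that rule by hand to Age alone, using the priors and Age statistics printed above. The employee aged 32 is a hypothetical example, and ignoring the other predictors is a simplification.

```r
# Per-class mean and sd of Age, and the priors, copied from the output above
prior <- c("0" = 0.57, "1" = 0.43)
mu    <- c("0" = 26.56725, "1" = 35.79332)
sigma <- c("0" = 2.904334, "1" = 3.221259)

age  <- 32                                  # hypothetical employee
lik  <- dnorm(age, mean = mu, sd = sigma)   # Gaussian likelihood under each class
post <- prior * lik / sum(prior * lik)      # posterior using Age alone
print(round(post, 3))
```

The full model multiplies one such likelihood per predictor before normalising, which is where the "naive" independence assumption enters.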
NB_predictions <- predict(NBmodel,testdata)
table(NB_predictions,testdata$Transport)
NB_predictions 0 1
0 70 6
1 3 48
Confusion Matrix
confusionMatrix(NB_predictions,testdata$Transport)
Reference
Prediction 0 1
0 70 6
1 3 48
Accuracy : 0.9291
95% CI : (0.8697, 0.9671)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16
Kappa : 0.854
'Positive' Class : 0
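The summary statistics can be recomputed directly from the four cells of the confusion matrix. The sketch below does this for the Naive Bayes matrix above, treating class 0 as the positive class as the report does.

```r
# Naive Bayes confusion matrix from above: rows = prediction, columns = reference
cm <- matrix(c(70, 3, 6, 48), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n           <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # positive class is "0"
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = kappa_hat), 4))
```

This reproduces the reported accuracy (0.9291) and kappa (0.854), and gives a sensitivity of 70/73 = 0.9589 and a specificity of 48/54 = 0.8889 for Naive Bayes.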
KNN -
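library(class)
# Assumption: trControl (the resampling setup) and the tuning grid (k = 2 to 20 in the output below) are defined before this call; they are not shown.
# Assumption: the opening of the train() call was lost; the formula Transport ~ . is assumed, matching the other models.
KNNmod <- train(Transport ~ .,
method = "knn",
trControl = trControl,
metric = "Accuracy",
preProcess = c("center","scale"),
data = traindata)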
KNNmod
k-Nearest Neighbors
300 samples
7 predictor
2 classes: '0', '1'
k Accuracy Kappa
2 0.9501001 0.8979187
3 0.9502076 0.8985918
4 0.9469818 0.8922827
5 0.9504227 0.8993616
6 0.9501001 0.8979187
7 0.9434334 0.8841002
8 0.9401001 0.8774934
9 0.9365369 0.8690393
10 0.9433185 0.8835794
11 0.9399852 0.8768563
12 0.9399852 0.8768618
13 0.9265369 0.8491158
14 0.9266518 0.8499538
15 0.9232036 0.8422598
16 0.9134186 0.8206479
17 0.9067519 0.8063086
18 0.9098628 0.8135544
19 0.8933037 0.7789983
20 0.8865295 0.7643311
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
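KNN relies on distances, so predictors on large scales would dominate without the "center" and "scale" preprocessing used above. The sketch below shows the same idea outside caret, with knn() from the class package on toy data (not Cars.csv). The variable names and the rule that generates the labels are assumptions for illustration.

```r
library(class)   # provides knn(); ships with standard R installations

set.seed(42)
# Toy data: two numeric predictors on different scales
n <- 100
train_x <- data.frame(age = rnorm(n, 30, 5), salary = rnorm(n, 20, 10))
train_y <- factor(ifelse(train_x$age > 30, 1, 0))
test_x  <- data.frame(age = c(24, 38), salary = c(15, 25))

# Center and scale with the TRAINING means and sds, then apply them to the test rows
ctr <- colMeans(train_x)
scl <- apply(train_x, 2, sd)
train_s <- scale(train_x, center = ctr, scale = scl)
test_s  <- scale(test_x,  center = ctr, scale = scl)

pred <- knn(train = train_s, test = test_s, cl = train_y, k = 5)
print(pred)
```

Reusing the training statistics for the test rows avoids leaking information from the test set into the preprocessing step.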
KNN_predictions <- predict(KNNmod,testdata)
confusionMatrix(KNN_predictions, testdata$Transport)
Reference
Prediction 0 1
0 69 3
1 4 51
Accuracy : 0.9449
95% CI : (0.8897, 0.9776)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8875
Sensitivity : 0.9452
Specificity : 0.9444
Pos Pred Value : 0.9583
Neg Pred Value : 0.9273
Prevalence : 0.5748
Detection Rate : 0.5433
Detection Prevalence : 0.5669
Balanced Accuracy : 0.9448
'Positive' Class : 0
summary(testdata$Transport)
0 1
73 54
Bagging -
library(gbm)
library(xgboost)
library(caret)
library(ipred)
library(plyr)
library(rpart)
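# Assumption: the opening of this call and the prediction step were lost; mod.bag is a hypothetical name.
mod.bag <- bagging(Transport ~ .,
data=traindata,
control=rpart.control(maxdepth=5, minsplit=4))
bag.pred <- predict(mod.bag, testdata)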
confusionMatrix(bag.pred,testdata$Transport)
Reference
Prediction 0 1
0 68 2
1 5 52
Accuracy : 0.9449
95% CI : (0.8897, 0.9776)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16
Kappa : 0.888
Sensitivity : 0.9315
Specificity : 0.9630
Pos Pred Value : 0.9714
Neg Pred Value : 0.9123
Prevalence : 0.5748
Detection Rate : 0.5354
Detection Prevalence : 0.5512
Balanced Accuracy : 0.9472
'Positive' Class : 0
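Bagging fits one tree per bootstrap resample of the training data and combines the trees by majority vote. The sketch below spells that out with rpart on toy data (not Cars.csv). It is not the ipred implementation; the number of trees, the variable names and the rule that generates the labels are assumptions, while the tree controls mirror the ones used above.

```r
library(rpart)   # recommended package, normally installed with R

set.seed(42)
# Toy data with a noisy linear class boundary
n <- 200
toy <- data.frame(age = rnorm(n, 30, 5), distance = rnorm(n, 11, 3))
toy$y <- factor(ifelse(toy$age + 0.5 * toy$distance + rnorm(n) > 35.5, 1, 0))
train <- toy[1:150, ]
test  <- toy[151:200, ]

# Fit one depth-limited tree per bootstrap resample of the training rows
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(y ~ ., data = boot, method = "class",
        control = rpart.control(maxdepth = 5, minsplit = 4))
})

# Majority vote: share of trees voting for class "1", cut at 0.5
votes <- sapply(trees, function(tr) as.character(predict(tr, test, type = "class")) == "1")
bag_pred <- factor(ifelse(rowMeans(votes) > 0.5, 1, 0), levels = c(0, 1))
print(mean(bag_pred == test$y))   # test accuracy on the toy data
```

Averaging over resamples reduces the variance of a single deep tree, which is the main motivation for bagging.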
Boosting -
mod.boost <- gbm(Transport ~ .,data=traindata, distribution=
summary(mod.boost)
var rel.inf
Age Age 84.1483764
Salary Salary 8.9887923
Distance Distance 5.3097642
MBA MBA 0.9294766
license license 0.4844714
Gender Gender 0.1241797
Engineer Engineer 0.0149394
boost.pred <- predict(mod.boost, testdata,n.trees =5000, type="response")
table(y_pred,testdata$Transport)
y_pred 0 1
0 72 2
1 1 52
confusionMatrix(y_pred,testdata$Transport)
Reference
Prediction 0 1
0 72 2
1 1 52
Accuracy : 0.9764
95% CI : (0.9325, 0.9951)
No Information Rate : 0.5748
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9516
Sensitivity : 0.9863
Specificity : 0.9630
Pos Pred Value : 0.9730
Neg Pred Value : 0.9811
Prevalence : 0.5748
Detection Rate : 0.5669
Detection Prevalence : 0.5827
Balanced Accuracy : 0.9746
'Positive' Class : 0
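With type="response", the boosting predictions are probabilities, so they must be turned into class labels before a confusion matrix can be built. The step that creates y_pred is not shown above. The sketch below assumes a cutoff of 0.5, which is a common default and not necessarily the one used in the analysis; the probabilities are made up.

```r
# Hypothetical predicted probabilities of class 1 (a stand-in for boost.pred)
probs <- c(0.02, 0.91, 0.48, 0.65, 0.10)

# Assumed cutoff of 0.5; the cutoff used in the original analysis is not shown
y_hat <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0, 1))
print(table(y_hat))
```

Fixing the factor levels to 0 and 1 keeps the prediction comparable with the reference labels even if one class is never predicted.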
Model Performance –
library(ROCR)
plot(perf.lg)
#Kolmogorov Smirnov -
KS <- max(attr(perf.lg, 'y.values')[[1]]-attr(perf.lg, 'x.values')[[1]])
KS
[1] 0.9170472
auc
[1] 0.992136
# Gini Coefficient -
library(ineq)
gini
[1] 0.5804087
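The KS statistic and the AUC can both be derived from the ROC curve, and a Gini coefficient is often defined from the AUC as 2 * AUC - 1. The sketch below computes all three in base R on toy scores and labels (not the project's predictions); the way the toy scores are generated is an assumption for illustration.

```r
# Toy scores and labels
set.seed(7)
labels <- rbinom(200, 1, 0.4)
scores <- plogis(2 * labels - 1 + rnorm(200))   # higher on average for class 1

# ROC points: sweep the cutoff down through the sorted scores
ord <- order(scores, decreasing = TRUE)
tpr <- c(0, cumsum(labels[ord] == 1) / sum(labels == 1))
fpr <- c(0, cumsum(labels[ord] == 0) / sum(labels == 0))

ks   <- max(tpr - fpr)                                        # KS statistic
auc  <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)  # trapezoidal rule
gini <- 2 * auc - 1                                           # Gini from AUC
print(round(c(KS = ks, AUC = auc, Gini = gini), 3))
```

With the reported AUC of 0.992136, this relation would give 2 * 0.992136 - 1 = 0.984272. The gini value of 0.5804087 printed above was obtained with the ineq package, which measures inequality within a vector of values, so it is likely a different quantity from the AUC-based Gini.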
Model Comparison –
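Boosting achieves the highest test accuracy (0.9764), ahead of KNN and Bagging (0.9449 each) and Naive Bayes (0.9291).
It also has the highest sensitivity (0.9863, versus 0.9589 for Naive Bayes computed from its confusion matrix, 0.9452 for KNN and 0.9315 for Bagging); note that the positive class in these reports is 0 (Public Transport). Boosting is therefore the preferred model. Its relative-influence summary ranks Age (84.1%), Salary (9.0%) and Distance (5.3%) as the most influential predictors.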