Random Forest Reference Code
Random Forest
#Random Forest model
library(randomForest)  # provides randomForest()
modelrf <- randomForest(as.factor(left) ~ ., data = trainSplit, do.trace = TRUE)
modelrf
The random forest output tells us that the model built 500 trees and sampled 3 candidate variables at each split. The out-of-bag (OOB) estimate of the generalization error is the error rate of the out-of-bag classifier on the training set. The OOB estimate is as accurate as using a test set of the same size as the training set, so the out-of-bag error estimate removes the need for a set-aside test set.
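These quantities can be read directly off the fitted object; a small sketch, using the modelrf object from above:

# Trees built, candidate variables per split, and the final OOB error
modelrf$ntree                                    # number of trees (500)
modelrf$mtry                                     # variables tried at each split (3)
modelrf$err.rate[nrow(modelrf$err.rate), "OOB"]  # OOB error of the full forest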
Random Forest
#Checking variable importance in Random Forest
importance(modelrf)   # importance scores per variable
varImpPlot(modelrf)   # dot plot of variable importance
Random Forest
# Prediction and Model Evaluation using Confusion Matrix
library(caret)  # provides confusionMatrix()
predrf_tr <- predict(modelrf, trainSplit)    # train data
predrf_test <- predict(modelrf, testSplit)   # test data
confusionMatrix(predrf_tr, as.factor(trainSplit$left))    # train performance
confusionMatrix(predrf_test, as.factor(testSplit$left))   # test performance
As we observe, the model shows similar performance on the train and test data, which gives us confidence in the stability of our Random Forest model.
Comparing ROC curves for Decision Tree and Random Forest
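The comparison itself was shown as a plot; a minimal sketch of how such a plot can be drawn with pROC (listed in the appendix), where modeldt is an assumed name for the decision-tree model built earlier:

library(pROC)
# Class-1 probabilities from each model on the test data
probrf <- predict(modelrf, testSplit, type = "prob")[, 2]
probdt <- predict(modeldt, testSplit, type = "prob")[, 2]  # modeldt: assumed name
rocrf <- roc(testSplit$left, probrf)
rocdt <- roc(testSplit$left, probdt)
plot(rocrf, col = "blue", main = "ROC: Decision Tree vs Random Forest")
lines(rocdt, col = "red")
auc(rocrf); auc(rocdt)  # compare areas under the curves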
Classification Model
Naïve Bayes
Naïve Bayes
#Naive Bayes
library(e1071)  # provides naiveBayes()
modelnb <- naiveBayes(as.factor(left) ~ ., data = trainSplit)
modelnb
The output shows the a-priori probabilities of the classes, followed by the conditional probabilities of each variable given the class.
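These components can also be pulled from the fitted object; a small sketch, using the modelnb object from above:

modelnb$apriori  # class counts that define the a-priori probabilities
modelnb$tables   # conditional means/probabilities of each variable given the class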
Naïve Bayes
#Performance of Naïve Bayes using Confusion Matrix
prednb_tr <- predict(modelnb, trainSplit)    # train data
prednb_test <- predict(modelnb, testSplit)   # test data
confusionMatrix(prednb_tr, as.factor(trainSplit$left))    # train performance
confusionMatrix(prednb_test, as.factor(testSplit$left))   # test performance
As we observe, the model shows similar performance on the train and test data, which gives us confidence in the stability of our Naïve Bayes model.
Classification Model
kNN Algorithm
kNN Algorithm
#Data Preparation for kNN Algorithm
library(dummies)
#Creating dummy variables for the factor columns
dummy_df <- dummy.data.frame(hr_data1[, c('role_code', 'salary.code')])
hr_data2 <- cbind.data.frame(hr_data1, dummy_df)
kNN Algorithm
#Data Preparation for kNN Algorithm
#Check the scaled dataset (kNN is distance-based, so features are scaled first)
str(hr_data2_scaled)
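The step that produces hr_data2_scaled is not shown here; a minimal sketch, assuming the numeric columns of hr_data2 are standardised with base R's scale():

# Assumed scaling step: standardise the numeric columns so that no
# single feature dominates the kNN distance calculation
num_cols <- sapply(hr_data2, is.numeric)
hr_data2_scaled <- hr_data2
hr_data2_scaled[num_cols] <- scale(hr_data2[num_cols])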
kNN Algorithm
#Applying kNN Algorithm on the dataset
library(class)    # provides knn()
library(gmodels)  # provides CrossTable()
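The knn() call itself does not appear on the slide; a sketch under assumed names, where hr_train and hr_test are the train/test splits of the scaled data (hr_train also appears on a later slide) and train_labels / test_labels hold the corresponding values of left:

# k nearest neighbours with k = 5; the split and label objects are
# assumptions for illustration
prednn <- knn(train = hr_train, test = hr_test, cl = train_labels, k = 5)
CrossTable(x = test_labels, y = prednn, prop.chisq = FALSE)  # confusion table
mean(prednn == test_labels)  # overall accuracy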
Accuracy = (TP + TN) / Total = (3311 + 1030) / 4499 = 96.48%
kNN Algorithm
#Applying kNN Algorithm on the dataset

  k     Accuracy
  5     94.46%
 10     94.17%
 50     90.19%
100     86.48%
122     85.06%
From the accuracy table above, we can observe that the accuracy decreases as the value of k increases.
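A table like this can be produced with a simple loop; a sketch, reusing the assumed objects from the kNN sketch above:

# Accuracy for a range of k values
for (k in c(5, 10, 50, 100, 122)) {
  pred <- knn(train = hr_train, test = hr_test, cl = train_labels, k = k)
  cat("k =", k, ": accuracy =", round(100 * mean(pred == test_labels), 2), "%\n")
}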
kNN Algorithm
# A common rule of thumb for choosing k in kNN is sqrt(n)/2
k <- sqrt(nrow(hr_train)) / 2
k
# 51.2347, which can be rounded to 51
Step 6
Model Summarization
Summary of Model Performance
Model                         Accuracy
Decision Tree                 97.09%
Random Forest                 99%
Naïve Bayes                   78.84%
kNN Algorithm (using k = 7)   96.84%
Appendix
Packages used for the Classification Analysis:
• data.table    # data manipulation
• reshape2      # data reshaping
• randomForest  # random forest models
• party         # decision trees (ctree)
• rpart         # decision trees (rpart)
• rpart.plot    # plotting rpart trees
• lattice       # data visualization
• caret         # data pre-processing and confusion matrix
• pROC          # ROC curves
• corrplot      # correlation plots
• e1071         # Naïve Bayes (naiveBayes())
• RColorBrewer  # color palettes
• dummies       # dummy variables
• class         # kNN (knn())
• gmodels       # cross tables (CrossTable())
Thank You.