Machine Learning Project
Table of Contents
Problem 1 Executive Summary
Introduction
Data Description
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check
Sample
Shape
Data Types
Null Value Check
Summary Stats of Numerical Columns
Summary Stats of Categorical Columns
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Checking for Duplicates and its treatment
Univariate Analysis
Bivariate Analysis
Outlier Check
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model: Compare the models and write inference which model is best/optimized
1.8 Based on these predictions, what are the insights?
Problem 2 Introduction
2.1 Find the number of characters, words, and sentences for the mentioned documents
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words
2.4 Plot the word cloud of each of the speeches of the variable.
List of Figures
Problem 1
Fig. 1 - age Histplot & Boxplot
Fig. 2 - economic.cond.national Histplot & Boxplot
Fig. 3 - economic.cond.household Histplot & Boxplot
Fig. 4 - Blair Histplot & Boxplot
Fig. 5 - Hague Histplot & Boxplot
Fig. 6 - Europe Histplot & Boxplot
Fig. 7 - political.knowledge Histplot & Boxplot
Fig. 8 - vote Countplot
Fig. 9 - gender Countplot
Fig. 10 - vote v/s age Stripplot
Fig. 11 - Correlation Heatmap
Fig. 12 - Numerical Columns with Outliers
Fig. 13 - MCE v/s K-Neighbours Plot
Problem 2
Fig. 30 - WordCloud of President Franklin D. Roosevelt’s Speech 1941
Fig. 31 - WordCloud of President John F. Kennedy’s Speech 1961
Fig. 32 - WordCloud of President Richard Nixon’s Speech 1973
List of Tables
Problem 1
Table 1: Dataset Sample
Table 2: Data type table
Table 3: Null value check table
Table 4: Summary of Numerical Columns
Table 5: Summary of Categorical Columns
Table 6: Duplicates table
Table 7: Sample dataset after Encoding
Table 8: Five-Point Summary Before Scaling
Table 9: Five-Point Summary After Scaling
Table 10: Classification Report of Train Data - Logistic Regression
Table 11: Classification Report of Test Data - Logistic Regression
Table 12: Classification Report of Train Data - LDA
Table 13: Classification Report of Test Data - LDA
Table 14: Classification Report of Train Data - KNN
Table 15: Classification Report of Test Data - KNN
Table 16: Classification Report of Train Data - Naïve Bayes
Table 17: Classification Report of Test Data - Naïve Bayes
Table 18: Classification Report of Train Data - Bagging (RF)
Table 19: Classification Report of Test Data - Bagging (RF)
Table 20: Classification Report of Train Data - Ada Boost
Table 21: Classification Report of Test Data - Ada Boost
Table 22: Classification Report of Train Data - Gradient Boosting
Table 23: Classification Report of Test Data - Gradient Boosting
Table 24: Comparison Summary of Logistic Regression, LDA & KNN models
Table 25: Comparison Summary of Naïve Bayes, Bagging, Ada Boost & Gradient Boosting models
Table 26: Comparison Summary of LDA & Naïve Bayes after SMOTE
Table 27: Comparing performances before and after SMOTE
Datasets Used
Executive Summary
You are hired by CNBE, one of the leading news channels, which wants to analyse recent elections. The
election dataset contains a survey that was conducted on 1525 voters with 9 variables.
Introduction
The purpose of this exercise is to build a model that predicts which party a voter will vote for on the
basis of the given information, in order to create an exit poll that helps predict the overall winner and
the seats won by each party.
Data Description
System measures used:
Vote: Party choice: Conservative or Labour
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores
represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
gender: female or male.
1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.
Data Types:
Of the 10 columns, 8 are of integer type and the remaining 2 are of object data type.
As seen in the above table, there are no null values in the dataset.
Let us also test whether any categorical attribute contains a “?”, since a “?” or a blank “ ” sometimes
stands in for a missing value. As seen from the code output in the Jupyter file, no "?" or " " is present
in the dataset.
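A minimal sketch of these two checks is given below. The small data frame is a hypothetical stand-in for the survey data, and the names `df`, `cat_cols`, `null_counts` and `placeholder_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the survey data (the real file is not reproduced here).
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

# Null values per column.
null_counts = df.isnull().sum()

# Placeholder strings ("?" or a single blank) per categorical column.
cat_cols = ["vote", "gender"]
placeholder_counts = df[cat_cols].isin(["?", " "]).sum()

print(null_counts)
print(placeholder_counts)
```

Both results are per-column counts, so a column with hidden missing values would show up as a non-zero entry.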
From the above summary, we can infer that the average age of the voters is around 54 years, with a minimum
age of 24 years and a maximum of 93 years. There is also a large gap between the 75th percentile and the
maximum of the age column, which means that the age feature is slightly skewed to the right and does not
follow a normal distribution.
All features other than age appear roughly normally distributed, as the gap between their 75th percentile
and maximum values is small.
From the above summary, we can infer that the Labour party received the most votes and that there are
slightly more female voters than male voters.
There are 8 duplicates in the dataset, so let us remove them first and then do the analysis.
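The duplicate check and removal can be sketched as follows, again on a hypothetical toy frame (the names `df`, `n_duplicates` and `df_clean` are illustrative).

```python
import pandas as pd

# Hypothetical rows; the second row repeats the first.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 43, 36, 35],
    "gender": ["female", "female", "male", "male"],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, df_clean.shape)
```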
Univariate Analysis
The age feature is slightly right skewed, so the data is not normally distributed, and most voters fall
within the 40 to 80 age group.
This feature (economic.cond.national) has one outlier at the lower end. It does not show any distribution
because it is a categorical feature coded as ordinal numbers from 1 to 5, as mentioned in the data dictionary.
This feature (economic.cond.household) also has one outlier at the lower end. It does not show any distribution
because it is a categorical feature coded as ordinal numbers from 1 to 5, as mentioned in the data dictionary.
This feature (Blair) has no outliers and does not show any distribution because it is a categorical feature
coded as ordinal numbers from 1 to 5, as mentioned in the data dictionary. Most of the Labour party
leader's assessment grades lie between 2 and 4.
This feature (Hague) also has no outliers and does not show any distribution because it is a categorical
feature coded from 1 to 5, as mentioned in the data dictionary.
This feature (Europe) is an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment. As seen in the above plots, more of the
data falls in the 6 to 10 range, which indicates Eurosceptic sentiment.
From the above plots, it can be inferred that most respondents’ knowledge of parties' positions on
European integration is quite low.
Bivariate Analysis
From the above plot, we can infer that voters aged 85 and above have voted for the Conservative party.
Correlation Heatmap
There is hardly any correlation between any of the columns in this dataset.
As the columns are essentially uncorrelated, a multivariate analysis using scatterplots would add little
for this dataset.
Outlier Check
Only 2 outliers can be seen, in the economic.cond.household and economic.cond.national columns. We
decided not to treat them because these columns hold ordinal values, outlier treatment applies only to
continuous columns, and the number of outliers in the dataset is very small.
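The report does not state how the outliers were flagged. A minimal sketch, assuming the usual boxplot rule of 1.5 times the interquartile range, is shown below; the ratings are hypothetical values on the 1 to 5 scale.

```python
import pandas as pd

# Hypothetical ratings on a 1-to-5 ordinal scale.
ratings = pd.Series([3, 3, 4, 3, 4, 3, 1, 3, 4, 3], name="economic.cond.national")

# Boxplot whiskers: 1.5 * IQR beyond the first and third quartiles (assumed rule).
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the whiskers are the points a boxplot draws as outliers.
outliers = ratings[(ratings < lower) | (ratings > upper)]
print(outliers.tolist())
```

On a tightly clustered ordinal column, a single low rating is enough to fall outside the whiskers, which is why such points are left untreated here.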
1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30).
We will encode only the “gender” column, using get_dummies, as all other categorical columns hold ordinal
values of integer data type and need no further encoding.
The target column “vote” has two equally important classes but is of object data type, so it must be
converted to an integer data type before fitting the models. We will use Label Encoder for this, which
encodes the Labour party as 1 and the Conservative party as 0. Labour, which received the most votes,
is thus the class tagged as 1.
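The two encoding steps can be sketched as below on hypothetical rows. The `drop_first=True` and `dtype=int` arguments are assumptions made for the sketch, not settings confirmed by the report. LabelEncoder sorts class labels alphabetically, which is why Conservative maps to 0 and Labour to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for the survey data.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

# One-hot encode gender; drop_first leaves a single 0/1 column (gender_male).
df = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)

# Label-encode the target: classes are sorted, so Conservative -> 0, Labour -> 1.
encoder = LabelEncoder()
df["vote"] = encoder.fit_transform(df["vote"])

print(df)
print(list(encoder.classes_))
```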
Let us check whether scaling is necessary for this dataset by analysing the mean, standard deviation
and variance of all numerical features:
We can observe that only the age and Europe features require scaling, as their mean, standard deviation
and variance are not on the same scale as those of the other features. Although Europe is an ordinal
11-point scale, it needs to be scaled because its range differs from that of the other columns. Scaling is
necessary for distance-based models such as KNN. It also helps stabilise the accuracy of a
model and makes training faster. Without scaling, an algorithm may be biased toward the feature
with the largest magnitude.
So we will use the MinMaxScaler from the sklearn library to scale the age and Europe columns and re-check
the five-point summary after scaling, as below:
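A minimal sketch of this scaling step follows. The values are hypothetical, and only the two named columns are passed to the scaler so that the other ordinal columns stay untouched.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values; Blair stands in for the columns that are left unscaled.
df = pd.DataFrame({
    "age": [24, 35, 54, 93],
    "Europe": [1, 6, 11, 8],
    "Blair": [4, 2, 5, 1],
})

cols_to_scale = ["age", "Europe"]

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])

# Five-point style summary after scaling.
print(df.describe())
```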
The data now looks scaled. Let us proceed to split it into train and test sets in a 70:30 ratio. Holding
out a test set lets us evaluate the model on data it has not seen, which helps in detecting overfitting
and in understanding how the model generalises.
To split the data, we first need to separate the target and predictor variables into
two data frames, X and y, where X contains all predictor variables and y contains the
target variable, which is “vote” in this dataset.
We will split the data 70:30 using train_test_split from the sklearn library.
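The separation into X and y and the 70:30 split can be sketched as below. The data is hypothetical, and `random_state=1` and `stratify=y` are assumptions added for reproducibility and to keep the class ratio similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded and scaled data.
df = pd.DataFrame({
    "vote": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "age": [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8, 0.0],
    "Blair": [4, 2, 5, 4, 1, 4, 5, 2, 4, 3],
})

X = df.drop(columns="vote")   # predictors
y = df["vote"]                # target

# 70:30 split; stratify and random_state are assumptions of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```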
Logistic Regression
Now we will apply Logistic Regression on the train set and perform the predictions on the test set. Here I
use Grid Search CV to find the best hyperparameters for building the Logistic Regression
model on the train data set. The best parameters found are solver=’newton-cg’ instead
of the default solver=’lbfgs’, max_iter=10000 instead of the default value of 100, penalty=’l2’, C=1.0,
class_weight=’dict’ and n_jobs=2, where n_jobs is the number of CPU cores used when parallelizing over
classes. These settings are chosen to achieve better accuracy when predicting on the test set.
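A hedged sketch of such a grid search is shown below. It runs on synthetic data from `make_classification`, and the parameter grid is a small illustrative one rather than the grid actually searched in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, mildly imbalanced stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Illustrative grid (an assumption); the default l2 penalty is left in place.
param_grid = {"solver": ["newton-cg", "lbfgs"], "C": [0.1, 1.0, 10.0]}

grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_train, y_train), grid.score(X_test, y_test))
```

After fitting, `grid` refits the best combination on the whole training set, so it can be scored on the test set directly.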
Inferences:
Precision (76%) – Of all the voters predicted to vote for the Conservative party, 76% actually voted
for it.
Recall (72%) – Of all the voters who actually voted for the Conservative party, 72% have been
predicted correctly.
Precision (86%) – Of all the voters predicted to vote for the Labour party, 86% actually voted
for it.
Recall (88%) – Of all the voters who actually voted for the Labour party, 88% have been predicted
correctly.
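To make the two definitions concrete, the sketch below computes precision and recall from confusion-matrix counts. The counts are hypothetical, chosen only so that the results round to the Conservative-class figures quoted above; they are not taken from the report's confusion matrix.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return precision, recall


# Hypothetical counts for the Conservative class (not the report's actual values).
tp, fp, fn = 72, 23, 28
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```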
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. However, recall has reduced for Class 1 on the test data,
which is likely due to class imbalance in the data.
Now we will apply LDA on the train set and perform the predictions on the test set. Here too I use Grid
Search CV to find the best hyperparameters for building the LDA model on the train data set.
The best parameters found are solver=’lsqr’ instead of the default solver=’svd’ and
shrinkage=’auto’ instead of the default value of None, chosen to achieve better accuracy when predicting
on the test set.
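A minimal sketch of an LDA model with the reported settings is given below, fitted on synthetic data rather than the voter data. Shrinkage is only supported by the 'lsqr' and 'eigen' solvers, which is why the solver has to change along with it.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Reported settings: least-squares solver with automatic (Ledoit-Wolf) shrinkage.
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X_train, y_train)

train_acc = lda.score(X_train, y_train)
test_acc = lda.score(X_test, y_test)
print(train_acc, test_acc)
```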
Inferences:
Precision (76%) – Of all the voters predicted to vote for the Conservative party, 76% actually voted
for it.
Recall (74%) – Of all the voters who actually voted for the Conservative party, 74% have been
predicted correctly.
Precision (87%) – Of all the voters predicted to vote for the Labour party, 87% actually voted
for it.
Recall (88%) – Of all the voters who actually voted for the Labour party, 88% have been predicted
correctly.
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. However, recall has reduced for Class 1 on the test data,
which is likely due to class imbalance in the data.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
We will apply the KNeighborsClassifier on the training set and evaluate the model's performance on the test set.
First we apply it with the default value, n_neighbors=5, and get the scores below:
Now let us run KNN with the number of neighbours set to K=1,3,5,...,19 and find the optimal K using the
misclassification error.
Note: Misclassification error (MCE) = 1 - test accuracy score. We calculate the MCE for each model with
neighbours = 1,3,5,...,19 and pick the model with the lowest MCE.
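The search described in the note can be sketched as follows, on synthetic data. The names `mce` and `best_k` are illustrative, and the best K found here says nothing about the K=9 reported for the voter data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# MCE = 1 - test accuracy, computed for each odd K from 1 to 19.
mce = {}
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    mce[k] = 1 - knn.score(X_test, y_test)

# K with the lowest misclassification error (first one on ties).
best_k = min(mce, key=mce.get)
print(best_k, round(mce[best_k], 3))
```

Plotting `mce` against K gives the kind of curve shown in the MCE v/s K-Neighbours figure.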
As seen in the above graph, K=9 gives the lowest MCE of approx. 0.16, so we will build the model with K=9
and check its performance.
As the difference between train and test accuracies (1.4%) is less than 10%, it is a valid model.
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. However, recall has reduced for Class 1 on the test data,
which is likely due to class imbalance in the data. Because of this imbalance, accuracy alone will not be
used to evaluate the model. Our main goal here is to reduce the Type 2 error, i.e. false negatives.
When calculating the likelihoods of numerical features, the Naïve Bayes algorithm assumes that each feature is
normally distributed and computes probabilities using only the mean and variance of that feature. It also
assumes that all the predictors are independent of each other. Hence, there are no hyperparameters as such
that can be used to optimise this model.
We will apply the GaussianNB classifier on the train set and check the predictions on the test set:
Let us check the model's validity by analysing cross-validation scores on the train and test
sets.
After 10-fold cross-validation, the scores on the train and test data sets are almost the same across all
10 folds. Hence our model is valid.
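A minimal sketch of this validity check follows, using synthetic data. `cross_val_score` with `cv=10` returns one accuracy score per fold, so stable scores across folds are what the paragraph above refers to.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit on the train set and score on the held-out test set.
nb = GaussianNB().fit(X_train, y_train)
test_acc = nb.score(X_test, y_test)

# 10-fold cross-validation scores on the train and test sets separately.
train_cv = cross_val_score(GaussianNB(), X_train, y_train, cv=10)
test_cv = cross_val_score(GaussianNB(), X_test, y_test, cv=10)
print(train_cv.round(2), test_cv.round(2))
```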
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. However, recall has reduced for Class 1 on the test data,
which is likely due to class imbalance in the data.
Model tuning (hyperparameters) has already been covered above for the Logistic Regression, LDA, KNN and
Naïve Bayes models.
We create a Bagging model on the train data with a Random Forest classifier as the base estimator,
n_estimators=100 and random_state=1 as the hyperparameters, and check its performance on the test dataset.
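The Bagging setup can be sketched as below on synthetic data. The inner forest is limited to 10 trees purely to keep the sketch fast; that value is an assumption and not a setting from the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Base estimator: a small Random Forest (10 trees is an assumption, for speed).
base = RandomForestClassifier(n_estimators=10, random_state=1)

# 100 bagged copies of the base estimator, each fitted on a bootstrap sample.
bagging = BaggingClassifier(base, n_estimators=100, random_state=1)
bagging.fit(X_train, y_train)

train_acc = bagging.score(X_train, y_train)
test_acc = bagging.score(X_test, y_test)
print(train_acc, test_acc)
```

Comparing `train_acc` with `test_acc` is the check used in the report to judge whether the ensemble overfits.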
Accuracy and precision for the test data are not in line with those for the training data, which indicates
that some overfitting has happened. Recall has also reduced for Class 1 on the test data, which is likely
due to class imbalance in the data.
Ada Boost
We create the Ada Boost model using AdaBoostClassifier from the sklearn ensemble library, with the parameters
suggested by Grid Search CV: learning_rate=1 and n_estimators=10. We get the accuracy scores below:
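A minimal sketch of an Ada Boost model with the reported parameter values is given below, fitted on synthetic data. `random_state=1` is an assumption added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Reported values: 10 boosting rounds with a learning rate of 1.
ada = AdaBoostClassifier(n_estimators=10, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)

train_acc = ada.score(X_train, y_train)
test_acc = ada.score(X_test, y_test)
print(train_acc, test_acc)
```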
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. However, recall has reduced for Class 1 on the test data,
which is likely due to class imbalance in the data.
Gradient Boosting
We create the Gradient Boosting model using GradientBoostingClassifier from the sklearn ensemble library, with
the parameters suggested by Grid Search CV: learning_rate=0.5 and n_estimators=12. We get the
accuracy scores below:
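The same pattern applies to Gradient Boosting. The sketch below uses the reported parameter values on synthetic data, with `random_state=1` as an added assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Reported values: 12 boosting stages with a learning rate of 0.5.
gb = GradientBoostingClassifier(learning_rate=0.5, n_estimators=12, random_state=1)
gb.fit(X_train, y_train)

train_acc = gb.score(X_train, y_train)
test_acc = gb.score(X_test, y_test)
print(train_acc, test_acc)
```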
Accuracy and precision for the test data are not in line with those for the training data, which indicates
that some overfitting has happened. However, the difference between the training and test set scores is
within industry standards, so it can be accepted as a valid model. Recall has also reduced for Class 1 on
the test data, which is likely due to class imbalance in the data.
1. Logistic Regression
Fig 15 ROC_AUC score and ROC curve of Train and Test Sets
2. LDA
Fig 17 ROC_AUC score and ROC curve of Train and Test Sets
3. KNN Model
Fig 19 ROC_AUC score and ROC curve of Train and Test Sets
4. Naïve Bayes
Fig 21 ROC_AUC score and ROC curve of Train and Test Sets
5. Bagging (Random Forest)
Fig 23 ROC_AUC score and ROC curve of Train and Test Sets
6. Ada Boost
Fig 25 ROC_AUC score and ROC curve of Train and Test Sets
7. Gradient Boosting
Fig 27 ROC_AUC score and ROC curve of Train and Test Sets
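For reference, a ROC_AUC score and the points of a ROC curve can be obtained as sketched below. The sketch uses a Logistic Regression on synthetic data as a stand-in; the same calls apply to any of the seven models that expose `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voter data.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)

# Probability of the positive class (Class 1) for each test row.
proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, proba)
fpr, tpr, thresholds = roc_curve(y_test, proba)  # points to plot as tpr against fpr
print(round(auc, 3))
```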
Let us compare the performance metrics of the above seven models and find the best model
among them:
Table 25 Comparison Summary of Naïve Bayes, Bagging , Ada Boost & Gradient Boosting models
For the Logistic Regression, LDA, KNN, Naïve Bayes and Ada Boost models, the accuracy and Class 1 precision on
the training data are almost in line with those on the testing data, which indicates that no overfitting or
underfitting has happened.
The ROC_AUC scores of the Logistic Regression, LDA, Naïve Bayes and Ada Boost models are likewise almost in
line between training and testing data.
The recall scores of LDA, KNN and Naïve Bayes are almost in line between training and testing data.
So, overall, we can infer that LDA and Naïve Bayes are the most optimised of the above
models. But as there is class imbalance in the data, we will apply SMOTE to these two models
to check whether their performance improves.
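The report does not say which SMOTE implementation was used; in practice the `SMOTE` class from the imbalanced-learn package is the usual choice. To show the idea without that dependency, the sketch below is a simplified, hand-written version of SMOTE's core step in NumPy: each synthetic row is an interpolation between a minority-class row and one of its nearest minority-class neighbours. The function name `smote_sketch` and the random data are hypothetical.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 1) -> np.ndarray:
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # k nearest neighbours of each row; column 0 is the row itself (rows assumed distinct).
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    # Pick a random base row and one of its neighbours for every synthetic row.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new row at a random point on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority-class rows (20 rows, 3 features).
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 3))
X_synthetic = smote_sketch(X_minority, n_new=30)
print(X_synthetic.shape)
```

Because every synthetic row lies between two real minority rows, oversampling of this kind should be applied to the training set only, so that the test set still contains real observations.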
We can conclude that the Naïve Bayes model performs slightly better than LDA after SMOTE. Its accuracy
has remained constant and its ROC_AUC score has reduced, but its recall and precision scores have improved,
though only for Class 1.
So we can infer that the models do not improve much after applying SMOTE. Hence the Naïve Bayes model
before SMOTE is the best and most optimised model of all.
Fig 29 ROC_AUC score and ROC curve of Train and Test Sets
Precision (74%) – Of all the voters predicted to vote for the Conservative party, 74% actually voted
for it.
Recall (73%) – Of all the voters who actually voted for the Conservative party, 73% have been
predicted correctly.
Precision (87%) – Of all the voters predicted to vote for the Labour party, 87% actually voted
for it.
Recall (87%) – Of all the voters who actually voted for the Labour party, 87% have been predicted
correctly.
Overall accuracy of the model – 82% of total predictions are correct, and the AUC score is also quite good,
which means the model distinguishes well between the two classes.
Accuracy and precision for the test data are almost in line with those for the training data, which suggests
that no overfitting or underfitting has happened. So overall it is a good and optimised model.
Encoded the categorical variable, split the data into train and test sets (70:30) and scaled the columns
which were on different scales.
Created different models by tuning their hyperparameters using Grid Search CV, cross-validation and
misclassification error.
Analysed and compared performance metrics such as accuracy, precision, recall, ROC_AUC scores
and the confusion matrix for all the models to find the best and most optimised model.
Applied SMOTE, which generates synthetic samples for the minority class, on the two best models to correct
class imbalance, and compared the performance metrics of the models before and after applying
SMOTE.
Chose the final model based on the above analysis and commented on that model’s performance metrics.
Based on the above predictions of the final Naïve Bayes model, the following business insights can be drawn:
Voters will mostly vote for the Labour party, whose chances of winning the election are quite high
compared to the Conservative party.
The exit poll indicates that the Labour party will get more votes, with 82% of the model's total predictions
being accurate.
END OF PROBLEM 1
Introduction
In this project, we work on the inaugural corpus from nltk in Python. We will be looking
at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the
mentioned documents.
After importing all 3 speeches from the nltk library, let us check the number of characters, words and
sentences in each of them:
Before removing stopwords, let us pre-process and clean the text of each of the three speeches
as per the steps below:
1. Speech of President Franklin D. Roosevelt in 1941: Checking the word count before and after
removal of stopwords in this speech text and displaying a sample sentence after removal of stopwords.
2. Speech of President John F. Kennedy in 1961: Checking the word count before and after removal of
stopwords in this speech text and displaying a sample sentence after removal of stopwords.
3. Speech of President Richard Nixon in 1973: Checking the word count before and after removal of
stopwords in this speech text and displaying a sample sentence after removal of stopwords.
Note: The word count in question 1 differs from the word count in question 2 before removal of stopwords,
because .words() also counts punctuation tokens, whereas the cleaned text contains only words.
2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
We can find the words that occur most often using the nltk.FreqDist() function.
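The counting step can be sketched without downloading any corpus by using `collections.Counter`, which `nltk.FreqDist` subclasses, so `most_common` behaves the same way. The tokens and the tiny stop list below are hypothetical; they are not text from the speeches and not nltk's English stopword list.

```python
from collections import Counter

# Hypothetical cleaned tokens (lower-cased, punctuation already removed).
tokens = [
    "let", "us", "begin", "let", "us", "explore", "the", "stars", "and",
    "let", "us", "begin", "the", "work", "of", "peace", "let",
]

# Tiny illustrative stop list (an assumption, not nltk's list).
stop_words = {"the", "and", "of"}

# Drop stopwords, then count what is left.
filtered = [t for t in tokens if t not in stop_words]
top_three = Counter(filtered).most_common(3)
print(top_three)
```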
1. Speech of President Franklin D. Roosevelt in 1941: In this speech, the following three words occur
most often:
Nation: 17 times
Know: 10 times
Peopl: 9 times
2. Speech of President John F. Kennedy in 1961: In this speech, the following three words occur
most often:
Let: 16 times
Us: 12 times
Power: 9 times
3. Speech of President Richard Nixon in 1973: In this speech, the following three words occur
most often:
Us: 26 times
Let: 22 times
America: 21 times
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)
Now we will plot a word cloud of the most used words in each of the three speeches, using WordCloud from
the wordcloud library and displaying it with matplotlib.
END OF PROBLEM 2