Machine Learning Project

This document discusses machine learning and focuses on how it uses data and algorithms to gradually improve accuracy, similar to how humans learn. Machine learning is a branch of artificial intelligence that analyzes large amounts of data to identify patterns that can be used to make predictions. The goal is for machines to learn automatically through exposure to vast amounts of data without being explicitly programmed.

Uploaded by

Pranjal Singh


It is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

MACHINE LEARNING PROJECT

Created by Pranjal Singh

PGP-DSBA Online
05/03/2023

Table of Contents
Problem 1 Executive Summary
Introduction
Data Description
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check
Sample
Shape
Data Types
Null Value Check
Summary Stats of Numerical Columns
Summary Stats of Categorical Columns
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers
Checking for Duplicates and its treatment
Univariate Analysis
Bivariate Analysis
Outlier Check
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test
1.4 Apply Logistic Regression and LDA (linear discriminant analysis)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized
1.8 Based on these predictions, what are the insights?

Problem 2 Introduction
2.1 Find the number of characters, words, and sentences for the mentioned documents
2.2 Remove all the stopwords from all three speeches
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words
2.4 Plot the word cloud of each of the speeches of the variable

List of Figures

Problem 1
Fig. 1 - age Histplot & Boxplot
Fig. 2 - economic.cond.national Histplot & Boxplot
Fig. 3 - economic.cond.household Histplot & Boxplot
Fig. 4 - Blair Histplot & Boxplot
Fig. 5 - Hague Histplot & Boxplot
Fig. 6 - Europe Histplot & Boxplot
Fig. 7 - political.knowledge Histplot & Boxplot
Fig. 8 - vote Countplot
Fig. 9 - gender Countplot
Fig. 10 - vote v/s age Stripplot
Fig. 11 - Correlation Heatmap
Fig. 12 - Numerical Columns with Outliers
Fig. 13 - MCE v/s K-Neighbours Plot
Fig. 14 - Confusion Matrix of Train and Test sets-Logistic Regression
Fig. 15 - ROC_AUC score and ROC curve of Train and Test Sets-Logistic Regression
Fig. 16 - Confusion Matrix of Train and Test sets-LDA
Fig. 17 - ROC_AUC score and ROC curve of Train and Test Sets-LDA
Fig. 18 - Confusion Matrix of Train and Test sets-KNN
Fig. 19 - ROC_AUC score and ROC curve of Train and Test Sets-KNN
Fig. 20 - Confusion Matrix of Train and Test sets-Naïve Bayes
Fig. 21 - ROC_AUC score and ROC curve of Train and Test Sets-Naïve Bayes
Fig. 22 - Confusion Matrix of Train and Test sets-Bagging(RF)
Fig. 23 - ROC_AUC score and ROC curve of Train and Test Sets-Bagging(RF)
Fig. 24 - Confusion Matrix of Train and Test sets-Ada Boost
Fig. 25 - ROC_AUC score and ROC curve of Train and Test Sets-Ada Boost
Fig. 26 - Confusion Matrix of Train and Test sets-Gradient Boosting
Fig. 27 - ROC_AUC score and ROC curve of Train and Test Sets-Gradient Boosting
Fig. 28 - Confusion Matrix of Train and Test sets-Final Model
Fig. 29 - ROC_AUC score and ROC curve of Train and Test Sets-Final Model

Problem 2
Fig. 30 - WordCloud of President Franklin D. Roosevelt’s Speech 1941
Fig. 31 - WordCloud of President John F. Kennedy’s Speech 1961
Fig. 32 - WordCloud of President Richard Nixon’s Speech 1973

List of Tables

Problem 1
Table 1 : Dataset Sample
Table 2 : Data type table
Table 3 : Null value check table
Table 4 : Summary of Numerical Columns
Table 5 : Summary of Categorical Columns
Table 6 : Duplicates table
Table 7 : Sample dataset after Encoding
Table 8 : Five-Point Summary Before Scaling
Table 9 : Five-Point Summary After Scaling
Table 10 : Classification Report of Train Data-Logistic Regression
Table 11 : Classification Report of Test Data-Logistic Regression
Table 12 : Classification Report of Train Data-LDA
Table 13 : Classification Report of Test Data-LDA
Table 14 : Classification Report of Train Data-KNN
Table 15 : Classification Report of Test Data-KNN
Table 16 : Classification Report of Train Data-Naïve Bayes
Table 17 : Classification Report of Test Data-Naïve Bayes
Table 18 : Classification Report of Train Data-Bagging(RF)
Table 19 : Classification Report of Test Data-Bagging(RF)
Table 20 : Classification Report of Train Data-Ada Boost
Table 21 : Classification Report of Test Data-Ada Boost
Table 22 : Classification Report of Train Data-Gradient Boosting
Table 23 : Classification Report of Test Data-Gradient Boosting
Table 24 : Comparison Summary of Logistic Regression, LDA & KNN models
Table 25 : Comparison Summary of Naïve Bayes, Bagging, Ada Boost & Gradient Boosting models
Table 26 : Comparison Summary of LDA & Naïve Bayes after SMOTE
Table 27 : Comparing performances before and after SMOTE
Table 28 : Classification Report of Train Data-Final Model
Table 29 : Classification Report of Test Data-Final Model

Datasets Used

Dataset for Problem 1: Election_Data.xlsx

Dataset for Problem 2: Inaugural corpus from nltk

Problem 1 Data Modelling

Executive Summary
You are hired by CNBE, one of the leading news channels, which wants to analyse recent elections. The
election dataset contains a survey that was conducted on 1525 voters with 9 variables.

Introduction
The purpose of this whole exercise is to build a model, to predict which party a voter will vote for on the
basis of the given information, to create an exit poll that will help in predicting overall win and seats
covered by a particular party.

Data Description
System measures used:
Vote: Party choice: Conservative or Labour
age: in years
economic.cond.national: Assessment of current national economic conditions, 1 to 5.
economic.cond.household: Assessment of current household economic conditions, 1 to 5.
Blair: Assessment of the Labour leader, 1 to 5.
Hague: Assessment of the Conservative leader, 1 to 5.
Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores
represent ‘Eurosceptic’ sentiment.
political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
gender: female or male.

1.1 Read the dataset. Do the descriptive statistics and do the null value
condition check. Write an inference on it.

Sample of the dataset:

Table 1 Dataset Sample

Shape of the dataset:

The data has 10 columns and 1525 rows.

Data Types:

Let us check the types of variables in the data frame.

Table 2 Data type table

Of the 10 columns, 8 are of integer type and the remaining 2 are of object data type.

Null Value Check:

Table 3 Null value check table

As seen in the above table, there are no null values in the dataset.

Also, we will drop the Unnamed column as it is insignificant for the analysis.

Let us also test whether any categorical attribute contains a “?”. At times a “?” or a blank “ ” appears in
place of a missing value. As seen from the code output in the Jupyter file, there are no "?" or " " entries
in the data set.
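The checks in this section can be sketched with pandas as below. This is only an illustration: the small frame is made-up stand-in data (the report reads Election_Data.xlsx), the column name "Unnamed: 0" and the helper name `basic_checks` are assumptions.

```python
import pandas as pd

# Made-up stand-in rows; the report loads Election_Data.xlsx instead.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],          # assumed name of the index-like column
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

def basic_checks(frame, categorical_cols):
    """Return the shape, per-column null counts and per-column counts of '?' or blank entries."""
    placeholders = {
        col: int(frame[col].astype(str).str.strip().isin(["?", ""]).sum())
        for col in categorical_cols
    }
    return frame.shape, frame.isnull().sum(), placeholders

df = df.drop(columns=["Unnamed: 0"])     # drop the insignificant column
shape, nulls, placeholders = basic_checks(df, ["vote", "gender"])
print(shape)
print(df.dtypes)
print(nulls)
print(placeholders)
```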

Summary stats of Numerical Columns:

Table 4 Summary of Numerical Columns

From the above summary, we can infer that the average age of the voters is around 54 years, the minimum
age is 24 years and the maximum age is 93 years. There is also a large gap between the 75th percentile
and the maximum of the age column, which suggests that the age feature is slightly skewed to the right
and does not follow a normal distribution.

For all other features, the gap between the 75th percentile and the maximum is small, so they show no
comparable right skew.

Summary stats of Categorical Columns:

Table 5 Summary of Categorical Columns

From the above summary, we can infer that the Labour party has received the larger number of votes and
that there are slightly more female voters than male voters.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.

Checking for duplicates :

Table 6 Duplicates table

There are 8 duplicates in the dataset, so let us remove them first and then do the analysis.

Univariate Analysis

Fig 1 age Histplot & Boxplot

The age feature is slightly right-skewed, so the data is not normally distributed, and most voters lie
within the 40 to 80 age group.

Fig 2 economic.cond.national Histplot & Boxplot

This feature has one outlier among the lower values. It does not follow a continuous distribution because
it is a categorical feature coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary.

Fig 3 economic.cond.household Histplot & Boxplot

This feature also has one outlier among the lower values. It does not follow a continuous distribution because
it is a categorical feature coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary.

Fig 4 Blair Histplot & Boxplot

This feature has no outliers. It does not follow a continuous distribution because it is a categorical feature
coded as ordinal numbers from 1 to 5, as already mentioned in the data dictionary. Most of the Labour party
leader assessment grades lie between 2 and 4.

Fig 5 Hague Histplot & Boxplot

This feature also has no outliers. It does not follow a continuous distribution because it is a categorical
feature coded from 1 to 5, as already mentioned in the data dictionary.

Fig 6 Europe Histplot & Boxplot

This feature represents an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment. As seen in the above plots, more of the
data falls in the 6 to 10 score range, which indicates Eurosceptic sentiment.

Fig 7 political.knowledge Histplot & Boxplot

From the above plots, it can be inferred that most respondents’ knowledge of the parties' positions on
European integration is quite low.

Fig 8 vote Countplot

Fig 9 gender Countplot

Bivariate Analysis

Fig 10 vote v/s age Stripplot

From the above plot we can infer that voters aged 85 and above have voted for the Conservative party.

Correlation Heatmap

Fig 11 Correlation Heatmap

There is hardly any correlation between the columns in this dataset.

As the columns are barely correlated, a multivariate analysis using scatterplots would add little for
this dataset.

Outlier Check

Fig 12 Numerical Columns with Outliers

Only 2 outliers can be seen, in the economic.cond.household and economic.cond.national columns. We
decide not to treat them because these columns hold ordinal values, outlier treatment is meant for
continuous columns, and there are very few outliers in the dataset.
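A boxplot usually flags points that fall more than 1.5 times the interquartile range beyond the quartiles. Assuming the plots above use that standard rule, the outlier count per column can be sketched as follows; the scores are made-up values and `iqr_outlier_counts` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_counts(frame):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((frame < lower) | (frame > upper)).sum()

# Made-up ordinal scores, not the survey data.
scores = pd.DataFrame({
    "economic.cond.national": [3, 3, 4, 3, 4, 3, 1, 4, 3, 4],
    "Blair": [4, 4, 2, 4, 2, 4, 2, 4, 4, 2],
})
counts = iqr_outlier_counts(scores)
print(counts)
```

In this toy frame only the single score of 1 in economic.cond.national falls below the lower fence.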

1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30).
We will only encode the “gender” column, using get_dummies, as all other categorical columns have ordinal
values and are already of integer data type, so there is no need to encode them again.

The target column “vote” has two equally important classes but is of object data type, so it needs to be
converted into integer data type before fitting the models. We will use Label Encoder for this. It assigns
codes to the class names in sorted order, so the Conservative party becomes 0 and the Labour party
becomes 1. The Labour party, which received the larger number of votes, is therefore tagged as 1.
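A minimal sketch of this encoding step, on made-up rows rather than the survey data, is shown below. The use of `drop_first=True` is an assumption; it yields a single gender_male column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows standing in for the survey data.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

# One-hot encode gender; drop_first keeps only the gender_male indicator.
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# LabelEncoder sorts the class names, so Conservative -> 0 and Labour -> 1.
encoder = LabelEncoder()
df["vote"] = encoder.fit_transform(df["vote"])

print(list(encoder.classes_))
print(df)
```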

Table 7 Sample dataset after Encoding

Let us check whether scaling is necessary for this dataset by analysing the mean, standard deviation
and variance of all numerical features:

Table 8 Five-Point Summary Before Scaling

We can observe that only the age and Europe features require scaling, as their mean, standard deviation
and variance are not on the same scale as the other features. Although the Europe column is ordinal, its
11-point range differs from that of the other columns, so it needs to be scaled. Scaling is a necessity when
using distance-based models such as KNN. It can also help stabilize the accuracy of a model and make
training faster. Without scaling, the algorithm may be biased toward the features with values higher in
magnitude.

So we will use the MinMaxScaler from the sklearn library to scale the age and Europe columns and re-check
the five-point summary after scaling, as below:
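The scaling step could look roughly like this. The values are invented for illustration; only the choice of MinMaxScaler and of the two columns comes from the text.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values; only age and Europe are rescaled, Blair is left as it is.
df = pd.DataFrame({
    "age": [24, 43, 58, 93],
    "Europe": [1, 6, 11, 8],
    "Blair": [4, 2, 5, 1],
}).astype({"age": float, "Europe": float})

scaler = MinMaxScaler()  # maps each column to the [0, 1] range
df[["age", "Europe"]] = scaler.fit_transform(df[["age", "Europe"]])

print(df.describe().loc[["min", "max"]])
```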

Table 9 Five-Point Summary After Scaling

Now the data looks scaled. Let us proceed to split the data into train and test sets in a 70:30 ratio.
Holding out a test set lets us evaluate the model on data it has not seen during training, which helps
to check for overfitting.

To split the data into train and test sets, we first need to separate the target and predictor variables into
two different data frames, X and y, where X contains all predictor variables and y contains the
target variable, which is “vote” in this dataset.

We will split the data into 70:30 using train_test_split from sklearn library.
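A small sketch of the separation into X and y and of the 70:30 split follows. The frame is synthetic and the `random_state=1` value is an assumption made only so that the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 20 rows standing in for the encoded survey data.
df = pd.DataFrame({
    "age": range(20, 40),
    "Blair": [1, 2, 3, 4, 5] * 4,
    "vote": [1, 1, 0, 1, 0] * 4,
})

X = df.drop(columns=["vote"])   # predictor variables
y = df["vote"]                  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```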

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

Logistic Regression

Now we will apply Logistic Regression on the train set and perform the predictions on the test set. Here I will
be using Grid Search CV to find the best hyperparameters for building the Logistic Regression model on
the train data set. The best parameters found are solver=’newton-cg’ instead of the default
solver=’lbfgs’, max_iter=10000 instead of the default value of 100, penalty=’l2’, C=1.0,
class_weight=’dict’ and n_jobs=2, where n_jobs is the number of CPU cores used when parallelizing over
classes. These settings were chosen to achieve better accuracy when predicting on the test set.

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.83

Table 10 Classification Report of Train Data

Table 11 Classification Report of Test Data

Inferences :

For predicting votes for Conservative Party (Label 0)

• Precision (76%) – Out of all the voters predicted to vote for the Conservative party, 76% have actually
voted for it.
• Recall (72%) – Out of all the voters who have actually voted for the Conservative party, 72% have been
predicted correctly.

For predicting votes for Labour Party (Label 1)

• Precision (86%) – Out of all the voters predicted to vote for the Labour party, 86% have actually
voted for it.
• Recall (88%) – Out of all the voters who have actually voted for the Labour party, 88% have been predicted
correctly.

Overall accuracy of the model – 83% of total predictions are correct.
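The grid search described in this section can be sketched as below. The data is synthetic and the parameter grid is a hypothetical, much smaller one than the report may have used; only GridSearchCV, the newton-cg solver and max_iter=10000 come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced two-class data standing in for the election survey.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Hypothetical grid; each combination is scored with 3-fold cross-validation.
param_grid = {"solver": ["lbfgs", "newton-cg"], "C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
train_acc = best_model.score(X_train, y_train)
test_acc = best_model.score(X_test, y_test)
print(grid.best_params_)
print(round(train_acc, 2), round(test_acc, 2))
```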

LDA (Linear Discriminant Analysis)

Now we will apply LDA on the train set and perform the predictions on the test set. Here also I will be using
Grid Search CV to find the best hyperparameters for building the LDA model on the train data set.
The best parameters found are solver=’lsqr’ instead of the default solver=’svd’ and shrinkage=‘auto’
instead of the default value of None, chosen to achieve better accuracy when predicting on the test
set.

• Accuracy Score on Training Data : 0.83

• Accuracy Score on Testing Data : 0.84

Table 12 Classification Report of Train Data

Table 13 Classification Report of Test Data

Inferences :

Linear Discriminant Function = 1.77 + (-1.34*age) + (0.63*economic.cond.national) +
(0.08*economic.cond.household) + (0.77*Blair) + (-0.94*Hague) + (-2.39*Europe) +
(-0.43*political.knowledge) + (0.12*gender_male)

From the above equation and its coefficients:

• predictor ‘Europe’ has the largest coefficient magnitude (2.39), so it helps the most in classifying.
• predictor ‘economic.cond.household’ has the smallest coefficient magnitude (0.08), so it helps the least
in classifying.

For predicting votes for Conservative Party (Label 0)

• Precision (76%) – Out of all the voters predicted to vote for the Conservative party, 76% have actually
voted for it.
• Recall (74%) – Out of all the voters who have actually voted for the Conservative party, 74% have been
predicted correctly.

For predicting votes for Labour Party (Label 1)

• Precision (87%) – Out of all the voters predicted to vote for the Labour party, 87% have actually
voted for it.
• Recall (88%) – Out of all the voters who have actually voted for the Labour party, 88% have been predicted
correctly.

Overall accuracy of the model – 84% of total predictions are correct
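As a quick check of how the coefficients compare, they can be ranked by absolute value. The sketch below uses the coefficients reported in the discriminant function, and then shows, on synthetic data, how such coefficients can be read from a fitted LDA model; `rank_by_magnitude` and the feature names x0..x7 are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rank_by_magnitude(coefficients):
    """Return feature names sorted from largest to smallest absolute coefficient."""
    return sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)

# Coefficients as reported in the linear discriminant function of this section.
reported = {
    "age": -1.34, "economic.cond.national": 0.63, "economic.cond.household": 0.08,
    "Blair": 0.77, "Hague": -0.94, "Europe": -2.39,
    "political.knowledge": -0.43, "gender_male": 0.12,
}
ranking = rank_by_magnitude(reported)
print(ranking[0], ranking[-1])

# Reading intercept and coefficients from a model fitted on synthetic data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto").fit(X, y)
fitted = dict(zip([f"x{i}" for i in range(X.shape[1])], lda.coef_[0]))
print(lda.intercept_[0], rank_by_magnitude(fitted)[:3])
```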

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN Model

We will apply the KNeighborsClassifier on the training set and evaluate the model performance on the test set.
First we will use the default value of n_neighbors=5, which gives the scores below:

• Accuracy Score on Training Data : 0.85

• Accuracy Score on Testing Data : 0.81

Now let us run KNN with the number of neighbours set to 1, 3, 5, ..., 19 and find the optimal number of
neighbours using the misclassification error.
Note: Misclassification error (MCE) = 1 - test accuracy score. We calculate the MCE for each model with
neighbours = 1, 3, 5, ..., 19 and pick the model with the lowest MCE.

Plotting misclassification error vs k (with k value on the X-axis) as below:

Fig 13 MCE v/s K-Neighbours Plot

As seen in the above graph, K=9 gives the lowest MCE of approx. 0.16, so we will build the model for K=9
and check its performance.

• Accuracy Score on Training Data : 0.85

• Accuracy Score on Testing Data : 0.83

Table 14 Classification Report of Train Data

Table 15 Classification Report of Test Data
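The search over odd values of K by misclassification error, described earlier in this section, can be sketched as follows. The data is synthetic, so the best K found here need not be 9.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the scaled survey features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

ks = list(range(1, 20, 2))          # K = 1, 3, 5, ..., 19
mce = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    mce.append(1 - knn.score(X_test, y_test))   # MCE = 1 - test accuracy

best_k = ks[mce.index(min(mce))]    # first K with the lowest MCE
print(best_k, round(min(mce), 2))
```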

As the difference between train and test accuracies (1.4%) is less than 10%, it is a valid model.

Overall accuracy of the model – 83% of total predictions are correct

Accuracy score and Precision for test data are almost in line with the training data. This suggests that no
overfitting or underfitting has happened. However, recall has reduced for Class 1 of test data, which may be
due to class imbalance in the data. Therefore, accuracy score alone will not be used to evaluate the model,
as there is an imbalance in the dataset. Our main goal here is to reduce the Type 2 error, i.e. false negatives.

Naïve Bayes Model

When calculating the likelihoods of numerical features, the Naïve Bayes algorithm assumes each feature to be
normally distributed and calculates the probability using only the mean and variance of that feature. It also
assumes that all the predictors are independent of each other. Hence, there are no hyperparameters as such
that can be used to optimise this model.

We will apply the GaussianNB classifier on the train set and check the predictions on the test set:

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.82

Let us check the model validity by analysing the cross-validation scores on the train and test sets.

After 10-fold cross-validation, the scores on the train and test data sets are almost the same across all
10 folds. Hence our model is valid.

• Train Score : 0.83

• Test Score : 0.83

Table 16 Classification Report of Train Data

Table 17 Classification Report of Test Data

Overall accuracy of the model – 82% of total predictions are correct.
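The GaussianNB fit and the 10-fold cross-validation check can be sketched as below, again on synthetic data rather than the survey data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the survey features.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

nb = GaussianNB().fit(X_train, y_train)
print(round(nb.score(X_train, y_train), 2), round(nb.score(X_test, y_test), 2))

# 10-fold cross-validation on each set; similar scores across folds suggest a stable model.
train_cv = cross_val_score(GaussianNB(), X_train, y_train, cv=10)
test_cv = cross_val_score(GaussianNB(), X_test, y_test, cv=10)
print(round(train_cv.mean(), 2), round(test_cv.mean(), 2))
```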

1.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.

Model tuning (hyperparameters) has already been done for the Logistic Regression, LDA, KNN and Naïve Bayes
models.

Bagging (Using Random Forest as classifier)

We create a Bagging model on the train data using the Random Forest classifier as the base estimator, with
n_estimators=100 and random_state=1 as the hyperparameters, and check its performance on the test dataset.

• Accuracy Score on Training Data : 0.97

• Accuracy Score on Testing Data : 0.83

Table 18 Classification Report of Train Data

Table 19 Classification Report of Test Data

Overall accuracy of the model – 83% of total predictions are correct

Accuracy score and Precision for test data are not in line with the training data (0.97 on train against 0.83
on test). This indicates that the model is overfitting the training data. Recall has also reduced for Class 1
of test data, which may be due to class imbalance in the data.
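A sketch of bagging with a Random Forest as the base estimator is given below. The data is synthetic, and the inner forest is kept at 10 trees purely so the example runs quickly; that inner setting is an assumption, not the report's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Small inner forest (assumption, for speed); bagging settings follow the text.
base = RandomForestClassifier(n_estimators=10, random_state=1)
bag = BaggingClassifier(base, n_estimators=100, random_state=1)
bag.fit(X_train, y_train)

train_acc = bag.score(X_train, y_train)
test_acc = bag.score(X_test, y_test)
# A large gap between the two scores is a sign of overfitting.
print(round(train_acc, 2), round(test_acc, 2))
```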

Ada Boost

We create the Ada Boost model using the AdaBoostClassifier from the sklearn ensemble library, with the
parameters suggested by Grid Search CV: learning rate as 1 and n_estimators as 10. We get the below
accuracy scores:

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.82

Table 20 Classification Report of Train Data

Table 21 Classification Report of Test Data

Overall accuracy of the model – 82% of total predictions are correct

Accuracy score and Precision for test data are almost in line with the training data. This suggests that no
overfitting or underfitting has happened. However, recall has reduced for Class 1 of test data, which may be
due to class imbalance in the data.

Gradient Boosting

We create the Gradient Boosting model using the GradientBoostingClassifier from the sklearn ensemble library,
with the parameters suggested by Grid Search CV: learning rate as 0.5 and n_estimators as 12. We get the
below accuracy scores:

• Accuracy Score on Training Data : 0.89

• Accuracy Score on Testing Data : 0.84

Table 22 Classification Report of Train Data

Table 23 Classification Report of Test Data

Overall accuracy of the model – 84% of total predictions are correct

Accuracy score and Precision for test data are not in line with the training data, which shows that some
overfitting has happened. However, the difference between training and test set scores is within the industry
standards, so it can be accepted as a valid model. Recall has also reduced for Class 1 of test data, which
may be due to class imbalance in the data.
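The two boosting models with the hyperparameters quoted in this section can be sketched together as below. The data is synthetic and the `random_state` values are assumptions added for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Hyperparameters as quoted in the text; random_state is an assumption.
ada = AdaBoostClassifier(n_estimators=10, learning_rate=1.0, random_state=1)
gb = GradientBoostingClassifier(n_estimators=12, learning_rate=0.5, random_state=1)

scores = {}
for name, model in [("Ada Boost", ada), ("Gradient Boosting", gb)]:
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(name, round(scores[name][0], 2), round(scores[name][1], 2))
```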

1.7 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.
Let us check and compare the performance of predictions on the Train and Test sets using Accuracy,
Confusion Matrix and ROC_AUC scores of all the models, and find out the best model among all.
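The metrics compared in this section can be computed for any fitted classifier with a small helper like the one below. The helper name `evaluate`, the synthetic data and the choice of Logistic Regression as the example model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Return accuracy, confusion matrix, ROC_AUC and ROC curve points for a fitted model."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]      # probability of class 1
    fpr, tpr, _ = roc_curve(y, proba)
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "roc_auc": roc_auc_score(y, proba),
        "roc_curve": (fpr, tpr),              # can be passed to a plotting library
    }

# Synthetic two-class data standing in for the survey features.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
train_report = evaluate(model, X_train, y_train)
test_report = evaluate(model, X_test, y_test)
print(round(test_report["accuracy"], 2), round(test_report["roc_auc"], 2))
print(test_report["confusion_matrix"])
```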

1. Logistic Regression

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.83

Fig 14 Confusion Matrix of Train and Test sets

Fig 15 ROC_AUC score and ROC curve of Train and Test Sets

2. LDA (Linear Discriminant Analysis)

• Accuracy Score on Training Data : 0.83

• Accuracy Score on Testing Data : 0.84

Fig 16 Confusion Matrix of Train and Test sets

Fig 17 ROC_AUC score and ROC curve of Train and Test Sets

3. KNN Model

• Accuracy Score on Training Data : 0.85

• Accuracy Score on Testing Data : 0.83

Fig 18 Confusion Matrix of Train and Test sets

Fig 19 ROC_AUC score and ROC curve of Train and Test Sets

4. Naïve Bayes Model

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.82

Fig 20 Confusion Matrix of Train and Test sets

Fig 21 ROC_AUC score and ROC curve of Train and Test Sets

5. Bagging (Using Random Forest as classifier)

• Accuracy Score on Training Data : 0.97

• Accuracy Score on Testing Data : 0.83

Fig 22 Confusion Matrix of Train and Test sets

Fig 23 ROC_AUC score and ROC curve of Train and Test Sets

6. Ada Boost

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.82

Fig 24 Confusion Matrix of Train and Test sets

Fig 25 ROC_AUC score and ROC curve of Train and Test Sets

7. Gradient Boosting

• Accuracy Score on Training Data : 0.89

• Accuracy Score on Testing Data : 0.84

Fig 26 Confusion Matrix of Train and Test sets

Fig 27 ROC_AUC score and ROC curve of Train and Test Sets

Let us quickly compare all the performance metrics of the above seven models and find out the best model
among all:

Table 24 Comparison Summary of Logistic Regression, LDA & KNN models

Table 25 Comparison Summary of Naïve Bayes, Bagging , Ada Boost & Gradient Boosting models

From the above summary, we can draw the following inferences:

• Accuracy score and Precision for Class 1 of the Logistic Regression, LDA, KNN, Naïve Bayes and Ada Boost
models on the training data are almost in line with the testing data, which indicates that no overfitting or
underfitting has happened.
• ROC_AUC scores of the Logistic Regression, LDA, Naïve Bayes and Ada Boost models on the training data are
almost in line with the testing data.
• Recall scores of LDA, KNN and Naïve Bayes on the training data are almost in line with the testing data.

So, overall we can infer that LDA and Naive Bayes are the most optimized of the above models. But as we know
that there is class imbalance in the data, we will apply SMOTE to these 2 models, i.e. LDA and Naive Bayes,
to check whether the performance improves.
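SMOTE balances the classes by creating synthetic minority samples that lie between a minority point and one of its nearest minority neighbours. The report does not say which implementation it used; the function below is a simplified, hand-written sketch of the idea for a two-class problem (the name `smote_like_oversample` and all parameters are hypothetical), not a library API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X, y, minority_label, k=5, random_state=1):
    """Add interpolated minority samples until both classes have equal size (binary case)."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_needed = int((y != minority_label).sum() - len(X_min))
    if n_needed <= 0:
        return X, y
    # k nearest minority neighbours of every minority point (column 0 is the point itself).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base_idx = rng.integers(0, len(X_min), size=n_needed)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))           # position along the connecting segment
    synthetic = X_min[base_idx] + gap * (X_min[neigh_idx] - X_min[base_idx])
    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_new, y_new

# Synthetic imbalanced data; class 0 is the minority here.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.3, 0.7], random_state=1)
X_res, y_res = smote_like_oversample(X, y, minority_label=0)
print(np.bincount(y), np.bincount(y_res))
```

Oversampling of this kind is normally applied to the training set only, so that the test set keeps its original class proportions.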

Table 26 Comparison Summary of LDA & Naïve Bayes after SMOTE

Table 27 Comparing performances before and after SMOTE

We can conclude that the Naïve Bayes model’s performance is slightly better than LDA after SMOTE. Accuracy
has remained constant and the ROC_AUC score has reduced, while recall and precision scores have improved
only for Class 1.

So we can infer that there is not much improvement in the models after applying SMOTE. Hence the Naïve Bayes
model before applying SMOTE is the best and most optimised model among all.

The final model is Naïve Bayes and it has the below performance metrics:

• Accuracy Score on Training Data : 0.84

• Accuracy Score on Testing Data : 0.82

Fig 28 Confusion Matrix of Train and Test sets

Fig 29 ROC_AUC score and ROC curve of Train and Test Sets

Table 28 Classification Report of Train Data

Table 29 Classification Report of Test Data

For predicting votes for Conservative Party (Label 0)

• Precision (74%) – Out of all the voters predicted to vote for the Conservative party, 74% have actually
voted for it.
• Recall (73%) – Out of all the voters who have actually voted for the Conservative party, 73% have been
predicted correctly.

For predicting votes for Labour Party (Label 1)

• Precision (87%) – Out of all the voters predicted to vote for the Labour party, 87% have actually
voted for it.
• Recall (87%) – Out of all the voters who have actually voted for the Labour party, 87% have been predicted
correctly.

Overall accuracy of the model – 82% of total predictions are correct and the AUC score is also quite good,
which means the model is able to distinguish well between the two classes.

Accuracy score and Precision for test data are almost in line with the training data. This suggests that no
overfitting or underfitting has happened. So overall it is a good and optimised model.

1.8 Based on these predictions, what are the insights?

Summing up all the above steps as below:

 Analysed the dataset thoroughly by doing EDA to analyse different variables and their relationship with each
other , pre-processed the data as there were some duplicates and checked for outliers .
32

 Encoded the categorical variable, split the data into train and test sets (70:30) and scaled the columns
which were on different scales.
 Created different models and tuned their hyperparameters using GridSearchCV, cross-validation and
misclassification error.
 Analysed and compared performance metrics such as accuracy, precision, recall, ROC_AUC score
and confusion matrix for all the models to find the best-performing model.
 Applied SMOTE, which generates synthetic samples for the minority class, to the two best models to correct
the class imbalance, and compared the performance metrics of the models before and after applying
SMOTE.
 Chose the final model based on the above analysis and commented on that model's performance metrics.
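
The idea behind SMOTE mentioned above can be sketched in a few lines: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is an illustration of that idea only, written with the standard library; it is not the library implementation used for the models in this report, and the function name `smote_sketch`, the parameter `k` and the sample points are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region already occupied by the minority class.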

Based on the predictions of the final Naïve Bayes model, the following business insights can be drawn:
 The model predicts that most voters will vote for the Labour party, so its chances of winning the election are
considerably higher than those of the Conservative party.
 The exit poll therefore indicates that the Labour party will get more votes; the model behind this prediction
classifies 82% of all cases correctly.

END OF PROBLEM 1

Problem 2 Text Mining

Introduction
In this project, we work on the inaugural corpus from nltk in Python. We look at the following inaugural
speeches of Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the
mentioned documents.

After importing the three speeches from the nltk library, we check the number of characters, words and sentences
in each of them.
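
As a rough illustration of what these three counts mean, here is a minimal sketch that uses only the standard library on a made-up sample string. In the report the counts come from the nltk corpus reader (its `raw()`, `words()` and `sents()` methods), whose tokenisation differs from the simple whitespace and punctuation splitting assumed here, so the numbers will not match exactly. The function name `text_counts` and the sample text are assumptions.

```python
import re


def text_counts(text):
    """Return (characters, words, sentences) for a piece of text."""
    chars = len(text)
    words = len(text.split())  # whitespace-separated tokens
    # Treat runs of ., ! or ? as sentence boundaries and ignore empty pieces.
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    return chars, words, sentences


# Made-up sample text, not a quotation from any of the speeches.
sample = "We the people. Let us begin! Shall we?"
print(text_counts(sample))
```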

2.2 Remove all the stopwords from all three speeches.

Before removing stopwords, we pre-process (clean) the text of each of the three speeches
in the following steps:

 First, we convert all the speech text to lowercase.
 Then we remove all special characters using the re.sub() function, which performs string substitution with regular
expressions.
 Next, we tokenize the text, which means splitting the text into words.
 We then remove the stopwords, that is, very common words such as "the", "is" and "and" that carry little meaning.
 Finally, we stem the words to their root forms using PorterStemmer.
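
The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the code used for the report: the small `STOPWORDS` set stands in for nltk's English stopword list, the stemming step is passed in as a function (nltk's `PorterStemmer().stem` could be supplied here) and defaults to leaving words unchanged, and the sample sentence is made up.

```python
import re

# Tiny illustrative stopword set; an assumption standing in for a full list.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "we", "our", "that"}


def preprocess(text, stopwords=STOPWORDS, stem=lambda word: word):
    """Lowercase, strip special characters, tokenize, drop stopwords, stem."""
    text = text.lower()
    # Replace every character that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in stopwords]
    return [stem(t) for t in tokens]


# Made-up sample sentence, not a quotation from any of the speeches.
sample = "We renew our faith in the Nation, and in its people!"
cleaned = preprocess(sample)
print(len(sample.split()), "words before,", len(cleaned), "words after")
print(cleaned)
```

Passing the stemmer as an argument keeps the sketch free of external dependencies while leaving room for the Porter stemmer in the last step.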

1. Speech of President Franklin D. Roosevelt in 1941: Checking the word count before and after
removal of stopwords in this speech and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords: 1348

Word count after removal of stopwords: 625

2. Speech of President John F. Kennedy in 1961: Checking the word count before and after removal of
stopwords in this speech and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords: 1371

Word count after removal of stopwords: 688

3. Speech of President Richard Nixon in 1973: Checking the word count before and after removal of
stopwords in this speech and displaying a sample sentence after removal of stopwords.

Word count before removal of stopwords: 1819

Word count after removal of stopwords: 833

Note: The word count in question 2.1 differs from the word count before removal of stopwords in question 2.2 because
.words() also counts punctuation marks as tokens, whereas the cleaned text contains only words.

2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

We can find the words that occur most often with the nltk.FreqDist() function.
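
nltk.FreqDist is a subclass of Python's collections.Counter, so the same idea can be illustrated with the standard library alone. The sketch below counts tokens and returns the three most frequent ones; the token list is made up for illustration and the helper name `top_words` is an assumption.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


# Made-up token list, not taken from any of the speeches.
tokens = ["let", "us", "let", "power", "us", "let", "power", "us", "let", "begin"]
for word, count in top_words(tokens):
    print(word, ":", count, "times")
```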

1. Speech of President Franklin D. Roosevelt in 1941: In this speech, the following three words occur
most often:
Nation : 17 times
Know : 10 times
Peopl (the stemmed form of "people") : 9 times

2. Speech of President John F. Kennedy in 1961: In this speech, the following three words occur
most often:
Let : 16 times
Us : 12 times
Power : 9 times

3. Speech of President Richard Nixon in 1973: In this speech, the following three words occur most
often:
Us : 26 times
Let : 22 times
America : 21 times

2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)

Now we plot a word cloud of the most used words in each of the three speeches using WordCloud from the
wordcloud library and display it with matplotlib.

Fig 30 WordCloud of President Franklin D. Roosevelt’s Speech 1941

Fig 31 WordCloud of President John F. Kennedy’s Speech 1961

Fig 32 WordCloud of President Richard Nixon’s Speech 1973

END OF PROBLEM 2