
Machine Learning - Project

Machine Learning (ML)


Project Report

Date: 05th December 2021


Version 1.0

Table of Contents
1 Problem 1 Statement............................................................................................................... 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 marks) ............................................................................................. 1
1.1.1 Data Summary........................................................................................................... 1
1.1.2 Duplicated Data Summary ........................................................................................ 2
1.1.3 Descriptive Statistics ................................................................................................. 2
1.1.4 Sample Data .............................................................................................................. 3
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 marks) ....................................................................................................................... 4
1.2.1 Univariate Analysis.................................................................................................... 4
Age ..................................................................................................................... 4
Numerical Categorical (Ordinal) Data ............................................................... 5
National Economic Condition ............................................................................ 5
Household Economic Condition ........................................................................ 6
Blair .................................................................................................................... 6
Hague ................................................................................................................. 7
Europe................................................................................................................ 7
Political Knowledge ........................................................................................... 8
Gender ............................................................................................................... 9
Vote ................................................................................................................. 10
Age Distributions Across Other Features ........................................................ 10
Vote Distributions Across Other Features ....................................................... 14
1.2.2 Bivariate Analysis ................................................................................... 18
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). (4 marks) ................................................ 20
1.3.1 Encoding of Categorical Data .................................................................................. 20
1.3.2 Split of Data – Train and Test .................................................................................. 20
1.3.3 Scaling Necessity ..................................................................................................... 21
What is Scaling ................................................................................................. 21
Why is Scaling Needed..................................................................................... 21
When is Scaling Done ...................................................................... 22
Voter Dataset – Scaling Decision ..................................................................... 22
Voter Dataset – Scaling Process ...................................................................... 22
Voter Dataset - Before and After Scaling ........................................................ 22
1.3.4 Data Balancing Issue - SMOTE ................................................................................ 24
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks) ................ 25
1.4.1 Logistic Regression Model ...................................................................................... 25
Log. Reg. Model Step1- Data Split ................................................................... 25
Log. Reg. Model Step2- Model Build ............................................................... 25
Log. Reg. Model Performance ......................................................................... 25
1.4.2 Linear Discriminant Analysis Model........................................................................ 26
LDA Model Step1- Data Split ........................................................................... 26
LDA Model Step2- Model Build ....................................................................... 26
LDA Model Performance ................................................................................. 26
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks) ................ 27
1.5.1 KNN Model .............................................................................................................. 27
KNN Model Step1- Data Split........................................................................... 27
KNN Model Step2- Model Build....................................................................... 27
KNN Model Performance................................................................................. 27
1.5.2 Naïve Bayes Analysis Model ................................................................................... 28
Naïve Bayes Model Step1- Data Split .............................................................. 28
Naïve Bayes Model Step2- Model Build .......................................................... 28
Naïve Bayes Model Performance .................................................................... 28
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting. (7
marks) ........................................................................................................................................ 29
1.6.1 Bagging Models ....................................................................................................... 29
Bagging Classifier Model.................................................................................. 29
Random Forest Bagging Classifier ................................................................... 30
1.6.2 Boosting Models ..................................................................................................... 32
Ada Boost ......................................................................................................... 32
Gradient Boosting Model ................................................................................ 33
1.6.3 Model Tuning .......................................................................................... 34
Data Used for Tuned Model Training .............................................................. 34
Log. Reg. Model - Model Tuning (Grid Search) ............................................... 35
LDA. Model - Model Tuning (Grid Search)....................................................... 36
KNN - Model Tuning (Neighbor-K Search) ....................................................... 36
Naïve Bayes Model - Model Tuning (Grid Search)........................................... 38
Bagging Classifier Model - Model Tuning (Grid Search) .................................. 38
Random Forest Bagging Model– Best Inputs (Grid Search) ............................ 39
Ada Boost Model - Model Tuning (Grid Search).............................................. 40
Gradient Boost Model - Model Tuning (Grid Search) ...................................... 41
1.6.4 All Model Scores ..................................................................................................... 42
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized. (7 marks) 44
1.7.1 Performance Metrics .............................................................................................. 44
Confusion Matrix ............................................................................................. 44
ROC Curve and AUC Score ............................................................................... 46
1.7.2 Model Performance Decisions ................................................................................ 47
1.7.3 Logistical Regression Model – Complete Performance .......................................... 48
1.7.4 LDA Model – Complete Performance ..................................................................... 50
1.7.5 KNN Model – Complete Performance .................................................................... 52
1.7.6 Naïve Bayes Model – Complete Performance ........................................................ 54
1.7.7 Bagging Classification Model – Complete Performance......................................... 56
1.7.8 Random Forest Bagging Model – Complete Performance ..................................... 58
1.7.9 Ada Boost Model – Complete Performance ........................................................... 60
1.7.10 Gradient Boost Model – Complete Performance ................................................... 62
1.7.11 All Models Performance Comparison ..................................................................... 64
1.7.12 Final Model Choice.................................................................................................. 68
1.8 Based on these predictions, what are the insights? (5 marks) ...................................... 69
2 Problem 2 Statement............................................................................................................. 70
2.1 Find the number of characters, words, and sentences for the mentioned documents. –
3 Marks ...................................................................................................................................... 70
2.1.1 Number of Characters............................................................................. 70
2.1.2 Number of Words ................................................................................................... 71
2.1.3 Number of Sentences ............................................................................................. 72
2.2 Remove all the stop words from all three speeches. – 3 Marks .................................... 73
2.2.1 Lower Case Words .................................................................................................. 73
2.2.2 Stemming Words .................................................................................................... 73
1941-Roosevelt – Before and after Stemming ................................................ 73
1961-Kennedy – Before and after Stemming .................................................. 73
1973-Nixon – Before and after Stemming....................................................... 74
Conclusion on Stemming ................................................................................. 74
2.2.3 Stop Words Cleanup ............................................................................................... 74
1941-Roosevelt – Before and After Stop Words Cleanup ............................... 75
1961-Kennedy – Before and After Stop Words Cleanup ................................. 75
1973-Nixon – Before and After Stop Words Cleanup ..................................... 76
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stop words) – 3 Marks .......... 76
2.3.1 Most Frequent Words in Cleaned Speech .............................................................. 76
2.3.2 Most Frequent Words in Cleaned Speech (Updated Stop Words) ......................... 77
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop
words) – 3 Marks ....................................................................................................................... 77
2.4.1 1941-Roosevelt Speech........................................................................................... 78
2.4.2 1961-Kennedy Speech ............................................................................................ 79
2.4.3 1973-Nixon Speech ................................................................................................. 80

List of Figures
Figure 1-1 Age Data – Boxplot and Histogram................................................................................ 4
Figure 1-2 Count of Voters Across Age Groups.............................................................. 4
Figure 1-3 NEC Data – Boxplot and Count Plot ............................................................................... 5
Figure 1-4 HEC Data – Boxplot and Count Plot ............................................................................... 6
Figure 1-5 Blair Assessment Data – Boxplot and Count Plot ......................................... 6
Figure 1-6 Hague Assessment Data – Boxplot and Count Plot ....................................................... 7
Figure 1-7 Europe Data – Boxplot and Count Plot .......................................................................... 8
Figure 1-8 PK Data – Boxplot and Count Plot ................................................................................. 9
Figure 1-9 Gender Count Plot ......................................................................................................... 9
Figure 1-10 Vote Count Plot.......................................................................................................... 10
Figure 1-11 Voter Age Groups – NEC and HEC Scoring................................................................. 11
Figure 1-12 Voter Age Groups – Blair-Hague Scoring ................................................................... 12
Figure 1-13 Voter Age Groups – Europe Scoring & Political Knowledge ...................................... 13
Figure 1-14 Voter Age Groups – Gender ...................................................................................... 14
Figure 1-15 Vote Decision- Age Groups ........................................................................................ 14
Figure 1-16 Vote Decision- HEC and NEC Scores .......................................................................... 15
Figure 1-17 Vote Decision- Blair-Hague Scores ............................................................................ 16
Figure 1-18 Vote Decision- Political Knowledge Scores................................................................ 16
Figure 1-19 Vote Decision- Europe Sentiment Scores .................................................................. 17
Figure 1-20 Vote Decision- Gender ............................................................................................... 17
Figure 1-21 Voters – Pair Plot ....................................................................................................... 18
Figure 1-22 Voters – Numerical Data Heat Map........................................................................... 19
Figure 1-23 Range of Data Distribution all Columns ..................................................................... 23
Figure 1-24 Range of Data Distribution all Columns After Scaling ............................................... 23
Figure 1-25 MCE-K Neighbors Plot................................................................................................ 37
Figure 1-26 Confusion Matrix ....................................................................................................... 44
Figure 1-27 Log. Reg. Training Data Confusion Matrix ................................................................. 48
Figure 1-28 Log. Reg. Test Data Confusion Matrix ....................................................................... 48
Figure 1-29 Log. Reg. Training Data ROC-AUC Curve.................................................................... 49
Figure 1-30 Log. Reg. Test Data ROC-AUC Curve .......................................................................... 49
Figure 1-31 LDA Training Data Confusion Matrix ......................................................................... 50
Figure 1-32 LDA Test Data Confusion Matrix ................................................................................ 50
Figure 1-33 LDA Training Data ROC-AUC Curve ............................................................................ 51
Figure 1-34 LDA Test Data ROC-AUC Curve .................................................................................. 51
Figure 1-35 KNN Training Data Confusion Matrix......................................................................... 52
Figure 1-36 KNN Test Data Confusion Matrix ............................................................................... 52
Figure 1-37 KNN Training Data ROC-AUC Curve ........................................................................... 53
Figure 1-38 KNN Test Data ROC-AUC Curve ................................................................................. 53
Figure 1-39 Naïve Bayes Training Data Confusion Matrix ............................................................ 54
Figure 1-40 Naïve Bayes Test Data Confusion Matrix................................................... 54
Figure 1-41 Naïve Bayes Training Data ROC-AUC Curve.............................................. 55
Figure 1-42 Naïve Bayes Test Data ROC-AUC Curve ..................................................................... 55
Figure 1-43 Bagging Classification Training Data Confusion Matrix ............................................. 56
Figure 1-44 Bagging Classification Test Data Confusion Matrix ................................................... 56
Figure 1-45 Bagging Classification Training Data ROC-AUC Curve ............................................... 57
Figure 1-46 Bagging Classification Test Data ROC-AUC Curve ...................................................... 57
Figure 1-47 Random Forest Bagging Training Data Confusion Matrix ......................................... 58
Figure 1-48 Random Forest Bagging Test Data Confusion Matrix ................................................ 58
Figure 1-49 Random Forest Bagging Training Data ROC-AUC Curve ............................................ 59
Figure 1-50 Random Forest Bagging Test Data ROC-AUC Curve .................................................. 59
Figure 1-51 Ada Boost Training Data Confusion Matrix ............................................................... 60
Figure 1-52 Ada Boost Test Data Confusion Matrix...................................................................... 60
Figure 1-53 Ada Boost Training Data ROC-AUC Curve .................................................................. 61
Figure 1-54 Ada Boost Test Data ROC-AUC Curve ........................................................................ 61
Figure 1-55 Gradient Boost Training Data Confusion Matrix ....................................................... 62
Figure 1-56 Gradient Boost Test Data Confusion Matrix.............................................................. 62
Figure 1-57 Gradient Boost Training Data ROC-AUC Curve .......................................................... 63
Figure 1-58 Gradient Boost Test Data ROC-AUC Curve ................................................................ 63
Figure 1-59 All Models – Accuracy ................................................................................................ 65
Figure 1-60 All Models – Precision................................................................................................ 65
Figure 1-61 All Models – Recall ..................................................................................................... 66
Figure 1-62 All Models – F1-Score ................................................................................................ 66
Figure 1-63 All Models - AUC ........................................................................................................ 67
Figure 1-64 All Models – Training Data ROC ................................................................................. 67
Figure 1-65 All Models – Test Data ROC ....................................................................................... 68
Figure 2-1 Character Count ........................................................................................................... 71
Figure 2-2 Word Count ................................................................................................................. 71
Figure 2-3 Sentence Count............................................................................................................ 72
Figure 2-4 Word Count – Before and After Stop-Words Cleanup ................................................ 75
Figure 2-5 Word Cloud - 1941-Roosevelt Speech ......................................................................... 78
Figure 2-6 Word Cloud – 1961 – Kennedy Speech ....................................................................... 79
Figure 2-7 Word Cloud - 1973-Nixon Speech ............................................................................... 80

List of Tables
Table 1-1 Data Dictionary for Election Data Survey Details ........................................................... 1
Table 1-2 Sample of Duplicated Voter Data ................................................................................... 2
Table 1-3 Descriptive Statistics of Voter Data (Numerical Columns) ............................................. 3
Table 1-4 Sample Electorate Distribution Data .............................................................................. 3
Table 1-5 Count of Voters Across Age Groups ................................................................ 4
Table 1-6 NEC Assessment Data Distribution ................................................................................. 5
Table 1-7 HEC Assessment Data Distribution ................................................................................. 6
Table 1-8 Blair Assessment Data Distribution ................................................................................ 7
Table 1-9 Hague Assessment Data Distribution ............................................................................. 7
Table 1-10 Europe Sentiment Score Distribution ........................................................................... 8
Table 1-11 PK Assessment Data Distribution.................................................................................. 9
Table 1-12 Gender Data Distribution .............................................................................................. 9
Table 1-13 Vote Data Distribution ................................................................................................ 10
Table 1-14 NEC & HEC Across Age Groups ................................................................................... 10
Table 1-15 Blair & Hague Across Age Groups ............................................................................... 11
Table 1-16 Europe Scoring & Political Knowledge Across Age Groups ........................................ 13
Table 1-17 Categorical Values to Numerical Number Codes ........................................................ 20
Table 1-18 Before and After Scaling - STD Comparison................................................................ 23
Table 1-19 Scaled Voter Data........................................................................................................ 24
Table 1-20 Before and After SMOTE ............................................................................................. 24
Table 1-21 Log. Reg. Model Scores ............................................................................................... 25
Table 1-22 LDA Model Scores ....................................................................................................... 26
Table 1-23 KNN Model Scores ...................................................................................................... 27
Table 1-24 Naïve Bayes Model Scores .......................................................................................... 28
Table 1-25 Bagging Model Scores ................................................................................................. 29
Table 1-26 RF-Computed Importance for All Features ................................................................. 31
Table 1-27 RF Bagging Model Scores ............................................................................................ 32
Table 1-28 Ada Boost Model Scores ............................................................................................. 33
Table 1-29 Gradient Boost Model Scores ..................................................................................... 34
Table 1-30 Data Used to Train Tuned Models .............................................................................. 35
Table 1-31 Log. Reg.-Simple Vs Tuned Model Comparison .......................................................... 36
Table 1-32 LDA-Simple Vs Tuned Model Comparison .................................................................. 36
Table 1-33 KNN – Scores for Different N ...................................................................................... 37
Table 1-34 KNN-Simple Vs Tuned Model Comparison ................................................................. 38
Table 1-35 Naïve Bayes-Simple Vs Tuned Model Comparison ..................................................... 38
Table 1-36 Bagging Classifier-Simple vs Tuned Model Comparison ............................................. 39
Table 1-37 RF-Computed Importance for All Features – GridSearchCV Best Parameters ........... 40
Table 1-38 RF-Simple vs Tuned Model Comparison ..................................................................... 40
Table 1-39 Ada Boost -Simple Vs Tuned Model Comparison ....................................................... 41
Table 1-40 Gradient Boost -Simple Vs Tuned Model Comparison ............................................... 42
Table 1-41 All Model Scores ......................................................................................................... 43
Table 1-42 Sample Classification Report ...................................................................................... 46
Table 1-43 Log. Reg. Training Data Classification Report ............................................................. 48
Table 1-44 Log. Reg. Test Data Classification Report .................................................... 48
Table 1-45 Log. Reg. Metrics – Train and Test Data ..................................................................... 49
Table 1-46 LDA Training Data Classification Report ..................................................... 50
Table 1-47 LDA Test Data Classification Report ............................................................................ 50
Table 1-48 LDA Metrics – Train and Test Data ............................................................................. 51
Table 1-49 KNN Training Data Classification Report..................................................................... 52
Table 1-50 KNN Test Data Classification Report ........................................................................... 52
Table 1-51 KNN Metrics – Train and Test Data ............................................................................. 53
Table 1-52 Naïve Bayes Training Data Classification Report ........................................................ 54
Table 1-53 Naïve Bayes Test Data Classification Report .............................................................. 54
Table 1-54 Naïve Bayes Metrics – Train and Test Data ................................................................ 55
Table 1-55 Bagging Classification Training Data Classification Report ......................................... 56
Table 1-56 Bagging Classification Test Data Classification Report ............................................... 56
Table 1-57 Bagging Classification Metrics – Train and Test Data ................................................. 57
Table 1-58 Random Forest Bagging Training Data Classification Report ..................................... 58
Table 1-59 Random Forest Bagging Test Data Classification Report ............................................ 58
Table 1-60 Random Forest Bagging Metrics – Train and Test Data.............................................. 59
Table 1-61 Ada Boost Training Data Classification Report ........................................................... 60
Table 1-62 Ada Boost Test Data Classification Report.................................................................. 60
Table 1-63 Ada Boost Metrics – Train and Test Data ................................................................... 61
Table 1-64 Gradient Boost Training Data Classification Report ................................................... 62
Table 1-65 Gradient Boost Test Data Classification Report .......................................................... 62
Table 1-66 Gradient Boost Metrics – Train and Test Data ........................................................... 63
Table 1-67 All Models Scores ........................................................................................................ 64
Table 2-1 Speech Raw Output + Character Count ........................................................................ 70
Table 2-2 Speech Words Output + Words Count ......................................................................... 71
Table 2-3 Speech Sentence Output + Sentence Count ................................................................. 72
Table 2-4 Speech Output Post Cleanup ........................................................................................ 74
Table 2-5 Words with Top Frequency ........................................................................................... 77
Table 2-6 Words with Top Frequency – After Updated Stop Words ............................................ 77

List of Formulae
Formula 1-1 Min-Max Calculation ................................................................................................ 22
Formula 1-2 Confusion Matrix – Accuracy.................................................................................... 45
Formula 1-3 Confusion Matrix - Precision .................................................................................... 45
Formula 1-4 Confusion Matrix - Recall ......................................................................................... 45
Formula 1-5 Confusion Matrix - Specificity .................................................................................. 46
Formula 1-6 Confusion Matrix – F1 Score .................................................................................... 46

1 Problem 1 Statement
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent
elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to
predict which party a voter will vote for on the basis of the given information, in order to create
an exit poll that will help predict the overall win and the seats covered by a particular party.

Variable Name              Description
Vote*                      Party choice: Conservative or Labour
Age                        In years
National Economic Cond.    Assessment of current national economic conditions, 1 to 5.
Household Economic Cond.   Assessment of current household economic conditions, 1 to 5.
Blair                      Assessment of the Labour leader, 1 to 5.
Hague                      Assessment of the Conservative leader, 1 to 5.
Europe                     An 11-point scale that measures respondents' attitudes toward
                           European integration; high scores represent Eurosceptic sentiment.
Political Knowledge        Knowledge of parties' positions on European integration, 0 to 3.
Gender                     Female or male.
Table 1-1 Data Dictionary for Election Data Survey Details

* - Target Variable (data to be predicted by model)

1.1 Read the dataset. Do the descriptive statistics and do the null value condition
check. Write an inference on it. (4 marks)
1.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1525 non-null int64
1 Vote 1525 non-null object
2 Age 1525 non-null int64
3 National Economic Cond. 1525 non-null int64
4 Household Economic Cond. 1525 non-null int64
5 Blair 1525 non-null int64
6 Hague 1525 non-null int64
7 Europe 1525 non-null int64
8 Political Knowledge 1525 non-null int64
9 Gender 1525 non-null object
dtypes: int64(8), object(2)
memory usage: 119.3+ KB

1. There are a total of 10 columns and 1,525 rows.
2. Two of the columns hold categorical data.
3. The rest of the columns hold numerical data.
4. None of the columns hold any null entries.
5. The column Unnamed: 0, which held serial numbers from 1 to 1,525, is dropped after the
initial data read. It is not used further as it does not contribute to the Vote prediction.

1.1.2 Duplicated Data Summary


After dropping the Unnamed: 0 column, we see that 8 data rows are duplicated. Given the nature
of the data, such duplication can be expected.
As these duplicated records do not add any value to the study, they are safely excluded (dropped)
before the evaluation model is created. Once the duplicated data is removed, 1,517 rows remain
for model creation and evaluation.
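A minimal sketch of this cleanup step, assuming pandas and a hypothetical file name voter_data.csv:

import pandas as pd

# Read the survey data; the file name here is illustrative.
df = pd.read_csv('voter_data.csv')

# Drop the serial-number column, which carries no predictive value.
df = df.drop(columns=['Unnamed: 0'])

# Count and drop exact duplicate rows.
print(df.duplicated().sum())   # expected: 8
df = df.drop_duplicates()      # 1,517 rows remain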
A sample of the duplicated pairs is shown below.

       Vote    Age  National        Household       Blair  Hague  Europe  Political   Gender
                    Economic Cond.  Economic Cond.                        Knowledge
916    Labour  29   4               4               4      2      2       2           female
1244   Labour  29   4               4               4      2      2       2           female
2      Labour  35   4               4               5      2      3       2           male
67     Labour  35   4               4               5      2      3       2           male
Table 1-2 Sample of Duplicated Voter Data

1.1.3 Descriptive Statistics


The descriptive statistics of all the numerical columns are summarized below.

                          mean   std   min  25%  50%  75%  max  mode
Age                       54.24  15.7  24   41   53   67   93   37
National Economic Cond.   3.25   0.88  1    3    3    4    5    3
Household Economic Cond.  3.14   0.93  1    3    3    4    5    3
Blair                     3.34   1.17  1    2    4    4    5    4
Hague                     2.75   1.23  1    2    2    4    5    2
Europe                    6.74   3.3   1    4    6    10   11   11
Political Knowledge       1.54   1.08  0    0    2    2    3    2
Table 1-3 Descriptive Statistics of Voter Data (Numerical Columns)

1. We observe that the min/max values of all columns are valid.
2. There is a wide range of ages in the data. It will be interesting to see how each age group
is distributed across the other columns.
3. Age is the only column with truly continuous numerical data. Although the other columns
hold numbers, these are ordinal in nature, i.e. they represent a graded scale.

1.1.4 Sample Data


A sample of the dataset (with the Unnamed: 0 column dropped) is shown below.

    Vote    Age  National        Household       Blair  Hague  Europe  Political   Gender
                 Economic Cond.  Economic Cond.                        Knowledge
0   Labour  43   3               3               4      1      2       2           female
1   Labour  36   4               4               4      4      5       2           male
2   Labour  35   4               4               5      2      3       2           male
3   Labour  24   4               2               2      1      4       0           female
4   Labour  41   2               2               1      1      6       2           male
Table 1-4 Sample Electorate Distribution Data


1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check
for Outliers. (7 marks)
1.2.1 Univariate Analysis
Age
Distribution: The age data is fairly normally distributed, with a very wide top indicating that
voters are spread quite evenly across all ages.
Skew: The data has very low skew, with no long tail on either end (skewness: 0.14).
Outliers: The age data has no outliers.

Figure 1-1 Age Data – Boxplot and Histogram

Age Groups Count of Voters in Each Age Group % of Voters


(0, 25] 15 0.99
(25, 45] 486 32.04
(45, 65] 588 38.76
(65, 100] 428 28.21
Table 1-5 Count of Voters Across Age Groups
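A minimal sketch of the binning behind Table 1-5, assuming the cleaned DataFrame df from the earlier sketch:

import pandas as pd

# Bin Age into the four groups used above; right edges are inclusive,
# so the bins are (0, 25], (25, 45], (45, 65] and (65, 100].
age_groups = pd.cut(df['Age'], bins=[0, 25, 45, 65, 100])

print(age_groups.value_counts(sort=False))                  # counts per group
print(age_groups.value_counts(sort=False, normalize=True))  # shares per group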

We see that voters aged 0-25 make up the smallest demographic. This is understandable as the
voting age is generally 18+. The rest of the voters are fairly evenly distributed across the other
age groups, with the most voters lying in the 45-65 group.

Figure 1-2 Count of Voters Across Age Groups


Numerical Categorical (Ordinal) Data


Although the below columns hold a numeric data type, the values are not continuous: the
numbers have no direct mathematical meaning, i.e. they are ordinal, representing a graded scale.

• National Economic Cond (NEC)
• Household Economic Cond (HEC)
• Blair
• Hague
• Europe
• Political Knowledge
Because of this, no histograms are plotted for these columns, and skew is not a meaningful
concept for them either.
National Economic Condition
The boxplot shows outliers lying below the bottom (left) whisker. This is expected, as the persons
who assessed the NEC as 1 make up a very small percentage of the total.

Figure 1-3 NEC Data – Boxplot and Count Plot

• The largest share of the population scored NEC as 3 or 4. This shows that the majority
(~75% of the total) of the survey demographic holds a moderate view of the NEC.
• The fewest members of the survey demographic scored NEC as 1.

NEC Assessment Values   Count   Percentage
1                       37      2.44
2                       256     16.88
3                       604     39.81
4                       538     35.46
5                       82      5.40
Table 1-6 NEC Assessment Data Distribution


Household Economic Condition

The boxplot shows outliers lying below the bottom (left) whisker, as only a small number of
persons gave a score of 1 for HEC compared to the higher scores.

Figure 1-4 HEC Data – Boxplot and Count Plot

• The largest share of the population scored HEC as 3, followed by a score of 4. This shows
that the majority of the survey demographic (~70%) holds a moderate view of the HEC.
• The fewest members of the survey demographic hold the poorest assessment of HEC.

HEC Assessment Values   Count   Percentage
1                       65      4.28
2                       280     18.45
3                       645     42.51
4                       435     28.67
5                       92      6.06
Table 1-7 HEC Assessment Data Distribution

Blair
The boxplot shows no outliers.

Figure 1-5 Blair Assessment Data – Boxplot and Count Plot


• The largest share of the population (~65%) scored Blair (the Labour leader) as 4 or 5. This
shows the Labour party has a good standing.

Blair Assessment Values Count Percentage


1 97 6.39
2 434 28.60
3 1 0.065
4 833 54.91
5 152 10.01
Table 1-8 Blair Assessment Data Distribution

Hague
The boxplot shows no outliers.

Figure 1-6 Hague Assessment Data – Boxplot and Count Plot

• The numbers of people who scored the Conservative leader 2 or 4 are close (40.67% for
a score of 2 vs 36.71% for a score of 4). This indicates a wide divide in opinion here.
• A higher percentage of people (~56%) have a poor assessment of Hague, having scored
him 1 or 2 (seen in the table below).

Hague Assessment Values Count Percentage


1 233 15.35
2 617 40.67
3 37 2.43
4 557 36.71
5 73 4.81
Table 1-9 Hague Assessment Data Distribution

Europe
The boxplot shows no outliers. The median value of the distribution is 6.


Figure 1-7 Europe Data – Boxplot and Count Plot

• ~22% of people show high skepticism where Europe is concerned (a score of 11). This is
the most popular score.
• If we group scores 1-5 as positive towards Europe, we cover around 37% of the
demographic.
• Taking a score of 6 as neutral, we treat scores 7-11 as negative; 49% of survey takers fall
in this sentiment bracket.

Europe Sentiment Values Count Percentage


1 109 7.1852
2 77 5.0758
3 128 8.4377
4 126 8.3059
5 123 8.1081
6 207 13.6454
7 86 5.6691
8 111 7.3171
9 111 7.3171
10 101 6.6579
11 338 22.2808
Table 1-10 Europe Sentiment Score Distribution

Political Knowledge
The boxplot shows no outliers.


Figure 1-8 PK Data – Boxplot and Count Plot

• More than 65% of the population have a PK score of 2 or higher, which indicates a good
general know-how of the political situation.
• ~30% of people have a PK score of 0, which is a cause for concern.

Political Knowledge Score Count Percentage


0 454 29.92
1 38 2.50
2 776 51.15
3 249 16.41
Table 1-11 PK Assessment Data Distribution

Gender
This column holds the gender distribution of the
survey takers. There is a higher number of
females as compared to males.

Figure 1-9 Gender Count Plot

Gender Count Percentage


Female 808 53.26
Male 709 46.73
Table 1-12 Gender Data Distribution


Vote
This column holds the distribution of votes cast by the demographic. This is the target of our
model generation, i.e. our goal is for the model to correctly predict whether a citizen will vote
for the Labour or the Conservative party.
The data makes clear that Labour has a clear majority of votes, i.e. there is a class imbalance in
the target variable.

Figure 1-10 Vote Count Plot

Vote Cast      Count   Percentage
Labour         1062    69.79
Conservative   462     30.29
Table 1-13 Vote Data Distribution

Age Distributions Across Other Features


We analyze how each age group has scored the various input variables. The insights from this
will be crucial when deciding political policies targeting specific voter age groups.
1.2.1.11.1 Age + NEC/HEC Relationship
The plots below show the distribution of NEC and HEC scores across each age group.

            NEC                              HEC
Age Group   Popular Score  Unpopular Score   Popular Score  Unpopular Score
0, 25       4              1/5               4              1
25, 45      3              1                 4              1
45, 65      3              1                 3              5
65, 100     3              1                 3              1
Table 1-14 NEC & HEC Across Age Groups

1. The HEC and NEC scores follow a similar pattern across all age groups, i.e. the popular
and unpopular scores for both are around the same. This indicates that voters associate
their household economic conditions with those of the nation.
2. The most popular scores are 3 or 4 for both HEC and NEC.
3. The least popular score is mostly 1. This shows that the voters have a generally positive
attitude towards the NEC and HEC.


Figure 1-11 Voter Age Groups – NEC and HEC Scoring

1.2.1.11.2 Age + Blair/Hague Relationship

The plots below show the distribution of Blair and Hague scores across each age group.

            Blair                            Hague
Age Group   Popular Score  Unpopular Score   Popular Score  Unpopular Score
0, 25       4              5                 1              3/5
25, 45      4              1                 2              3
45, 65      4              1                 2              3
65, 100     4              1                 2/4            3
Table 1-15 Blair & Hague Across Age Groups

1. The Blair and Hague scores do not follow a similar pattern.
2. All age groups consistently prefer Blair over Hague.
3. The 65-100 age group has similar numbers of voters scoring Hague 2 and 4, which shows
that this age group is divided in its opinion.
4. The least popular score for Blair is mostly 1, showing strong support.
5. The least chosen score for both Blair and Hague is 3. This shows that voters hold strong
opinions and do not prefer to score moderately.

Figure 1-12 Voter Age Groups – Blair-Hague Scoring

1.2.1.11.3 Age + PK & Europe Relationship

The plots below show the distribution of Political Knowledge and Europe Sentiment scores
across each age group.
As Europe sentiment has a 1-11 range, it has been grouped as [1-2]: Low, [3-5]: Low-Mid,
[6-8]: Mid-High and [9-11]: High. This grouping helps the analysis by reducing the number of
options, as sketched below.
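A minimal sketch of this grouping, assuming the cleaned DataFrame df (the new column name is illustrative):

import pandas as pd

# Bin the 1-11 Europe score into four bands; right edges are inclusive,
# so the integer bins are [1-2], [3-5], [6-8] and [9-11].
bins = [0, 2, 5, 8, 11]
labels = ['Low', 'Low-Mid', 'Mid-High', 'High']
df['Europe Group'] = pd.cut(df['Europe'], bins=bins, labels=labels)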

            Political Knowledge              Europe Sentiment
Age Group   Popular Score  Unpopular Score   Popular Score  Unpopular Score
0, 25       0              3                 Low-Mid        Low
25, 45      2              1                 Mid-High       Low
45, 65      2              1                 High           Low
65, 100     2              1                 High           Low
Table 1-16 Europe Scoring & Political Knowledge Across Age Groups

1. Almost all age groups claim political knowledge on the higher end. This claim could only
be substantiated by an actual political knowledge quiz or enquiry.
2. The fewest voters claim low political knowledge.
3. The popular Europe Sentiment score varies across age groups, but the unpopular score
is consistently Low.

Figure 1-13 Voter Age Groups – Europe Scoring & Political Knowledge


1.2.1.11.4 Age + Gender Relationship

The plot below shows the distribution of gender across each age group. There are more females
than males in every age group. The reason for this should be checked; it may relate to the
location or time at which the survey was conducted.

Figure 1-14 Voter Age Groups – Gender

Vote Distributions Across Other Features


Vote is the target variable, i.e. the data we are trying to predict. Checking the behavior of the
other parameters with respect to Vote can provide valuable insights.
1.2.1.12.1 Age + Vote Relationship
The plot below shows the distribution of votes across each age group. All groups have clearly
chosen Labour as their political party of choice.

Figure 1-15 Vote Decision- Age Groups


1.2.1.12.2 Vote + NEC-HEC Scores


The party preference of the voters against their scoring of the National EC and Household EC
has been plotted and the insights documented.
1. Voters who chose Labour mainly scored NEC and HEC 3 or 4. This indicates that they have
a mid to high level of satisfaction with the economic conditions.
2. Voters who chose Conservative mainly scored NEC and HEC 2 or 3. This indicates that
they have a low to mid level of satisfaction with the economic conditions.
3. For NEC scores of 3 and 4 and HEC scores of 3 and 4, the Labour vote is more than double
that of the Conservative party.

Figure 1-16 Vote Decision- HEC and NEC Scores

1.2.1.12.3 Vote + Blair-Hague Scores


The party preference of the voters against their scoring of Blair and Hague has been plotted
and the insights documented.
1. Voters who chose Labour have a high assessment of Blair and a generally low assessment
of Hague.
2. Similarly, Conservative voters have a high assessment of Hague and a lower assessment
of Blair.


Figure 1-17 Vote Decision- Blair-Hague Scores

1.2.1.12.4 Vote + Political Knowledge Scores


The party preference of the voters against their Political Knowledge scores has been plotted
and the insights documented.
1. A large number of Labour voters have rated their political knowledge as 0. This indicates
that they may be voting Labour out of loyalty (if Labour is currently in power) or out of
general discontent (if the Conservatives are currently in power).
2. A majority of both parties' voters claim a rather high political knowledge score.

Figure 1-18 Vote Decision- Political Knowledge Scores

1.2.1.12.5 Vote + Europe Sentiment Scores


The party preference of the voters against their Europe Sentiment scores has been plotted and
the insights documented.


Figure 1-19 Vote Decision- Europe Sentiment Scores

1. A clear majority of Conservative voters have given the maximum Europe Sentiment score
(indicating negativity).
2. In comparison, although a large number of Labour voters have also given the highest
score of 11 (indicating negativity), a greater number have scored it 6, which indicates
moderate sentiment.
3. Overall, Labour voters have scored on the lower end, showing a positive sentiment
towards Europe.
1.2.1.12.6 Vote + Gender Relationship
The party preference of the voters against their gender has been plotted and the insights
documented. Following the general vote trend seen in 1.2.1.10 Vote, more voters have selected
Labour irrespective of their gender.

Figure 1-20 Vote Decision- Gender


1.2.2 Bivariate Analysis


The relationships between the different numerical columns of the dataset can be visualized
with a pair plot. In addition, a heat map of the correlations lets us understand the degree of
correlation between the data columns. Both the pair plot and the heat map for all parameters
have been constructed and are placed below, with a plotting sketch after this paragraph.
The pair plot does not show any discernible patterns between any of the parameters.
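A minimal sketch of the two plots, assuming seaborn and matplotlib are available and df is the cleaned DataFrame:

import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot of the columns, colored by the target variable.
sns.pairplot(df, hue='Vote')
plt.show()

# Heat map of pairwise correlations between the numeric columns.
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()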

Figure 1-21 Voters – Pair Plot


Supporting the inference from the pair plot, the heat map clearly shows low to no correlation
between any of the data parameters.

Figure 1-22 Voters – Numerical Data Heat Map


1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here
or not? Data Split: Split the data into train and test (70:30). (4 marks)
1.3.1 Encoding of Categorical Data
Models are developed on numerical data. Given this, we convert the unique values in the
categorical columns into numerical values; the converted data is then used in the modelling
operations.
The voter dataset has 2 object variables. We convert these columns to numerical form before
using them to build the prediction model.

• As these are not ordinal in nature, we could use one-hot encoding. This ensures that the
machine learning model does not assume that higher values are more important.
• But as each object column here has only two values (Male/Female and
Labour/Conservative), we can simply do a value replacement or a categorical-codes
conversion, as sketched after the table below.
After completion of encoding, the below values are replaced within the dataset.

Gender Vote
Object Value Numerical Value Object Value Numerical Value
Male 0 Conservative 0
Female 1 Labour 1
Table 1-17 Categorical Values to Numerical Number Codes
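A minimal sketch of this replacement, assuming the cleaned DataFrame df with the value spellings seen in the sample data:

# Simple value replacement mirroring Table 1-17.
df['Gender'] = df['Gender'].replace({'male': 0, 'female': 1})
df['Vote'] = df['Vote'].replace({'Conservative': 0, 'Labour': 1})

# Equivalent alternative: categorical codes. Note that codes are
# assigned in lexical order, which may differ from Table 1-17.
# df['Gender'] = df['Gender'].astype('category').cat.codes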

1.3.2 Split of Data – Train and Test

The train-test split operation is performed before the model is built and trained, for both
classification and regression problems; it is used with supervised learning algorithms. In this
operation, the dataset is randomly divided into two subsets.
1. Subset 1 is used to train the model and is named the training dataset.
2. Subset 2 is used to test the created model and is therefore named the test dataset.
3. The model is trained with the training dataset, which includes the independent data
variables and the associated target data.
4. The trained model is then given the test dataset's independent variables as input and
generates predictions as output.
5. The predictions made by the model are then compared against the expected values, i.e.
the target data of the test dataset.
The comparison between the actual target values and the model-predicted target values is used
to evaluate the model performance.
There are two main configuration parameters used to create the training and test data
subsets.


• test_size: This is the size of the test set, expressed as a fraction between 0 and 1. The
specified fraction of the data is collected into the test subset.
For example, if a dataset has a total of 1000 rows, specifying test_size=0.3 yields a test
subset of 300 data points (30% of 1000) and a training subset of 700 entries (70% of 1000).
• random_state: This input initializes the internal random number generator, which
decides how the data is split into train and test subsets.
It should be set to the same value if consistent results are expected over multiple runs of
the code.
The splitting of the data is done using the train_test_split function from the python module
sklearn.

1. For the voter dataset, the split operation is performed with the inputs random_state=1,
test_size=0.3, as sketched below.
2. The train and test subsets of independent input variables have the following shapes:
Training subset = 1061 rows, 8 columns
Test subset = 456 rows, 8 columns
3. The train and test target variables returned by train_test_split have the following shapes:
Training subset target data = 1061 rows, 1 column
Test subset target data = 456 rows, 1 column
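A minimal sketch of the split, assuming X holds the encoded independent variables and y the Vote target (the variable names are illustrative):

from sklearn.model_selection import train_test_split

# 70:30 split with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

print(X_train.shape, X_test.shape)  # expected: (1061, 8) (456, 8)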

1.3.3 Scaling Necessity


What is Scaling
Scaling is a preprocessing step applied to the independent variables in order to normalize the
data within a particular range. Most datasets have features that vary highly in magnitude, units
and range, like dollars, age, probability etc.
Why is Scaling Needed
• Scaling ensures that all the features are given equal importance, i.e. having features on
the same scale ensures that they all contribute equally to the result.
• If scaling is not performed, the large-scale variables may dominate the small-scale
features. This results in a poor model.
• Certain models rely on iterative convergence; their computation time benefits when the
data has been scaled.


When is Scaling Done

• Certain machine learning algorithms, such as distance-based algorithms like KNN and
gradient-descent-based algorithms like Logistic Regression, are sensitive to feature
scaling, and some benefit can be seen from scaling the data.
• ML models such as LDA and Naïve Bayes do not behave this way, so there is no added
benefit in scaling the data for them.
Voter Dataset – Scaling Decision
In the case of the voter dataset, we see that some of the features are holding disproportionate
data magnitudes and units.
1. Age varies from 24 to 93.
2. The scoring scales differ: some have a 1-5 range, others 0-3 and 1-11.
3. Inputs like Gender are binary and hold only 0 or 1.
With the presence of these varied data magnitudes/units, it is advisable to perform the scaling
on the dataset.
Voter Dataset – Scaling Process
In this case, we will be scaling the data by computing the min-max score value for all the values.
The formula is displayed below. In min-max scaling the minimum value of a feature gets
transformed into a 0 and the maximum value gets transformed into a 1. Every other value gets
transformed into a decimal between 0 and 1. This scaling operation is performed using the
MinMaxScaler function in Python.

x_scaled = (x − min(x)) / (max(x) − min(x))

where
x_scaled = scaled value of x
x = observed value
min(x) = minimum value of feature x
max(x) = maximum value of feature x
Formula 1-1 Min-Max Calculation
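As a worked example, Age = 43 maps to (43 − 24) / (93 − 24) ≈ 0.275, using the Age min and max from Table 1-3. A minimal sketch of the scaling step, assuming the split arrays from the previous section (fitting on the training data only is a common choice to avoid leakage):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()  # maps each feature to the [0, 1] range

# Fit on the training data only, then apply the same transform to the
# test data so that no test information leaks into the training step.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)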

Voter Dataset - Before and After Scaling


A plot of the data distributions before and after the scaling process is shown below. It is evident
that before scaling the data ranges differ, with no overlap between the plots; after scaling, the
ranges of all features are normalized to within 0 and 1.


Figure 1-23 Range of Data Distribution all Columns

Figure 1-24 Range of Data Distribution all Columns After Scaling

We check the standard deviations of the data before and after the scaling operation. It is
evident that the difference in spread has reduced post scaling.

Columns                    STD of Unscaled Data   STD of Scaled Data
Age                        15.7                   0.225
National Economic Cond.    0.88                   0.214
Household Economic Cond.   0.93                   0.235
Blair                      1.17                   0.292
Hague                      1.23                   0.308
Europe                     3.3                    0.326
Political Knowledge        1.08                   0.359
Gender                     0.499                  0.499
Table 1-18 Before and After Scaling - STD Comparison


A sample of the dataset post scaling is shown below. Only the independent features are displayed.

   Age   National Econ. Cond.   Household Econ. Cond.   Blair   Hague   Europe   Political Knowledge   Gender
0  0.14  0.25                   0.75                    0       0.75    1        0.67                  1
1  0.23  0.75                   0.5                     0.75    0.75    0.5      0                     0
2  0.54  0.75                   0.5                     0.75    0.75    0.6      0.67                  1
3  0.33  0.5                    0.5                     0.75    0.25    1        0                     0
4  0.29  1                      0.5                     0.75    0.25    0.7      0                     0
Table 1-19 Scaled Voter Data

1.3.4 Data Balancing Issue - SMOTE


In the case of the voter dataset, we see that the target data is imbalanced: 70% of the entries point to the Labour party and only 30% to the Conservative party.
Models trained on balanced datasets generally perform better. In order to mitigate this problem, we apply SMOTE (Synthetic Minority Over-Sampling Technique). In essence, it oversamples the minority class and creates a more robust training dataset that can be used for model training. Here we use the SMOTE function from the imblearn.over_sampling library.
The function is applied on the training data only. The only parameter passed to this function is random_state. This input initializes the internal random number generator that controls how the synthetic samples are generated.
This input should be set to the same value if the same results are to be expected over multiple runs of the code. On completion of the SMOTE execution, the output below is seen. We see that the training dataset has been padded to produce a balanced dataset.

Training Data Analysis                       Before SMOTE    After SMOTE
Number of Rows in Training Data              1061            1508
Number of Conservative Target Variables      307             754
Number of Labour Target Variables            754             754
Ratio in Target Variable                     29:71           50:50
Table 1-20 Before and After SMOTE
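A minimal sketch of this balancing step, assuming the scaled training subsets from the previous section (hypothetical names):

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=1)   # fixed seed keeps the synthetic samples reproducible
X_train_bal, y_train_bal = smote.fit_resample(X_train_scaled, y_train)
# y_train_bal now holds 754 Labour and 754 Conservative entries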

The modelling plan with respect to data balancing is as follows:
1. Use unbalanced data to train Model1
2. Use SMOTE-balanced data to train Model2
3. Check the Model1 and Model2 scores and compare them
4. The inputs of the superior model (unbalanced data or SMOTE-balanced data) will be used as inputs for the generation of Model3, which has its hyper-parameters tuned.


1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.4.1 Logistic Regression Model
Logistic regression is a supervised learning technique for a binary response. The two response classes are Positive/Negative; the output is given as the probability of the positive class based on the values of the predictors.
Log. Reg. Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
Log. Reg. Model Step2- Model Build
The logistic regression model is constructed using the function LogisticRegression from the sklearn.linear_model library. Arguments passed to this function are as below:

• solver=’newton-cg’
This is the algorithm to use in the optimization problem.
• max_iter=10000
10,000 is the maximum number of iterations for the solvers to converge.
• penalty='none'
No penalty term is added to the model.
• tol=0.0001
This is the tolerance value for the stopping criteria.
• verbose=True
Setting this to true allows the progress messages to be printed out
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same
parameters and the same training data.
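A minimal sketch of the model build with the arguments listed above, assuming the scaled train/test subsets and target variables created earlier (hypothetical names):

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(solver='newton-cg', max_iter=10000, penalty='none',
                             tol=0.0001, verbose=True, random_state=1)
log_reg.fit(X_train_scaled, y_train)           # train the model
print(log_reg.score(X_train_scaled, y_train))  # mean accuracy on the train data
print(log_reg.score(X_test_scaled, y_test))    # mean accuracy on the test data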
The mean accuracy of the models built using the scaled-unbalanced and the scaled-balanced datasets is as follows.

                        Trained with Scaled    Trained with Scaled
                        Unbalanced Data        Balanced Data
Train Data Score        83.12                  83.35
Test Data Score         83.5                   80.70
Used for Model Tuning   YES                    NO
Table 1-21 Log. Reg. Model Scores

Log. Reg. Model Performance

1. The scores for the train and test datasets are similar, indicating that the generated models are not over-fitted.
2. The scores of the models are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.

1.4.2 Linear Discriminant Analysis Model


Linear Discriminant Analysis (LDA) is a supervised method used for classifying observations to a
class or category based on predictor (independent) variables of the data. We use the accuracy
score of the model to check its quality and performance.
LDA Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
LDA Model Step2- Model Build
The LDA model is constructed using the function LinearDiscriminantAnalysis from the sklearn.discriminant_analysis library. We use default arguments for this model generation. The main default arguments are:

• solver=’svd’
This is the algorithm to use in the optimization problem.
• tol=0.0001
This is the tolerance value for the stopping criteria.
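A minimal sketch of the LDA model build with these defaults, under the same variable-name assumptions as before:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()         # defaults: solver='svd', tol=0.0001
lda.fit(X_train_scaled, y_train)
print(lda.score(X_test_scaled, y_test))    # mean accuracy on the test data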
The mean accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        83.41                    83.41                  83.68
Test Data Score         83.33                    83.33                  81.14
Used for Model Tuning   NO                       YES                    NO
Table 1-22 LDA Model Scores

LDA Model Performance

1. The scores for the train and test datasets are similar, indicating that the generated model is not over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. The scores for the model trained with scaled and unscaled data are the same.
4. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.


1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.5.1 KNN Model
K-Nearest Neighbors or KNN is a supervised learning technique used for classification and
regression. It considers the K nearest data points (neighbors) to predict the class or continuous
value.
KNN Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
KNN Model Step2- Model Build
The KNN model is constructed using the function KNeighborsClassifier from the sklearn.neighbors library. Arguments passed to this function are as below:

• n_neighbors=5
This is the number of neighbors that is used by default i.e. the k value.
• weights= “uniform”
This is the default weight function that is used in prediction.
• algorithm= “auto”
The algorithm that is used to compute the nearest neighbors. “auto” option decides the
most appropriate algorithm based on input values.
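A minimal sketch of the KNN model build with the arguments listed above (variable names are assumptions carried over from earlier sections):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto')
knn.fit(X_train_scaled, y_train)           # KNN is distance based, so scaled data is used
print(knn.score(X_test_scaled, y_test))    # mean accuracy on the test data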
The mean accuracy of the models built using the scaled-unbalanced and the scaled-balanced datasets is as follows.

                        Trained with Scaled    Trained with Scaled
                        Unbalanced Data        Balanced Data
Train Data Score        85.67                  88.52
Test Data Score         82.01                  80.04
Used for Model Tuning   YES                    NO
Table 1-23 KNN Model Scores

KNN Model Performance

1. The scores for the train and test datasets are similar, indicating that the generated model is not over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.


1.5.2 Naïve Bayes Analysis Model


Naïve Bayes is a classification algorithm based on the Bayes theorem. Naïve Bayes works on the
assumption that all features are equal and independent. It is easy and fast and can be used for
multiclass prediction.
Naïve Bayes Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test. As with the previous models, training is done with the unscaled-unbalanced, scaled-unbalanced and SMOTE-balanced (section 1.3.4 Data Balancing Issue) variants of the training data.
Naïve Bayes Model Step2- Model Build
The Naïve Bayes model is constructed using the function GaussianNB from the sklearn.naive_bayes library. We use default arguments for this model generation.

• var_smoothing = 1e-9
Portion of the largest variance of all features that is added to variances for calculation
stability.
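A minimal sketch of the Naïve Bayes model build (variable names are assumptions; as noted above, scaling brings no added benefit for this model):

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()                  # default var_smoothing=1e-9
nb.fit(X_train, y_train)           # unscaled data works equally well here
print(nb.score(X_test, y_test))    # mean accuracy on the test data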
The mean accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        83.50                    83.50                  83.28
Test Data Score         82.23                    82.23                  80.70
Used for Model Tuning   NO                       YES                    NO
Table 1-24 Naïve Bayes Model Scores

Naïve Bayes Model Performance

1. The scores for the train and test datasets are similar, indicating that the generated model is not over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. The scores for the model trained with scaled and unscaled data are the same.
4. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.


1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.6.1 Bagging Models
Bagging Classifier Model
A Bagging classifier is an ensemble technique that fits base classifiers each on random subsets of
the original dataset. It then aggregates their individual predictions (either by voting or by
averaging) to form a final prediction.
1.6.1.1.1 Bagging Classifier Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
1.6.1.1.2 Bagging Classifier Model Step2 – Model Build
The classifier is constructed using the BaggingClassifier function that is part of the sklearn.ensemble library. Arguments passed to this function are as below:

• base_estimator = DecisionTreeClassifier object
The base estimator to fit on random subsets of the dataset. The default value is a Decision Tree Classifier.
• n_estimators = 100
The number of estimators in the ensemble. In general, a higher number makes the
predictions more stable, but it also slows down the computation.
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same
parameters and the same training data.
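A minimal sketch of the Bagging Classifier build with the arguments listed above (variable names are assumptions):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                        n_estimators=100, random_state=1)
bag.fit(X_train, y_train)          # each base tree is fit on a bootstrap sample
print(bag.score(X_test, y_test))   # mean accuracy on the test data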
The constructed model is then used to fit the training dataset in order to complete the model training operation. The mean accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        100.0                    100.0                  100.0
Test Data Score         82.02                    81.79                  81.35
Used for Model Tuning   YES                      YES                    NO
Table 1-25 Bagging Model Scores

1.6.1.1.3 Bagging Classifier Model Performance

1. The score for the train dataset is 100%. Although the score for the test dataset is high as well, a 100% train score indicates that the model has been over-fitted.
2. The score of the model is good for the test dataset, indicating reasonable model performance for the prediction of the output.
3. The model trained with unscaled-unbalanced data has performed better than the other models.
4. We shall use the unscaled-unbalanced data to train the tuned model with hyper-parameters.

Random Forest Bagging Classifier


Random forest is a supervised learning algorithm that can be used for both classification and
regression tasks. The "forest" that is built is an ensemble of decision trees, usually trained with
the “bagging” method. The idea of the bagging method is that a combination of learning models
increases the overall result. To condense the ideas, a random forest builds multiple decision trees
and merges them together to get a more accurate and stable prediction.
Random forest is an extension of bagging that also randomly selects subsets of the features used in each data sample.
1.6.1.2.1 RF Modelling Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
1.6.1.2.2 RF Modelling Step2 – Model Build
The random forest is constructed using the function RandomForestClassifier from the sklearn.ensemble library. Arguments passed to this function are as below:

• criterion = 'gini'
The function to measure the quality of a split. 'gini' is the default value of the criterion argument, so it does not have to be explicitly specified.
• n_estimators = 500
The number of trees in the forest. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.
• oob_score=True
This value decides whether to use out-of-bag samples to estimate the generalization
score.
• max_features=5
The number of features to consider when looking for the best split.
• max_depth = 8
The maximum depth of the tree. If no input is specified, then nodes are expanded until
all leaves are pure or until all leaves contain less than “min_samples_split” samples.
• min_samples_leaf=20
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least “min_samples_leaf” training samples in each of the left and right branches.
Generally, this value will be 1% to 3% of the total number of data points.
• min_samples_split=60
The minimum number of samples required to split an internal node.
Generally, this value will be three times the value set to “min_samples_leaf”.
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same
parameters and the same training data.
The constructed model is then used to fit the training dataset in order to complete the model
training operation.
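A minimal sketch of the Random Forest build with the arguments listed above, assuming X_train is a DataFrame so that column names are available (hypothetical names):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(criterion='gini', n_estimators=500, oob_score=True,
                            max_features=5, max_depth=8, min_samples_leaf=20,
                            min_samples_split=60, random_state=1)
rf.fit(X_train, y_train)
print(rf.oob_score_)   # out-of-bag validation score
for name, imp in zip(X_train.columns, rf.feature_importances_):
    print(name, round(imp, 4))   # per-feature importance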
1.6.1.2.3 RF Modelling Step3 – Checking Features’ Importance and OOB Score
The generated model gives the importance of each of the features that will impact the output.
Out of bag (OOB) score is a way of validating the Random forest model. It is computed as the
number of correctly predicted rows from the out-of-bag sample. Both these data are tabulated
below.
We can see that Blair, Hague and Europe features hold the most influence as compared to the
other features. The order of importance of Blair and Hague are interchanged between the two
tabulated outputs.

Trained with Unscaled / Scaled Unbalanced Data          Trained with Scaled Balanced Data
Columns                     Importance                  Columns                     Importance
Hague                       0.3611                      Hague                       0.3681
Blair                       0.2586                      Blair                       0.2775
Europe                      0.2067                      Europe                      0.1806
National Economic Cond.     0.0636                      National Economic Cond.     0.0915
Political Knowledge         0.0586                      Political Knowledge         0.0348
Age                         0.0363                      Age                         0.0322
Household Economic Cond.    0.0124                      Household Economic Cond.    0.0131
Gender                      0.0038                      Gender                      0.0021
OOB Score                   0.8294                      OOB Score                   0.8527
Table 1-26 RF-Computed Importance for All Features

The accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        84.63                    84.63                  86.47
Test Data Score         82.45                    82.45                  79.16
Used for Model Tuning   NO                       YES                    NO
Table 1-27 RF Bagging Model Scores

1.6.1.2.4 RF Model Performance

1. The scores for the train and test datasets are similar, indicating that the generated model is not over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. The scores for the model trained with scaled and unscaled data are the same.
4. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.

1.6.2 Boosting Models


Boosting is an ensemble modeling technique that attempts to build a strong classifier from a
number of weak classifiers. It uses an iterative process of model building.
An initial model is built from the training data. Following this, a second model is built in which the errors present in the first model are corrected. This process continues, and models are added until either the complete training dataset is predicted correctly or the maximum number of models has been added.
Ada Boost
Adaptive Boosting algorithm - AdaBoost, is a boosting technique used as an ensemble method in
ML. It is called adaptive boosting as the weights are re-assigned to each instance, with higher
weights assigned to incorrectly classified instances.
It works on the principle of learners growing sequentially i.e. except for the first learner, each
subsequent learner is grown from previously grown learners.
1.6.2.1.1 Ada Boost Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
1.6.2.1.2 Ada Boost Modelling Step2 – Model Build
The Ada Boost model is constructed using the function AdaBoostClassifier from the sklearn.ensemble library. Arguments passed to this function are as below:

• n_estimators = 500
The number of boosting stages to perform.


• random_state=1
Controls the random seed given at each `base_estimator` at each boosting iteration.
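A minimal sketch of the Ada Boost build with the arguments listed above; the same call is repeated for each data variant (variable names are assumptions):

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=500, random_state=1)
ada.fit(X_train, y_train)          # learners are grown sequentially on reweighted samples
print(ada.score(X_test, y_test))   # mean accuracy on the test data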
The constructed model is then used to fit the training dataset in order to complete the model training operation. The mean accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        85.76                    85.76                  89.32
Test Data Score         80.92                    80.92                  81.79
Used for Model Tuning   NO                       NO                     YES
Table 1-28 Ada Boost Model Scores

1.6.2.1.3 Ada Boost Model Performance

1. The scores for the train and test datasets are reasonably close, indicating that the generated model is not badly over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. The model trained with scaled-balanced data has performed better than the other models.
4. We shall use the scaled-balanced data to train the tuned model with hyper-parameters.

Gradient Boosting Model


Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of
a differentiable function. As gradient boosting is based on minimizing a loss function, different
types of loss functions can be used resulting in a flexible technique that can be applied to
regression, multi-class classification, etc.
Gradient boosting is a stage-wise additive model that generates learners during the learning
process. The contribution of the weak learner to the ensemble is based on the gradient descent
optimization process. The calculated contribution of each tree is based on minimizing the overall
error of the strong learner.
1.6.2.2.1 Gradient Boost Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
1.6.2.2.2 Gradient Boost Modelling Step2 – Model Build
The Gradient Boosting model is constructed using the function GradientBoostingClassifier from the sklearn.ensemble library. Arguments passed to this function are as below:


• n_estimators = 500
The number of boosting stages to perform. In general, a higher number makes the predictions more stable, but it also slows down the computation.
• random_state=1
Controls the random seed given to each Tree estimator at each boosting iteration.
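A minimal sketch of the Gradient Boosting build with the arguments listed above (variable names are assumptions):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(n_estimators=500, random_state=1)
gb.fit(X_train, y_train)           # stage-wise additive fitting of shallow trees
print(gb.score(X_test, y_test))    # mean accuracy on the test data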
The constructed model is then used to fit the training dataset in order to complete the model training operation. The accuracy of the models built using the unscaled-unbalanced, scaled-unbalanced and scaled-balanced datasets is as follows.

                        Trained with Unscaled    Trained with Scaled    Trained with Scaled
                        Unbalanced Data          Unbalanced Data        Balanced Data
Train Data Score        95.57                    95.57                  96.94
Test Data Score         82.23                    82.23                  82.67
Used for Model Tuning   NO                       NO                     YES
Table 1-29 Gradient Boost Model Scores

1.6.2.2.3 Gradient Boost Model Performance

1. The train score is much higher than the test score, indicating that the generated model is somewhat over-fitted.
2. The score of the model on the test dataset is good, indicating decent model performance for the prediction of the output.
3. The model trained with scaled-balanced data has performed better than the other models.
4. We shall use the scaled-balanced data to train the tuned model with hyper-parameters.

1.6.3 Model Tuning


We use the GridSearchCV function of the sklearn.model_selection module to identify the best possible combination of inputs to generate a better model.
We give the following combinations of inputs as the grid, along with the modelling algorithm, as input to the GridSearchCV function. It exhaustively generates candidates from the grid of parameter values specified, and the best inputs for that algorithm are then selected.
Data Used for Tuned Model Training
For the above model types, we have trained models using the below three types of data, namely:

• Unscaled-Unbalanced data
• Scaled-Unbalanced Data
• Scaled-Balanced Data

Based on the scores for these data variants, the data producing the best results is selected and then used for building a tuned model.

Model Name                   Data Used to Train Tuned Model
Logistic Regression Model    Scaled Unbalanced Data
LDA Model                    Scaled Unbalanced Data
KNN Model                    Scaled Unbalanced Data
Naïve Bayes Model            Scaled Unbalanced Data
Bagging Classifier Model     Unscaled Unbalanced Data
Random Forest Model          Scaled Unbalanced Data
Ada Boost Model              Scaled Balanced Data
Gradient Boost Model         Scaled Balanced Data
Table 1-30 Data Used to Train Tuned Models

Log. Reg. Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the Logistic Regression Model.
param_grid = {
'penalty':['l2','none','l1','elasticnet'],
'solver':['sag','lbfgs','saga','newton-cg','liblinear'],
'tol':[0.001,0.0001,0.00001],
'l1_ratio':[0.25,0.5,0.75],
'max_iter':[100,1000,10000]}
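A minimal sketch of feeding this grid to GridSearchCV; the scoring metric and fold count shown here are assumptions, not values stated in the report:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

grid = GridSearchCV(estimator=LogisticRegression(random_state=1),
                    param_grid=param_grid,      # the grid defined above
                    scoring='accuracy', cv=5)   # cv=5 is an assumed fold count
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)             # best parameter combination found
tuned_model = grid.best_estimator_   # refit on the full training data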

After the GridSearchCV function execution is complete, below is the set of best selected
parameters for the model.
{'l1_ratio': 0.25,
'max_iter': 10000,
'penalty': 'l1',
'solver': 'saga',
'tol': 0.01}

1.6.3.2.1 Log. Reg. Modelling: Simple Model Vs Tuned Model

We compare these results with the simple Logistic Regression model that was previously created in 1.4.1 Logistic Regression Model.
1. The score for the train data has remained almost the same.
2. The score for the test dataset has reduced slightly, so the GridSearchCV operation did not give a clear improvement in this case.

Model                               Dataset          Accuracy Score
Simple Logistic Regression Model    Train Dataset    83.12
                                    Test Dataset     83.55
Tuned Logistic Regression Model     Train Dataset    83.12
                                    Test Dataset     82.89
Table 1-31 Log. Reg.-Simple Vs Tuned Model Comparison

Conclusion: The tuned model has not performed better than the simple default model on the test data.

LDA. Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the LDA Model.
param_grid = {
'solver':[ 'svd', 'lsqr', 'eigen'],
'tol':[0.001,0.0001,0.00001]}

After the GridSearchCV function execution is complete, below is the set of best selected
parameters for the model.
{'solver': 'svd',
'tol': 1e-05}

1.6.3.3.1 LDA Modelling: Simple Model Vs Tuned Model

We compare these results with the LDA model that was previously created in 1.4.2 Linear Discriminant Analysis Model.
1. The scores for the train and the test data are unchanged.
2. In this case, the model tuning operation did not generate any improvement.

Model               Dataset          Accuracy Score
Simple LDA Model    Train Dataset    83.41
                    Test Dataset     83.33
Tuned LDA Model     Train Dataset    83.41
                    Test Dataset     83.33
Table 1-32 LDA-Simple Vs Tuned Model Comparison

Conclusion: The tuned model has not given better results as compared to the simple model.

KNN - Model Tuning (Neighbor-K Search)


In order to obtain the best model for the KNeighborsClassifier, we have to identify the value of K which gives the best score, i.e. the lowest misclassification error (MCE).
We compute the scores and the MCE for odd values of K from 1 to 19. These values are as below.

Neighbor K    Train Data Score    Test Data Score    MCE
1             1.0000              0.7566             0.2434
3             0.8822              0.7982             0.2018
5             0.8567              0.8202             0.1798
7             0.8417              0.8399             0.1601
9             0.8445              0.8421             0.1579
11            0.8445              0.8333             0.1667
13            0.8435              0.8377             0.1623
15            0.8379              0.8421             0.1579
17            0.8313              0.8377             0.1623
19            0.8351              0.8355             0.1645
Table 1-33 KNN – Scores for Different K
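A minimal sketch of the neighbor search loop behind this table, assuming the scaled train/test splits from section 1.3.2 (hypothetical names):

from sklearn.neighbors import KNeighborsClassifier

mce = {}
for k in range(1, 20, 2):                          # odd K values from 1 to 19
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    mce[k] = 1 - knn.score(X_test_scaled, y_test)  # MCE = 1 - accuracy
best_k = min(mce, key=mce.get)                     # K with the lowest MCE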

The MCE is plotted against the K neighbors count.

Figure 1-25 MCE-K Neighbors Plot

As per the above table and the plot, we deduce that the optimum value for K is 9. We build a model with K=9.
1.6.3.4.1 KNN Model - Simple Model Vs Tuned Model
We compare these results with the KNN model that was previously created in 1.5.1 KNN Model.
1. The score for the train data deteriorated a little, but the score for the test dataset has improved.
2. The neighbor search operation has given positive results.

Model               Dataset          Accuracy Score
Simple KNN Model    Train Dataset    85.67
                    Test Dataset     82.01
Tuned KNN Model     Train Dataset    84.44
                    Test Dataset     84.21
Table 1-34 KNN-Simple Vs Tuned Model Comparison

Conclusion: The tuned model has performed better than the simple default model.

Naïve Bayes Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the Naïve Bayes Model.
param_grid_rf = {'var_smoothing': np.logspace(0,-9, num=1000)}

After the GridSearchCV function execution is complete, the set of best selected parameters for
the model is as follows.
{'var_smoothing': 0.16114142772530193}

1.6.3.5.1 Naïve Bayes Model - Simple Model Vs Tuned Model


We compare these results with the Naïve Bayes model that was previously created in 1.5.2 Naïve
Bayes Analysis Model.
1. The score for the train data has remained almost the same.
2. The score for the test dataset has improved. Although the improvement is small, it shows that the GridSearchCV operation gave positive results in this case.

Model                       Dataset          Accuracy Score
Simple Naïve Bayes Model    Train Dataset    83.50
                            Test Dataset     82.23
Tuned Naïve Bayes Model     Train Dataset    83.22
                            Test Dataset     82.67
Table 1-35 Naïve Bayes-Simple Vs Tuned Model Comparison

Conclusion: The tuned model has performed slightly better than the simple default model.

Bagging Classifier Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the Bagging Classifier Model.
param_grid_rf = {
'base_estimator__max_depth' : [2, 3, 4, 5, 6],
'base_estimator__min_samples_leaf' : [10, 15, 20, 25, 30],
'base_estimator__min_samples_split' : [30, 45, 60, 75, 90],
'n_estimators': [100,125,150,175,200,225,250]}


The parameters with base_estimator__ are options for the DecisionTreeClassifier model
which is the base estimator input to the BaggingClassifier. After the function execution is
complete, we check the best selected parameters from this.
{'base_estimator__max_depth': 4,
'base_estimator__min_samples_leaf': 15,
'base_estimator__min_samples_split': 45,
'n_estimators': 175}

1.6.3.6.1 Bagging Classifier - Simple Model Vs Tuned Model


We compare these results with the Bagging Classifier model that was previously created in 1.6.1.1
Bagging Classifier Model.
1. The score for the train data indicates that the model is no longer over-fitted. The score for the test data has reduced slightly.
2. The GridSearchCV operation gave a better model, as it is not over-fitted.

Model                   Dataset          Accuracy Score
Simple Bagging Model    Train Dataset    100
                        Test Dataset     81.79
Tuned Bagging Model     Train Dataset    84.44
                        Test Dataset     80.70
Table 1-36 Bagging Classifier-Simple vs Tuned Model Comparison

Conclusion: The simple model has performed slightly better for the test data but as the tuned
model is not over-fitted, the tuned model is better.

Random Forest Bagging Model– Best Inputs (Grid Search)


Below is the parameter grid which is given as the input for the Random Forest Bagging model.
param_grid_rf = {
'max_depth': [6,7,8],
'max_features': [4,5,6],
'min_samples_leaf': [20, 25, 30],
'min_samples_split': [60, 75, 90],
'n_estimators': [200, 250, 300]}

After the function execution is complete, we check the best selected parameters from this.
{'max_depth': 6,
'max_features': 4,
'min_samples_leaf': 25,
'min_samples_split': 60,
'n_estimators': 300}


With these values set, we recheck the features' importance. There is a slight change in the feature importance as compared to the one calculated in Table 1-26 RF-Computed Importance for All Features.
The importance order of the top three features has remained the same.

Columns Importance
Hague 0.3517
Blair 0.2630
Europe 0.2141
National Economic Cond. 0.0666
Political Knowledge 0.0553
Age 0.0321
Household Economic Cond. 0.0130
Gender 0.0044
Table 1-37 RF-Computed Importance for All Features – GridSearchCV Best Parameters

1.6.3.7.1 RF Modelling: Simple Model Vs Tuned Model


We compare these results with the RF model that was previously created in 1.6.1.2 Random
Forest Bagging Classifier.
1. The scores for the train and test datasets are similar, indicating that the generated model is not over-fitted.
2. The scores of the model are good for both the test and training datasets, indicating good model performance for the prediction of the output.

Model             Dataset          Accuracy Score
Simple RF Model   Train Dataset    84.63
                  Test Dataset     82.45
Tuned RF Model    Train Dataset    85.20
                  Test Dataset     82.67
Table 1-38 RF-Simple vs Tuned Model Comparison

Conclusion: The tuned model has performed slightly better as compared to the simple model.

Ada Boost Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the Ada Boost Model.
param_grid_rf = {
'n_estimators':[100, 250, 500, 600, 700],
'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]}

Post execution of GridSearchCV, we check the best selected parameters from this.


{'learning_rate': 1, 'n_estimators': 250}

1.6.3.8.1 Ada Boost Model - Simple Model Vs Tuned Model


We compare these results with the Ada Boost model that was previously created in1.6.2.1 Ada
Boost.
1. The score for the train data and the test data has reduced slightly.
2. In this case, the model tuning operation did not generate any useful results.

Model                     Dataset          Accuracy Score
Simple Ada Boost Model    Train Dataset    89.32
                          Test Dataset     81.79
Tuned Ada Boost Model     Train Dataset    89.25
                          Test Dataset     81.57
Table 1-39 Ada Boost -Simple Vs Tuned Model Comparison

Conclusion: The tuned model has not given better results as compared to the simple model.

Gradient Boost Model - Model Tuning (Grid Search)


Below is the parameter grid which is given as the input for the Gradient Boost Model.
param_grid_rf = { 'n_estimators':[100, 250, 500, 600, 700],
'learning_rate': [0.01, 0.1, 0.2, 0.3, 0.35, 0.5, 0.7, 0.8, 1]}

Post execution of GridSearchCV, we check the best selected parameters from this.
{'learning_rate': 0.1, 'n_estimators': 100}

1.6.3.9.1 Gradient Boost Model - Simple Model Vs Tuned Model


We compare these results with the Gradient Boost model that was previously created in 1.6.2.2
Gradient Boosting Model.
1. The score for the train data is worse for the tuned model, but the score for the test data has improved.
2. In this case, the model tuning operation gave positive results.

Model                          Dataset          Accuracy Score
Simple Gradient Boost Model    Train Dataset    96.94
                               Test Dataset     82.67
Tuned Gradient Boost Model     Train Dataset    91.84
                               Test Dataset     83.33
Table 1-40 Gradient Boost -Simple Vs Tuned Model Comparison

Conclusion: The tuned model has performed better than the simple default model.

1.6.4 All Model Scores


Below are the scores for all the models that have been developed, in tabulated format. We select the best model of each model type and further check its performance, i.e. AUC, confusion matrix, ROC curve etc.
The best performer in each model type has been highlighted. Performance parameters for that model are checked in depth in section 1.7.2 Model Performance Decisions.

Model                              Train Score    Test Score    Best Parameters (when applicable)
Logistic Regression Scaled         83.13          83.55
Logistic Regression Smote          83.36          80.7
GridSearchCV Logistic Regression   83.13          82.89         {'l1_ratio': 0.25, 'max_iter': 10000, 'penalty': 'l1', 'solver': 'saga', 'tol': 0.01}
LDA                                83.41          83.33
LDA Scaled                         83.41          83.33
LDA Smote                          83.69          81.14
GridSearchCV LDA                   83.41          83.33         {'solver': 'svd', 'tol': 1e-05}
KNN Scaled                         85.67          82.02
KNN Smote                          88.53          80.04
Neighbor Search KNN                84.45          84.21         n_neighbors=9
Naive Bayes                        83.51          82.24
Naive Bayes Scaled                 83.51          82.24
Naive Bayes Smote                  83.29          80.7
GridSearchCV Naive Bayes           83.22          82.68         {'var_smoothing': 0.16114142772530193}
Bagging Classifier                 100            82.02
Bagging Classifier Scaled          100            81.8
Bagging Classifier Smote           100            81.36
GridSearchCV Bagging Classifier    84.44          80.70         {'base_estimator__max_depth': 4, 'base_estimator__min_samples_leaf': 15, 'base_estimator__min_samples_split': 45, 'n_estimators': 175}
RF Bagging                         84.64          82.46
RF Bagging Scaled                  84.64          82.46
RF Bagging Smote                   86.47          79.17
GridSearchCV RF Bagging            85.20          82.67         {'max_depth': 6, 'max_features': 4, 'min_samples_leaf': 25, 'min_samples_split': 60, 'n_estimators': 300}
Ada Boost Classifier               85.77          80.92
Ada Boost Classifier Scaled        85.77          80.92
Ada Boost Classifier Smote         89.32          81.8
GridSearchCV Ada Boost             89.26          81.58         {'learning_rate': 1, 'n_estimators': 250}
Gradient Boost                     95.57          82.24
Gradient Boost Scaled              95.57          82.24
Gradient Boost Smote               96.95          82.68
GridSearchCV Gradient Boost        91.84          83.33         {'learning_rate': 0.1, 'n_estimators': 100}
Table 1-41 All Model Scores


1.7 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model. Final Model: Compare the models and write inference which
model is best/optimized. (7 marks)
Using the best models generated (see section 1.6.4 All Model Scores) detailed performance
parameters are computed and compared. The performance parameters are the confusion matrix,
the classification report, AUC score and the ROC curve.

1.7.1 Performance Metrics


Confusion Matrix
A confusion matrix is an NxN matrix used for evaluating the performance of a classification model, where N is the number of target classes. It compares the actual target values with those predicted by the built machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

• A classification target variable (binary) has two possible values: Positive/Negative, Up/Down etc.
• The columns represent the predicted values of the target variable
• The rows represent the actual values of the target variable

                         PREDICTED
                   Positive    Negative
TRUE    Positive   TP          FN
LABEL   Negative   FP          TN
Figure 1-26 Confusion Matrix

The components that form the matrix are:

• True Positive (TP)
The actual value and the model-predicted value match, and the predicted value is Positive.
• True Negative (TN)
The actual value and the model-predicted value match, and the predicted value is Negative.
• False Positive (FP)
The actual value and the model-predicted value do not match. The actual value is Negative but was incorrectly predicted as Positive.
This is known as Type I error.
• False Negative (FN)
The actual value and the model-predicted value do not match. The actual value is Positive but was incorrectly predicted as Negative.
This is known as Type II error.


Using the confusion matrix, the metrics accuracy, precision, recall and specificity are derived.
1.7.1.1.1 Accuracy
Accuracy (ACC) is the number of all correct predictions divided by the total number of the dataset.
The best accuracy is 1.0, whereas the worst is 0.0

accuracy = (TP + TN) / (TP + TN + FP + FN)
Formula 1-2 Confusion Matrix – Accuracy

Accuracy is not the best metric to check, especially if there is an imbalanced dataset. In such cases, the accuracy metric does not give a correct understanding. In order to mitigate this, we use the additional metrics of precision and recall.
1.7.1.1.2 Precision
Precision (PREC) is calculated as the number of correct positive predictions divided by the total number of positive predictions. It tells us how many of the cases predicted as positive are actually positive.
It is also called the positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.
Precision is a useful metric in cases where False Positives are a higher concern than False Negatives (e.g. in e-commerce recommendations, wrong results could lead to customer churn).

precision = TP / (TP + FP)
Formula 1-3 Confusion Matrix - Precision

1.7.1.1.3 Recall/Sensitivity
Recall is calculated as the number of correct positive predictions divided by the total number of
positives i.e. the actual positive cases we were able to predict correctly with our model.

recall = TP / (TP + FN)
Formula 1-4 Confusion Matrix - Recall

It is also referred to as the true positive rate (TPR). The best recall is 1.0, whereas the worst is 0.0. Recall is a useful metric in cases where False Negatives are a higher concern than False Positives (e.g. in medical diagnosis, raising a false alarm may be safer).
1.7.1.1.4 Specificity
Specificity is calculated as the number of correct negative predictions divided by the total number
of negatives.

specificity = TN / (TN + FP)
Formula 1-5 Confusion Matrix - Specificity

It is also referred to as the true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
1.7.1.1.5 F1 Score
There is generally a trade-off between the recall and precision metrics. The best way to capture both is to use a combination of the two, which gives us the F1-Score metric. The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.

F1 Score = 2 · (precision · recall) / (precision + recall) = TP / (TP + ½(FP + FN))
Formula 1-6 Confusion Matrix – F1 Score

The interpretability of the F1-score is poor on its own. Using it in combination with other evaluation metrics gives us a complete picture of the result.
1.7.1.1.6 Classification Report
This report displays the precision, recall, F1, and support scores for the created model. It is generated by the classification_report function of the sklearn.metrics library. A sample report is shown below:
precision recall f1-score support
0 0.78 0.91 0.84 300
1 0.71 0.47 0.57 600
accuracy 0.77 900
macro avg 0.75 0.69 0.70 900
weighted avg 0.76 0.77 0.75 900
Table 1-42 Sample Classification Report
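A minimal sketch of producing the confusion matrix and classification report for a fitted classifier; model, X_test_scaled and y_test are hypothetical names standing in for any of the models built earlier:

from sklearn.metrics import confusion_matrix, classification_report

y_pred = model.predict(X_test_scaled)          # predicted class labels
print(confusion_matrix(y_test, y_pred))        # rows = actual, columns = predicted
print(classification_report(y_test, y_pred))   # precision, recall, F1-score, support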

ROC Curve and AUC Score


AUC (Area Under the Curve) and ROC (Receiver Operating Characteristic) are important model performance measurement visualizations.
The ROC is a probability curve which plots the TPR (True Positive Rate) against the FPR (False Positive Rate), where the TPR is on the y-axis and the FPR is on the x-axis.
AUC represents the area under the ROC curve. The higher the AUC, the better the model is at correctly classifying the instances. The ideal ROC curve extends to the top left corner, which would result in an AUC of 1.
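A minimal sketch of plotting the ROC curve and computing the AUC score, under the same hypothetical names as above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = model.predict_proba(X_test_scaled)[:, 1]   # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
print(roc_auc_score(y_test, y_prob))                # area under the ROC curve
plt.plot(fpr, tpr)                                  # TPR on the y-axis vs FPR on the x-axis
plt.show()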


1.7.2 Model Performance Decisions


In order to judge all the generated models, we use the following set of guidelines.
1. We hold accuracy as the best metric to check for this dataset. This is because we are not looking for a better FP or TN value; we are looking to check the Labour vs Conservative predictions by the model.

2. We also check the other scores and hope to see them fairly balanced. If a party is going to use this model to predict its chances of victory in an area, recall is a better metric to track. Recall will let us know how accurately the model is able to identify the party that the voter will choose.

3. The F1-Score value is checked to see if it has a high value.

4. The AUC is checked to see if the value is high. The curve shape is also checked to see if it extends up to the top left corner.

Note: In the below details:

• 0 indicates Conservative
• 1 indicates Labour


1.7.3 Logistic Regression Model – Complete Performance

Training Data

Figure 1-27 Log. Reg. Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.75    0.63      0.68      307
Labour             0.86    0.91      0.89      754
accuracy                             0.83     1061
macro avg          0.80    0.77      0.78     1061
weighted avg       0.83    0.83      0.83     1061
Table 1-43 Log. Reg. Training Data Classification Report

Figure 1-29 Log. Reg. Training Data ROC-AUC Curve

• Accuracy is fairly high for the training data prediction
• Recall value and F1-Score are high only for Labour as compared to Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-28 Log. Reg. Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.76    0.72      0.74      153
Labour             0.86    0.88      0.87      303
accuracy                             0.83      456
macro avg          0.81    0.80      0.81      456
weighted avg       0.83    0.83      0.83      456
Table 1-44 Log. Reg. Test Data Classification Report

Figure 1-30 Log. Reg. Test Data ROC-AUC Curve

• Accuracy is fairly high for the test data prediction
• Recall value and F1-Score are much better for both Labour and Conservative as compared to the training data scores
• The AUC score indicates that the model is good

Table 1-45 Log. Reg. Metrics – Train and Test Data

Conclusion: Excellent model


1.7.4 LDA Model – Complete Performance

Training Data

Figure 1-31 LDA Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.74    0.65      0.69      307
Labour             0.86    0.91      0.89      754
accuracy                             0.83     1061
macro avg          0.80    0.78      0.79     1061
weighted avg       0.83    0.83      0.83     1061
Table 1-46 LDA Training Data Classification Report

Figure 1-33 LDA Training Data ROC-AUC Curve

• Accuracy is fairly high for the training data prediction
• Recall value and F1-Score are high only for Labour as compared to Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-32 LDA Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.77    0.73      0.74      153
Labour             0.86    0.89      0.88      303
accuracy                             0.83      456
macro avg          0.82    0.81      0.81      456
weighted avg       0.83    0.83      0.83      456
Table 1-47 LDA Test Data Classification Report

Figure 1-34 LDA Test Data ROC-AUC Curve

• Accuracy is fairly high for the test data prediction
• Recall value and F1-Score are much better for both Labour and Conservative as compared to the training data scores
• The AUC score indicates that the model is good

Table 1-48 LDA Metrics – Train and Test Data

Conclusion: Excellent model


1.7.5 KNN Model – Complete Performance

Training Data

Figure 1-35 KNN Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.75    0.70      0.72      307
Labour             0.88    0.90      0.89      754
accuracy                             0.84     1061
macro avg          0.81    0.80      0.81     1061
weighted avg       0.84    0.84      0.84     1061
Table 1-49 KNN Training Data Classification Report

Figure 1-37 KNN Training Data ROC-AUC Curve

• Accuracy is high for the training data prediction
• Recall value and F1-Score are high for both Labour and Conservative
• The AUC score indicates that the model is very good

Test Data

Figure 1-36 KNN Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.78    0.74      0.76      153
Labour             0.87    0.89      0.88      303
accuracy                             0.84      456
macro avg          0.83    0.82      0.82      456
weighted avg       0.84    0.84      0.84      456
Table 1-50 KNN Test Data Classification Report

Figure 1-38 KNN Test Data ROC-AUC Curve

• Accuracy is high for the test data prediction as well
• Recall value and F1-Score are much better for both Labour and Conservative as compared to the training data scores
• The AUC score indicates that the model is very good

Table 1-51 KNN Metrics – Train and Test Data

Conclusion: Excellent model


1.7.6 Naïve Bayes Model – Complete Performance

Training Data

Figure 1-39 Naïve Bayes Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.75    0.63      0.68      307
Labour             0.86    0.92      0.89      754
accuracy                             0.83     1061
macro avg          0.81    0.77      0.78     1061
weighted avg       0.83    0.83      0.83     1061
Table 1-52 Naïve Bayes Training Data Classification Report

Figure 1-41 Naïve Bayes Training Data ROC-AUC Curve

• Accuracy is high for the training data prediction
• Recall value and F1-Score are high only for Labour as compared to Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-40 Naïve Bayes Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.77    0.69      0.73      153
Labour             0.85    0.89      0.87      303
accuracy                             0.83      456
macro avg          0.81    0.79      0.80      456
weighted avg       0.82    0.83      0.82      456
Table 1-53 Naïve Bayes Test Data Classification Report

Figure 1-42 Naïve Bayes Test Data ROC-AUC Curve

• Accuracy is high for the test data prediction
• Recall value and F1-Score are better for Conservative but have slightly reduced for Labour as compared to the training data scores
• The AUC score indicates that the model is good

Table 1-54 Naïve Bayes Metrics – Train and Test Data

Conclusion: Good model


1.7.7 Bagging Classification Model – Complete Performance

Training Data

Figure 1-43 Bagging Classification Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.75    0.70      0.72      307
Labour             0.88    0.90      0.89      754
accuracy                             0.84     1061
macro avg          0.81    0.80      0.81     1061
weighted avg       0.84    0.84      0.84     1061
Table 1-55 Bagging Classification Training Data Classification Report

Figure 1-45 Bagging Classification Training Data ROC-AUC Curve

• Accuracy is fairly high for the training data prediction
• Recall value and F1-Score are high for Labour and decent for Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-44 Bagging Classification Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.72    0.69      0.71      153
Labour             0.85    0.86      0.86      303
accuracy                             0.81      456
macro avg          0.78    0.78      0.78      456
weighted avg       0.81    0.81      0.81      456
Table 1-56 Bagging Classification Test Data Classification Report

Figure 1-46 Bagging Classification Test Data ROC-AUC Curve

• Accuracy has reduced for the test data prediction
• Recall value and F1-Score are lower for both Labour and Conservative as compared to the training data scores
• The AUC score indicates that the model is good

Table 1-57 Bagging Classification Metrics – Train and Test Data

Conclusion: Model not preferred


1.7.8 Random Forest Bagging Model – Complete Performance

Training Data

Figure 1-47 Random Forest Bagging Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.79    0.66      0.72      307
Labour             0.87    0.93      0.90      754
accuracy                             0.85     1061
macro avg          0.83    0.80      0.81     1061
weighted avg       0.85    0.85      0.85     1061
Table 1-58 Random Forest Bagging Training Data Classification Report

Figure 1-49 Random Forest Bagging Training Data ROC-AUC Curve

• Accuracy is high for the training data prediction
• Recall value and F1-Score are high only for Labour as compared to Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-48 Random Forest Bagging Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.79    0.65      0.72      153
Labour             0.84    0.91      0.88      303
accuracy                             0.83      456
macro avg          0.82    0.78      0.80      456
weighted avg       0.82    0.83      0.82      456
Table 1-59 Random Forest Bagging Test Data Classification Report

Figure 1-50 Random Forest Bagging Test Data ROC-AUC Curve

• Accuracy has reduced for the test data but is still fairly high
• Recall value and F1-Score have reduced for both Labour and Conservative as compared to the training data scores
• The AUC score indicates that the model is good

Table 1-60 Random Forest Bagging Metrics – Train and Test Data

Conclusion: Model not preferred


1.7.9 Ada Boost Model – Complete Performance

Training Data (SMOTE-balanced, 1508 rows)

Figure 1-51 Ada Boost Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.89    0.89      0.89      754
Labour             0.89    0.90      0.89      754
accuracy                             0.89     1508
macro avg          0.89    0.89      0.89     1508
weighted avg       0.89    0.89      0.89     1508
Table 1-61 Ada Boost Training Data Classification Report

Figure 1-53 Ada Boost Training Data ROC-AUC Curve

• Accuracy is very high for the training data prediction
• Recall value and F1-Score are high for both Labour and Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-52 Ada Boost Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.74    0.71      0.72      153
Labour             0.85    0.87      0.86      303
accuracy                             0.82      456
macro avg          0.80    0.79      0.79      456
weighted avg       0.82    0.82      0.82      456
Table 1-62 Ada Boost Test Data Classification Report

Figure 1-54 Ada Boost Test Data ROC-AUC Curve

• Accuracy is high for the test data prediction
• Recall value and F1-Score have reduced for both Labour and Conservative as compared to the training data scores but are still high
• The AUC score indicates that the model is good

Table 1-63 Ada Boost Metrics – Train and Test Data

Conclusion: Excellent Model


1.7.10 Gradient Boost Model – Complete Performance

Training Data (SMOTE-balanced, 1508 rows)

Figure 1-55 Gradient Boost Training Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.91    0.93      0.92      754
Labour             0.93    0.91      0.92      754
accuracy                             0.92     1508
macro avg          0.92    0.92      0.92     1508
weighted avg       0.92    0.92      0.92     1508
Table 1-64 Gradient Boost Training Data Classification Report

Figure 1-57 Gradient Boost Training Data ROC-AUC Curve

• Accuracy is very high for the training data prediction
• Recall value and F1-Score are high for both Labour and Conservative
• The AUC score indicates that the model is good

Test Data

Figure 1-56 Gradient Boost Test Data Confusion Matrix

              precision  recall  f1-score  support
Conservative       0.75    0.76      0.75      153
Labour             0.88    0.87      0.87      303
accuracy                             0.83      456
macro avg          0.81    0.81      0.81      456
weighted avg       0.83    0.83      0.83      456
Table 1-65 Gradient Boost Test Data Classification Report

Figure 1-58 Gradient Boost Test Data ROC-AUC Curve

• Accuracy is high for the test data prediction
• Recall value and F1-Score have reduced for both Labour and Conservative as compared to the training data scores but are still high
• The AUC score indicates that the model is good

Table 1-66 Gradient Boost Metrics – Train and Test Data

Conclusion: Excellent Model


1.7.11 All Models Performance Comparison


The classification report data which was independently displayed for each model has been tabulated and graphically plotted to identify the 3 best models.

Model                              Train Acc.  Train Recall  Train Prec.  Train F1  Train AUC  Test Acc.  Test Recall  Test Prec.  Test F1  Test AUC
GridSearchCV Logistic Regression   83          91            86           89        0.89       83         88           86          87       0.88
LDA Scaled                         83          91            86           89        0.89       83         89           86          88       0.89
Neighbor Search KNN                84          90            88           89        0.91       84         89           87          88       0.89
GridSearchCV Naive Bayes           83          92            86           89        0.89       83         89           85          87       0.88
GridSearchCV Bagging Classifier    84          90            88           89        0.91       81         86           85          86       0.88
GridSearchCV RF Bagging            85          93            87           90        0.91       83         91           84          88       0.89
Ada Boost Classifier Smote         89          90            89           89        0.96       82         87           85          86       0.88
GridSearchCV Gradient Boost        92          91            93           92        0.97       83         87           88          87       0.9
Table 1-67 All Models Scores


Top Accuracy (Test Data): KNN, Logistic Regression, LDA

Figure 1-59 All Models – Accuracy

Top Precision (Test Data): Gradient Boost, KNN, LDA

Figure 1-60 All Models – Precision

Top Recall (Test Data): RF, KNN, LDA


Figure 1-61 All Models – Recall

Top F1-Score (Test Data): LDA, KNN, RF

Figure 1-62 All Models – F1-Score


Top AUC (Test Data): Gradient Boost, KNN, LDA

Figure 1-63 All Models - AUC

The ROC curve for training data shows best scores for Ada Boost and Gradient Boost models.

Figure 1-64 All Models – Training Data ROC


The ROC curve for test data is almost the same for all models; the best scores are seen for the Gradient Boost and KNN models.

Figure 1-65 All Models – Test Data ROC

1.7.12 Final Model Choice


The KNN model has the best performance across all the metrics. This performance is consistent for both the training and test data. KNN is the model of choice for the voter data prediction.


1.8 Based on these predictions, what are the insights? (5 marks)

1. The data on positive government actions from constituencies where the Conservative party won should be collected to understand what appealed to the citizens there. Publicising this positive work can help change the tide of victory for the Conservative party in the future.
In similar fashion, data collection should be done in the same constituencies by the Labour party to identify its shortcomings and the general public opinion.

2. The high Europe skepticism should be understood and addressed. This may be a key factor in ensuring victory in the next elections.

3. The Labour party has a high number of supporters with a political knowledge score of 0. Addressing this issue may cause a shift in public opinion.

4. Applying the developed model to predict whether a constituency is Labour or Conservative will help take measurable actions like additional campaigning, addressing citizens' concerns etc. to change the public outlook for that party.


2 Problem 2 Statement
In this particular project, we are going to work on the inaugural corpus from the NLTK in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned
documents. – 3 Marks
2.1.1 Number of Characters
The number of characters in each speech is computed with the raw function of the nltk.corpus.inaugural corpus. This function returns the whole speech as a single string.

Applying the len function on this output gives us the number of characters in each speech. The speech samples as returned by the raw function, along with the count of characters, have been tabulated below.
The same data has been plotted for easy visual comparison.

Speech Raw Output Count


On each national day of inauguration since 1789, the people have
renewed their sense of dedication to the United States.
1941-Roosevelt ..... 7571
We do not retreat. We are not content to stand still. As Americans,
we go forward, in the service of our country, by the will of God.
Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President
Eisenhower, Vice President Nixon, President Truman, reverend
clergy, fellow citizens, we observe today
1961-Kennedy .... 7618
with history the final judge of our deeds, let us go forth to lead the
land we love, asking His blessing and His help, but knowing that
here on earth God's work must truly be our own.
Mr. Vice President, Mr. Speaker, Mr. Chief Justice, Senator Cook,
Mrs. Eisenhower, and my fellow citizens of this great and good
country we share together:
1973-Nixon ... 9991
Let us go forward from here confident in hope, strong in our faith
in one another, sustained by our faith in God who created us, and
striving always to serve His purpose.
Table 2-1 Speech Raw Output + Character Count


The 1973-Nixon speech holds the highest number of characters; it has more than 2,000 characters more than the other speeches. The 1941-Roosevelt and 1961-Kennedy speeches are very close in character count.

Figure 2-1 Character Count

2.1.2 Number of Words

The number of words in each speech is computed with the words function of the nltk.corpus.inaugural corpus. This function returns the whole speech as a list of words.

Applying the len function on this output gives us the number of words in each speech. The speech samples as returned by the words function, along with the count of words, have been tabulated below.
The same data has been plotted for easy visual comparison.

Figure 2-2 Word Count

Speech Words Output Count


1941-Roosevelt ['On', 'each', 'national', 'day', 'of', 'inauguration', ...] 1536
1961-Kennedy ['Vice', 'President', 'Johnson', ',', 'Mr', '.', ...] 1546
1973-Nixon ['Mr', '.', 'Vice', 'President', ',', 'Mr', '.', ...] 2028
Table 2-2 Speech Words Output + Words Count
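
The word count follows the same pattern as the character count, swapping raw for words — a sketch under the same assumptions:

from nltk.corpus import inaugural

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    tokens = inaugural.words(fileid)    # speech as a flat list of word/punctuation tokens
    print(fileid, len(tokens))          # word count (punctuation tokens included)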


In similar fashion to the character count, the 1973-Nixon speech has the highest word count,
nearly 500 words more than the other speeches.
The 1941-Roosevelt and 1961-Kennedy speeches are very close in word count.

2.1.3 Number of Sentences


The number of sentences in each speech is computed with the sents function of the
nltk.corpus.inaugural library. This function returns each speech as a list of lists, where each
sentence in the speech is a list of its tokens.

Applying the len function to this list gives us the number of sentences in each speech. The
speech samples as returned by the sents function, along with the sentence counts, have been
tabulated below; a code sketch follows the table.
The same data has been plotted for easy visual comparison.

Figure 2-3 Sentence Count

The 1973-Nixon and 1941-Roosevelt sentence counts are very close to each other. The 1961-
Kennedy speech has fewer sentences than the other two speeches.

Speech excerpts as returned by sents, with sentence counts:

1941-Roosevelt (68 sentences)
[['On', 'each', 'national', 'day', 'of', 'inauguration', 'since', '1789', ',', 'the', 'people', 'have', 'renewed', 'their', 'sense', 'of', 'dedication', 'to', 'the', 'United', 'States', '.'], ['In', 'Washington', "'", 's', 'day', 'the', 'task', 'of', 'the', 'people', 'was', 'to', 'create', 'and', 'weld', 'together', 'a', 'nation', '.'], ...]

1961-Kennedy (52 sentences)
[['Vice', 'President', 'Johnson', ',', 'Mr', '.', 'Speaker', ',', 'Mr', '.', 'Chief', 'Justice', ',', 'President', 'Eisenhower', ',', 'Vice', 'President', 'Nixon', ',', 'President', 'Truman', ',', 'reverend', 'clergy', ',', 'fellow', 'citizens', ',', 'we', 'observe', 'today', 'not', 'a', 'victory', 'of', 'party', ',', 'but', ...], ...]

1973-Nixon (69 sentences)
[['Mr', '.', 'Vice', 'President', ',', 'Mr', '.', 'Speaker', ',', 'Mr', '.', 'Chief', 'Justice', ',', 'Senator', 'Cook', ',', 'Mrs', '.', 'Eisenhower', ',', 'and', 'my', 'fellow', 'citizens', 'of', 'this', 'great', 'and', 'good', 'country', 'we', 'share', 'together', ':'], ...]

Table 2-3 Speech Sentence Output + Sentence Count
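
Sentence counting again follows the same pattern, using the sents accessor — a sketch:

from nltk.corpus import inaugural

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    sentences = inaugural.sents(fileid)    # list of sentences, each a list of tokens
    print(fileid, len(sentences))          # sentence count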


2.2 Remove all the stop words from all three speeches. – 3 Marks
2.2.1 Lower Case Words
Before any stemming or stop-word removal is undertaken, all the words in each speech are
converted to lower case. This ensures correct identification of repeated words, since “The”
would otherwise not match “the”. Removing case dependency in this way makes the speech
analysis more streamlined; a minimal sketch of this step follows.
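
A minimal sketch of the lower-casing step (the variable names here are illustrative, not the report's own):

from nltk.corpus import inaugural

tokens = inaugural.words('1941-Roosevelt.txt')
lower_tokens = [word.lower() for word in tokens]    # 'The' and 'the' now compare equal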

2.2.2 Stemming Words


Stemming is the process of reducing a word to its base form by removing its affixes. For
example, the stem of the words eating and eats is eat.
For the speeches, we apply stemming using the PorterStemmer class from the nltk.stem
library. Below are the opening words of each speech as given to the stemmer, together with the
output received from it. The differences in the words before and after stemming have been highlighted.
1941-Roosevelt – Before and After Stemming
Before:
'on', 'each', 'national', 'day', 'of', 'inauguration', 'since', '1789', ',', 'the', 'people', 'have',
'renewed', 'their', 'sense', 'of', 'dedication', 'to', 'the', 'united', 'states', '.'
After:
'on', 'each', 'nation', 'day', 'of', 'inaugur', 'sinc', '1789', ',', 'the', 'peopl', 'have', 'renew',
'their', 'sens', 'of', 'dedic', 'to', 'the', 'unit', 'state', '.'

1961-Kennedy – Before and After Stemming


Before:
['vice', 'president', 'johnson', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justice', ',',
'president', 'eisenhower', ',', 'vice', 'president', 'nixon', ',', 'president', 'truman', ',',
'reverend', 'clergy', ',', 'fellow', 'citizens', ',', 'we', 'observe', 'today', 'not', 'a', 'victory', 'of',
'party', ',', 'but', 'a', 'celebration', 'of', 'freedom',
After:
['vice', 'presid', 'johnson', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justic', ',', 'presid',
'eisenhow', ',', 'vice', 'presid', 'nixon', ',', 'presid', 'truman', ',', 'reverend', 'clergi', ',',
'fellow', 'citizen', ',', 'we', 'observ', 'today', 'not', 'a', 'victori', 'of', 'parti', ',', 'but', 'a',
'celebr', 'of', 'freedom'


1973-Nixon – Before and After Stemming


Before:
'mr', '.', 'vice', 'president', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justice', ',', 'senator',
'cook', ',', 'mrs', '.', 'eisenhower', ',', 'and', 'my', 'fellow', 'citizens', 'of', 'this', 'great', 'and',
'good', 'country', 'we', 'share', 'together', ':', 'when', 'we', 'met', 'here', 'four', 'years', 'ago',
',', 'america', 'was', 'bleak', 'in', 'spirit', ',', 'depressed', 'by', 'the', 'prospect', 'of',
'seemingly', 'endless', 'war', 'abroad', 'and', 'of', 'destructive', 'conflict', 'at', 'home', '.',
After:
'mr', '.', 'vice', 'presid', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justic', ',', 'senat', 'cook', ',',
'mr', '.', 'eisenhow', ',', 'and', 'my', 'fellow', 'citizen', 'of', 'thi', 'great', 'and', 'good', 'countri',
'we', 'share', 'togeth', ':', 'when', 'we', 'met', 'here', 'four', 'year', 'ago', ',', 'america', 'wa',
'bleak', 'in', 'spirit', ',', 'depress', 'by', 'the', 'prospect', 'of', 'seemingli', 'endless', 'war',
'abroad', 'and', 'of', 'destruct', 'conflict', 'at', 'home', '.',
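
The stemming step itself is short once the tokens are lower-cased — a minimal sketch, assuming the lower_tokens list from the sketch in Section 2.2.1:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in lower_tokens]
print(stemmed_tokens[:22])    # e.g. 'inauguration' -> 'inaugur', 'people' -> 'peopl'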

Conclusion on Stemming
In this case, the words of the speeches will be used to create word clouds, and truncated stems
such as 'peopl' or 'countri' would be hard to read there. Taking this into account, we do not use
the stemmed output in further operations.

2.2.3 Stop Words Cleanup


Stop words are the most common words of a language, which many search engines and
text-analytics pipelines skip in order to save space and processing time on large data. Before
the text is analyzed, these stop words are filtered out.
1. For our purpose, we use the stop words provided by the stopwords function of the
nltk.corpus library for the English language.
2. The necessary punctuation marks are added to the list obtained from this function.
After constructing the stop-words list, we clean up all the speeches; a code sketch follows the table below.

Speech Count Before Stop Words Clearance Count After Stop Words Clearance
1941-Roosevelt 1536 657
1961-Kennedy 1546 722
1973-Nixon 2028 853
Table 2-4 Speech Output Post Cleanup
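
A sketch of the cleanup, assuming the lower_tokens list from Section 2.2.1; string.punctuation is one plausible way to add the punctuation marks mentioned above (note it holds single characters only, so multi-character tokens such as '--' survive, consistent with the results in Section 2.3):

import string
import nltk
nltk.download('stopwords', quiet=True)    # fetch the stop-word lists if not present
from nltk.corpus import stopwords

stop_list = set(stopwords.words('english')) | set(string.punctuation)
cleaned_tokens = [word for word in lower_tokens if word not in stop_list]
print(len(lower_tokens), '->', len(cleaned_tokens))    # counts before and after cleanup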


We can see that the word counts of all the speeches have reduced significantly.

Figure 2-4 Word Count – Before and After Stop-Words Cleanup

The differences in the words before and after the stop-words cleanup have been highlighted. The
words and punctuation marks highlighted in red have been removed by the cleanup operation.
1941-Roosevelt – Before and After Stop Words Cleanup
Before:
'on', 'each', 'national', 'day', 'of', 'inauguration', 'since', '1789', ',', 'the', 'people', 'have',
'renewed', 'their', 'sense', 'of', 'dedication', 'to', 'the', 'united', 'states', '.', 'in',
'washington', "'", 's', 'day', 'the', 'task', 'of', 'the', 'people', 'was', 'to', 'create', 'and',
'weld', 'together', 'a', 'nation', '.',
After:
'national', 'day', 'inauguration', 'since', '1789', 'people', 'renewed', 'sense', 'dedication',
'united', 'states', 'washington', 'day', 'task', 'people', 'create', 'weld', 'together', 'nation',

1961-Kennedy – Before and After Stop Words Cleanup


Before:
'vice', 'president', 'johnson', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justice', ',',
'president', 'eisenhower', ',', 'vice', 'president', 'nixon', ',', 'president', 'truman', ',',
'reverend', 'clergy', ',', 'fellow', 'citizens', ',', 'we', 'observe', 'today', 'not', 'a', 'victory', 'of',
'party', ',', 'but', 'a', 'celebration', 'of', 'freedom', '--', 'symbolizing', 'an', 'end',


After:
'vice', 'president', 'johnson', 'mr', 'speaker', 'mr', 'chief', 'justice', 'president', 'eisenhower',
'vice', 'president', 'nixon', 'president', 'truman', 'reverend', 'clergy', 'fellow', 'citizens',
'observe', 'today', 'victory', 'party', 'celebration', 'freedom', '--', 'symbolizing', 'end',

1973-Nixon – Before and After Stop Words Cleanup


Before:
'mr', '.', 'vice', 'president', ',', 'mr', '.', 'speaker', ',', 'mr', '.', 'chief', 'justice', ',', 'senator',
'cook', ',', 'mrs', '.', 'eisenhower', ',', 'and', 'my', 'fellow', 'citizens', 'of', 'this', 'great', 'and',
'good', 'country', 'we', 'share', 'together', ':', 'when', 'we', 'met', 'here', 'four', 'years',
'ago', ',', 'america', 'was', 'bleak', 'in', 'spirit', ',', 'depressed', 'by', 'the', 'prospect', 'of',
'seemingly', 'endless', 'war', 'abroad', 'and', 'of', 'destructive', 'conflict', 'at', 'home', '.',
'as', 'we', 'meet', 'here', 'today', ',', 'we', 'stand', 'on', 'the', 'threshold', 'of', 'a', 'new', 'era',
'of', 'peace', 'in', 'the', 'world', '.',
After:
'mr', 'vice', 'president', 'mr', 'speaker', 'mr', 'chief', 'justice', 'senator', 'cook', 'mrs',
'eisenhower', 'fellow', 'citizens', 'great', 'good', 'country', 'share', 'together', 'met', 'four',
'years', 'ago', 'america', 'bleak', 'spirit', 'depressed', 'prospect', 'seemingly', 'endless',
'war', 'abroad', 'destructive', 'conflict', 'home', 'meet', 'today', 'stand', 'threshold', 'new',
'era', 'peace', 'world',

2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stop
words) – 3 Marks
2.3.1 Most Frequent Words in Cleaned Speech
We see that the tokens “--” and “let” occur with high frequency. These do not convey any useful
information, so we add them to the stop-words list and clean the data again.

Speech           Word       Count of Occurrence
1941-Roosevelt   "--"       25
                 "nation"   12
                 "know"     10
1961-Kennedy     "--"       25
                 "let"      16
                 "us"       12
1973-Nixon       "us"       26
                 "let"      22
                 "america"  21
Table 2-5 Words with Top Frequency

2.3.2 Most Frequent Words in Cleaned Speech (Updated Stop Words)


After appending the words “--“, and “let” to the stop-words list, we have the new top frequency
words below.

Speech           Word       Count of Occurrence
1941-Roosevelt   "nation"   12
                 "know"     10
                 "spirit"   9
1961-Kennedy     "us"       12
                 "world"    8
                 "sides"    8
1973-Nixon       "us"       26
                 "america"  21
                 "peace"    19
Table 2-6 Words with Top Frequency – After Updated Stop Words
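
A sketch of the frequency count after extending the stop-word list, assuming the cleaned_tokens list from the sketch in Section 2.2.3; FreqDist is NLTK's frequency counter (collections.Counter would work equally well):

from nltk.probability import FreqDist

extra_stops = {'--', 'let'}    # uninformative high-frequency tokens noted above
final_tokens = [w for w in cleaned_tokens if w not in extra_stops]

freq = FreqDist(final_tokens)
print(freq.most_common(3))     # top three words with their counts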

2.4 Plot the word cloud of each of the speeches of the variable. (after removing
the stop words) – 3 Marks
Word clouds are cloud-like depictions of words in which the more frequently a specific word
appears in the source text, the bigger and bolder it appears in the cloud.
The three speeches, after being cleaned of stop words, have their word clouds constructed as
shown in the images below; a code sketch follows this paragraph. The most prominent words in
each cloud are reflected in the speech descriptions.
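
A sketch of the word-cloud construction, assuming the third-party wordcloud package and the final_tokens list from the sketch in Section 2.3.2; the styling options shown are illustrative:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = ' '.join(final_tokens)    # cleaned tokens of one speech, re-joined as text
cloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()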


2.4.1 1941-Roosevelt Speech


Roosevelt’s speech concentrates on the nation of America. It speaks about the spirit of freedom
and democracy in people’s lives, and touches on the future and the minds of the people.

Figure 2-5 Word Cloud - 1941-Roosevelt Speech


2.4.2 1961-Kennedy Speech


Kennedy’s speech concentrates on the world, the new generation and power. It calls upon the
nation to pledge, refers to the power of citizens, and speaks about their poverty, burdens,
freedom and hope.

Figure 2-6 Word Cloud – 1961-Kennedy Speech


2.4.3 1973-Nixon Speech


Nixon’s speech talks about the great nation of America, the world and peace. It refers to the
government’s role and responsibility.

Figure 2-7 Word Cloud - 1973-Nixon Speech
