Machine Learning Project Report
Table of Contents
1 Problem 1 Statement
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it. (4 marks)
1.1.1 Data Summary
1.1.2 Duplicated Data Summary
1.1.3 Descriptive Statistics
1.1.4 Sample Data
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers. (7 marks)
1.2.1 Univariate Analysis
Age
Numerical Categorical (Ordinal) Data
National Economic Condition
Household Economic Condition
Blair
Hague
Europe
Political Knowledge
Gender
Vote
Age Distributions Across Other Features
Vote Distributions Across Other Features
Bivariate Analysis
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30). (4 marks)
1.3.1 Encoding of Categorical Data
1.3.2 Split of Data – Train and Test
1.3.3 Scaling Necessity
What is Scaling
Why is Scaling Needed
List of Figures
Figure 1-1 Age Data – Boxplot and Histogram
Figure 1-2 Count of Voters Across Age Groups
Figure 1-3 NEC Data – Boxplot and Count Plot
Figure 1-4 HEC Data – Boxplot and Count Plot
Figure 1-5 Blair Assessment Data – Boxplot and Count Plot
Figure 1-6 Hague Assessment Data – Boxplot and Count Plot
Figure 1-7 Europe Data – Boxplot and Count Plot
Figure 1-8 PK Data – Boxplot and Count Plot
Figure 1-9 Gender Count Plot
Figure 1-10 Vote Count Plot
Figure 1-11 Voter Age Groups – NEC and HEC Scoring
Figure 1-12 Voter Age Groups – Blair-Hague Scoring
Figure 1-13 Voter Age Groups – Europe Scoring & Political Knowledge
Figure 1-14 Voter Age Groups – Gender
Figure 1-15 Vote Decision – Age Groups
Figure 1-16 Vote Decision – HEC and NEC Scores
Figure 1-17 Vote Decision – Blair-Hague Scores
Figure 1-18 Vote Decision – Political Knowledge Scores
Figure 1-19 Vote Decision – Europe Sentiment Scores
Figure 1-20 Vote Decision – Gender
Figure 1-21 Voters – Pair Plot
Figure 1-22 Voters – Numerical Data Heat Map
Figure 1-23 Range of Data Distribution of All Columns
Figure 1-24 Range of Data Distribution of All Columns After Scaling
Figure 1-25 MCE-K Neighbors Plot
Figure 1-26 Confusion Matrix
Figure 1-27 Log. Reg. Training Data Confusion Matrix
Figure 1-28 Log. Reg. Test Data Confusion Matrix
Figure 1-29 Log. Reg. Training Data ROC-AUC Curve
Figure 1-30 Log. Reg. Test Data ROC-AUC Curve
Figure 1-31 LDA Training Data Confusion Matrix
Figure 1-32 LDA Test Data Confusion Matrix
Figure 1-33 LDA Training Data ROC-AUC Curve
Figure 1-34 LDA Test Data ROC-AUC Curve
Figure 1-35 KNN Training Data Confusion Matrix
Figure 1-36 KNN Test Data Confusion Matrix
Figure 1-37 KNN Training Data ROC-AUC Curve
Figure 1-38 KNN Test Data ROC-AUC Curve
Figure 1-39 Naïve Bayes Training Data Confusion Matrix
Figure 1-40 Naïve Bayes Test Data Confusion Matrix
Figure 1-41 Naïve Bayes Training Data ROC-AUC Curve
Figure 1-42 Naïve Bayes Test Data ROC-AUC Curve
Figure 1-43 Bagging Classification Training Data Confusion Matrix
Figure 1-44 Bagging Classification Test Data Confusion Matrix
Figure 1-45 Bagging Classification Training Data ROC-AUC Curve
Figure 1-46 Bagging Classification Test Data ROC-AUC Curve
Figure 1-47 Random Forest Bagging Training Data Confusion Matrix
Figure 1-48 Random Forest Bagging Test Data Confusion Matrix
Figure 1-49 Random Forest Bagging Training Data ROC-AUC Curve
Figure 1-50 Random Forest Bagging Test Data ROC-AUC Curve
Figure 1-51 Ada Boost Training Data Confusion Matrix
Figure 1-52 Ada Boost Test Data Confusion Matrix
Figure 1-53 Ada Boost Training Data ROC-AUC Curve
Figure 1-54 Ada Boost Test Data ROC-AUC Curve
Figure 1-55 Gradient Boost Training Data Confusion Matrix
Figure 1-56 Gradient Boost Test Data Confusion Matrix
Figure 1-57 Gradient Boost Training Data ROC-AUC Curve
Figure 1-58 Gradient Boost Test Data ROC-AUC Curve
List of Tables
Table 1-1 Data Dictionary for Election Data Survey Details
Table 1-2 Sample of Duplicated Voter Data
Table 1-3 Descriptive Statistics of Voter Data (Numerical Columns)
Table 1-4 Sample Electorate Distribution Data
List of Formulae
Formula 1-1 Min-Max Calculation
Formula 1-2 Confusion Matrix – Accuracy
Formula 1-3 Confusion Matrix – Precision
Formula 1-4 Confusion Matrix – Recall
Formula 1-5 Confusion Matrix – Specificity
Formula 1-6 Confusion Matrix – F1 Score
1 Problem 1 Statement
You are hired by CNBE, one of the leading news channels, which wants to analyse the recent elections. A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and the seats covered by a particular party.
1.1 Read the dataset. Do the descriptive statistics and do the null value condition
check. Write an inference on it. (4 marks)
1.1.1 Data Summary
The summary describes the data type and the number of data entries in each of the columns in
the dataset. The presence of null data and duplicated data is also noted.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Unnamed: 0                1525 non-null   int64
 1   Vote                      1525 non-null   object
 2   Age                       1525 non-null   int64
 3   National Economic Cond.   1525 non-null   int64
 4   Household Economic Cond.  1525 non-null   int64
 5   Blair                     1525 non-null   int64
 6   Hague                     1525 non-null   int64
 7   Europe                    1525 non-null   int64
 8   Political Knowledge       1525 non-null   int64
 9   Gender                    1525 non-null   object
dtypes: int64(8), object(2)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check
for Outliers. (7 marks)
1.2.1 Univariate Analysis
Age
Distribution: The age data is fairly normally distributed, with a very wide top indicating that voters are spread evenly across all ages.
Skew: The data has very low skew (skewness: 0.14), with no long tail on either end.
Outliers: The age data has no outliers.
National Economic Condition
• The largest percentage of the population scored NEC as 3 or 4. This shows that the majority (~75% of the total) of the survey demographic holds a moderate view of the NEC.
• The fewest members of the survey demographic scored NEC as 1.
Household Economic Condition
• The largest percentage of the population scored HEC as 3, followed by a score of 4. This shows that the majority of the survey demographic (~70%) holds a moderate view of the HEC.
• The fewest members of the survey demographic hold the poorest assessment of HEC.
Blair
The boxplot shows no outliers.
• The largest percentage of the population (~65%) scored Blair (the Labour leader) as 4 or 5. This shows the Labour party has a good standing.
Hague
The boxplot shows no outliers.
• The numbers of people who scored the Conservative leadership as 2 or 4 are close (40.92% for a score of 2 vs 36.59% for a score of 4). This indicates that opinion on Hague is sharply divided.
• A higher percentage of people (~55%) have a poor assessment of Hague, having scored 1 or 2 (seen in the table below).
Europe
The boxplot shows no outliers. The median value of the distribution is 6.
• ~22% of people have high skepticism where Europe is concerned; this is the most popular score.
• Grouping scores 1-5 as positive towards Europe covers around 37% of the demographic.
• Taking a score of 6 as neutral, scores 7-11 are negative; 49% of survey takers fall in this sentiment bracket.
Political Knowledge
The boxplot shows no outliers.
• More than 65% of the population have a PK score of 2 or higher, which indicates good general know-how of the political situation.
• ~29% of people have a PK score of 0, which is a cause for concern.
Gender
This column holds the gender distribution of the survey takers. There are more female survey takers than male.
Vote
This column holds the distribution of votes cast by the demographic. This is the target of our model generation, i.e. our goal is for the model to correctly predict whether a citizen will vote for the Labour or the Conservative party.
The data makes it clear that Labour has a majority of votes, which shows that there is a class imbalance in the target variable.
Age Group   NEC Popular Score   NEC Unpopular Score   HEC Popular Score   HEC Unpopular Score
0-25        4                   1/5                   4                   1
25-45       3                   1                     4                   1
45-65       3                   1                     3                   5
65-100      3                   1                     3                   1
Table 1-14 NEC & HEC Across Age Groups
1. HEC and NEC scores follow a similar pattern across all age groups, i.e. the popular and the unpopular scores for both HEC and NEC are around the same. This indicates that voters associate their household economic conditions with those of the nation.
2. The most popular scores are 3 or 4 for both HEC and NEC.
3. The least popular scores are mostly 1. This shows that voters have a generally positive attitude towards the NEC and HEC.
Age Group   Blair Popular Score   Blair Unpopular Score   Hague Popular Score   Hague Unpopular Score
0-25        4                     5                       1                     3/5
25-45       4                     1                       2                     3
45-65       4                     1                       2                     3
65-100      4                     1                       2/4                   3
Table 1-15 Blair & Hague Across Age Groups
3. The 65-100 age group has similar numbers of voters who scored Hague 2 and 4, which shows that this age group is divided in its opinion.
4. The least popular scores for Blair are mostly 1, showing strong support.
5. The least-chosen score for both Blair and Hague is 3. This shows that voters hold strong opinions and do not prefer to score moderately.
1. Almost all age groups claim political knowledge on the higher end. This claim can only be substantiated by a political knowledge quiz or enquiry.
2. The fewest voters claim low political knowledge.
3. The popular score for Europe sentiment varies across age groups, but the unpopular score is consistently low.
Figure 1-13 Voter Age Groups – Europe Scoring & Political Knowledge
1. A clear majority of Conservative voters have given the maximum score for Europe sentiment (indicating negativity).
2. In comparison, although a large number of Labour voters have given the highest score of 11 for Europe sentiment (indicating negativity), a greater number have scored it 6, which indicates moderate sentiment.
3. Overall, Labour voters have scored on the lower end, showing a positive sentiment towards Europe.
1.2.1.12.6 Vote + Gender Relationship
The party preference of the voters has been plotted against their gender, and the insights have been documented.
Following the general vote trend seen in 1.2.1.10 Vote, more voters have selected Labour irrespective of their gender.
Supporting the inference from the pair plot, it is clearly seen that there is little to no correlation between any of the data parameters.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here
or not? Data Split: Split the data into train and test (70:30). (4 marks)
1.3.1 Encoding of Categorical Data
Models are developed on numerical data. Accordingly, we convert the unique values in the categorical columns into numerical values, and this converted data is used in the modelling operations.
The voter dataset has 2 object variables. We convert these columns to numerical form before using them to build the prediction model.
• As these are not ordinal in nature, we can use the one-hot encoding process. This ensures that the machine learning model does not assume that higher values are more important.
• But as each object column in this case has only two values (Male/Female and Labour/Conservative), we can go ahead and do a simple replacement or categorical-codes conversion, as sketched after the table below.
After completion of encoding, the below values are replaced within the dataset.
Gender: Male = 0, Female = 1
Vote:   Conservative = 0, Labour = 1
Table 1-17 Categorical Values to Numerical Number Codes
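A minimal pandas sketch of this replacement; the file name is an assumption, and the mapping matches Table 1-17.

import pandas as pd

# Load the survey data; "Election_Data.csv" is an assumed file name.
df = pd.read_csv("Election_Data.csv")

# Simple replacement works because each object column has exactly two values.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df["Vote"] = df["Vote"].map({"Conservative": 0, "Labour": 1})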
1.3.2 Split of Data – Train and Test
• test_size: This is the size of the test set, expressed as a fraction between 0 and 1. The specified fraction of the data is collected into the test subset.
For example, if a dataset has a total of 1000 rows, specifying test_size=0.3 yields a test subset of 300 rows (30% of 1000) and a training subset of 700 rows (70% of 1000).
• random_state: This input is used to initialize the internal random number generator, which decides how to split the data into train and test subsets.
This input should be set to the same value if consistent results are expected over multiple runs of the code.
The splitting of the data is done using the train_test_split function from the Python module sklearn.model_selection.
1. For the voter dataset, the split operation is performed with the inputs random_state=1, test_size=0.3.
2. The test and train data subsets of independent input variables have the following shapes:
Training subset = 1061 rows, 8 columns
Test subset = 456 rows, 8 columns
3. The test and train subset target variable details as given by the train_test_split operation are as below:
Training subset target data = 1061 rows, 1 column
Test subset target data = 456 rows, 1 column
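A sketch of the split described above, assuming X holds the eight independent features and y the encoded Vote column (with the "Unnamed: 0" index column and duplicate rows dropped earlier).

from sklearn.model_selection import train_test_split

X = df.drop("Vote", axis=1)  # 8 independent features
y = df["Vote"]               # encoded target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)  # (1061, 8) (456, 8) for the voter data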
1.3.3 Scaling Necessity
$$x_{scaled} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x_{scaled}$ is the scaled value of $x$, $x$ is the observed value, and $\min(x)$ and $\max(x)$ are the minimum and maximum values of feature $x$.
Formula 1-1 Min-Max Calculation
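A sketch of this scaling with sklearn's MinMaxScaler; fitting on the training subset and applying the same transform to both subsets is assumed.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Fit min-max scaling (Formula 1-1) on the training features only,
# then apply the same transform to the test features.
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)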
We check the standard deviation ranges of the data before and after the scaling operation. It is
evident that the difference in ranges has reduced post scaling.
A sample of the dataset post scaling is shown below. Only the independent features are displayed
below.
    Age    National Economic Cond.    Household Economic Cond.    Blair    Hague    Europe    Political Knowledge    Gender
0   0.14   0.25                       0.75                        0        0.75     1         0.67                   1
1   0.23   0.75                       0.5                         0.75     0.75     0.5       0                      0
2   0.54   0.75                       0.5                         0.75     0.75     0.6       0.67                   1
3   0.33   0.5                        0.5                         0.75     0.25     1         0                      0
4   0.29   1                          0.5                         0.75     0.25     0.7       0                      0
Table 1-19 Scaled Voter Data
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.4.1 Logistic Regression Model
Logistic regression is a supervised learning technique for a binary response. The two response classes are positive and negative; the output is given as the probability of the positive class based on the values of the predictors.
Log. Reg. Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
Log. Reg. Model Step2- Model Build
The logistic regression model is constructed using the function LogisticRegression from the sklearn.linear_model library. Arguments passed to this function are as below, and a short sketch of the build follows the list.
• solver=’newton-cg’
This is the algorithm to use in the optimization problem.
• max_iter=10000
10,000 is the maximum number of iterations for the solvers to converge.
• penalty='none'
No penalty (regularization) is added to the model.
• tol=0.0001
This is the tolerance value for the stopping criteria.
• verbose=True
Setting this to true allows the progress messages to be printed out
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same
parameters and the same training data.
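A sketch of this build with the arguments listed above, using the subsets from section 1.3.2 and the scaled features from section 1.3.3.

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    solver="newton-cg",
    max_iter=10000,
    penalty="none",  # spelled penalty=None in scikit-learn 1.2+
    tol=0.0001,
    verbose=True,
    random_state=1,
)
log_reg.fit(X_train_scaled, y_train)
print(log_reg.score(X_train_scaled, y_train))  # mean accuracy on train
print(log_reg.score(X_test_scaled, y_test))    # mean accuracy on test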
The mean accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
2. The scores of the models are good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. We shall use the scaled-unbalanced data to train the tuned model with hyper-parameters.
• solver=’svd’
This is the algorithm to use in the optimization problem.
• tol=0.0001
This is the tolerance value for the stopping criteria.
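The 'svd' solver above is specific to linear discriminant analysis, so a minimal sketch assuming sklearn's LinearDiscriminantAnalysis is used for the LDA build:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(solver="svd", tol=0.0001)
lda.fit(X_train_scaled, y_train)
print(lda.score(X_train_scaled, y_train), lda.score(X_test_scaled, y_test))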
The mean accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.5.1 KNN Model
K-Nearest Neighbors or KNN is a supervised learning technique used for classification and
regression. It considers the K nearest data points (neighbors) to predict the class or continuous
value.
KNN Model Step1- Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
KNN Model Step2- Model Build
The KNN model is constructed using the function KNeighborsClassifier from the sklearn.neighbors library. Arguments passed to this function are as below:
• n_neighbors=5
This is the number of neighbors that is used by default i.e. the k value.
• weights= “uniform”
This is the default weight function that is used in prediction.
• algorithm= “auto”
The algorithm that is used to compute the nearest neighbors. “auto” option decides the
most appropriate algorithm based on input values.
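A sketch of this build with the arguments listed above; scaled data is assumed, since KNN is distance-based.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform", algorithm="auto")
knn.fit(X_train_scaled, y_train)
print(knn.score(X_train_scaled, y_train), knn.score(X_test_scaled, y_test))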
The mean accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
• var_smoothing = 1e-9
Portion of the largest variance of all features that is added to variances for calculation
stability.
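A minimal sketch assuming GaussianNB is used for the Naïve Bayes build (var_smoothing is a GaussianNB argument).

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB(var_smoothing=1e-9)
nb.fit(X_train_scaled, y_train)
print(nb.score(X_train_scaled, y_train), nb.score(X_test_scaled, y_test))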
The mean accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.6.1 Bagging Models
Bagging Classifier Model
A Bagging classifier is an ensemble technique that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
1.6.1.1.1 Bagging Classifier Model Step1 – Data Split
The dataset has to be divided into the training and test subsets. This has been covered in section 1.3.2 Split of Data – Train and Test.
1.6.1.1.2 Bagging Classifier Model Step2 – Model Build
The classifier is constructed using the BaggingClassifier function from the sklearn.ensemble library. Arguments passed to this function are as below.
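A minimal sketch of such a build. The base estimator and n_estimators value are assumptions for illustration; the GridSearchCV step in section 1.6.3 tunes base_estimator__ parameters of a decision tree, which motivates the choice below.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagging = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(random_state=1),  # 'estimator' in sklearn 1.2+
    n_estimators=100,  # assumed value for illustration
    random_state=1,
)
bagging.fit(X_train, y_train)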
2. The score of the model is good for both the test and training datasets, indicating good model performance for the prediction of the output.
3. The model trained with unscaled-unbalanced data has performed better than the other models.
4. We shall use the unscaled-unbalanced data to train the tuned model with hyper-parameters.
• criterion = 'gini'
The function to measure the quality of a split. 'gini' is the default value of the criterion argument, so it does not have to be explicitly specified.
• n_estimators = 500
The number of trees in the forest. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.
• oob_score=True
This value decides whether to use out-of-bag samples to estimate the generalization
score.
• max_features=5
The number of features to consider when looking for the best split.
• max_depth = 8
The maximum depth of the tree. If no input is specified, then nodes are expanded until
all leaves are pure or until all leaves contain less than “min_samples_split” samples.
• min_samples_leaf=20
The minimum number of samples required to be at a leaf node. A split point at any depth
will only be considered if it leaves at least “min_samples_leaf” training samples in each
of the left and right branches.
Generally, this value will be 1% to 3% of the total number of data points.
• min_samples_split=60
The minimum number of samples required to split an internal node.
Generally, this value will be three times the value set for “min_samples_leaf”.
• random_state=1
This makes the model’s output replicable. The model will always produce the same results
when it has a definite value of random_state and if it has been given the same
parameters and the same training data.
The constructed model is then used to fit the training dataset in order to complete the model
training operation.
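A sketch of the Random Forest build with the arguments listed above; the OOB score and feature importances it exposes are discussed in the next step.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    criterion="gini",
    n_estimators=500,
    oob_score=True,
    max_features=5,
    max_depth=8,
    min_samples_leaf=20,
    min_samples_split=60,
    random_state=1,
)
rf.fit(X_train, y_train)
print(rf.oob_score_)  # out-of-bag validation score
print(dict(zip(X_train.columns, rf.feature_importances_)))  # feature importance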
1.6.1.2.3 RF Modelling Step3 – Checking Features’ Importance and OOB Score
The generated model gives the importance of each of the features that will impact the output.
Out of bag (OOB) score is a way of validating the Random forest model. It is computed as the
number of correctly predicted rows from the out-of-bag sample. Both these data are tabulated
below.
We can see that the Blair, Hague and Europe features hold the most influence as compared to the other features. The order of importance of Blair and Hague is interchanged between the two tabulated outputs.
The accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
• n_estimators = 500
The number of boosting stages to perform.
• random_state=1
Controls the random seed given at each `base_estimator` at each boosting iteration.
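Taking this subsection to be the Ada Boost build (the boosting model whose figures appear later), a minimal sketch with the two arguments above:

from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=500, random_state=1)
ada.fit(X_train, y_train)
print(ada.score(X_train, y_train), ada.score(X_test, y_test))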
The constructed model is then used to fit the training dataset in order to complete the model training operation. The mean accuracies of the models built using the scaled-unbalanced and the scaled-balanced datasets are as follows.
• n_estimators = 500
The number of trees used in the boosting stages. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.
• random_state=1
Controls the random seed given to each Tree estimator at each boosting iteration.
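Taking this subsection to be the Gradient Boosting build, a minimal sketch with the two arguments above:

from sklearn.ensemble import GradientBoostingClassifier

gbm = GradientBoostingClassifier(n_estimators=500, random_state=1)
gbm.fit(X_train, y_train)
print(gbm.score(X_train, y_train), gbm.score(X_test, y_test))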
The constructed model is then used to fit the training dataset in order to complete the model training operation. The accuracies of the models built using each of the following dataset variants are as follows.
• Unscaled-Unbalanced data
• Scaled-Unbalanced Data
• Scaled-Balanced Data
Based on the scores for this data, the dataset producing the best results is selected and then used for building a tuned model.
After the GridSearchCV function execution is complete, below is the set of best selected parameters for the model.
After the GridSearchCV function execution is complete, below is the set of best selected
parameters for the model.
{'l1_ratio': 0.25,
'max_iter': 10000,
'penalty': 'l1',
'solver': 'saga',
'tol': 0.01}
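A hedged sketch of the tuning step: the exact grid is not shown above, so this one is constructed around the reported best parameters.

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "solver": ["saga"],             # saga supports all three penalties
    "l1_ratio": [0.25, 0.5, 0.75],  # used only when penalty='elasticnet'
    "tol": [0.01, 0.0001],
    "max_iter": [10000],
}
grid = GridSearchCV(
    LogisticRegression(random_state=1),
    param_grid,
    cv=5,  # assumed fold count
    scoring="accuracy",
)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print(grid.best_estimator_.score(X_test_scaled, y_test))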
Conclusion: The tuned model has performed slightly better than the simple default model.
After the GridSearchCV function execution is complete, below is the set of best selected
parameters for the model.
{'solver': 'svd',
'tol': 1e-05}
Conclusion: The tuned model has not given better results as compared to the simple model.
As per the above tables and the plot, we deduce that the optimum value for K is 9. We build a model with K=9.
1.6.3.4.1 KNN Model - Simple Model Vs Tuned Model
We compare these results with the KNN model that was previously created in 1.5.1 KNN Model.
1. The scores for the train data deteriorated a little, but the score for the test dataset has improved.
2. The GridSearchCV operation has given positive results.
Conclusion: The tuned model has performed better than the simple default model.
After the GridSearchCV function execution is complete, the set of best selected parameters for
the model is as follows.
{'var_smoothing': 0.16114142772530193}
Conclusion: The tuned model has performed slightly better than the simple default model.
The parameters with base_estimator__ are options for the DecisionTreeClassifier model
which is the base estimator input to the BaggingClassifier. After the function execution is
complete, we check the best selected parameters from this.
{'base_estimator__max_depth': 4,
'base_estimator__min_samples_leaf': 15,
'base_estimator__min_samples_split': 45,
'n_estimators': 175}
Conclusion: The simple model has performed slightly better for the test data, but as the tuned model is not over-fitted, the tuned model is better.
After the function execution is complete, we check the best selected parameters from this.
{'max_depth': 6,
'max_features': 4,
'min_samples_leaf': 25,
'min_samples_split': 60,
'n_estimators': 300}
With these values set, we recheck the features' importance. There is a slight change in the feature importance as compared to the values calculated in Table 1-26 RF-Computed Importance for All Features.
The top three features by importance have remained the same.
Columns Importance
Hague 0.3517
Blair 0.2630
Europe 0.2141
National Economic Cond. 0.0666
Political Knowledge 0.0553
Age 0.0321
Household Economic Cond. 0.0130
Gender 0.0044
Table 1-37 RF-Computed Importance for All Features – GridSearchCV Best Parameters
Conclusion: The tuned model has performed slightly better as compared to the simple model.
Post execution of GridSearchCV, we check the best selected parameters from this.
Conclusion: The tuned model has not given better results as compared to the simple model.
Post execution of GridSearchCV, we check the best selected parameters from this.
{'base_estimator__max_depth': 4,
'base_estimator__min_samples_leaf': 15,
'base_estimator__min_samples_split': 45,
'n_estimators': 175}
Conclusion: The tuned model has performed better than the simple default model.
Model                              Train Score   Test Score   Best Parameters (when applicable)
Logistic Regression Scaled         83.13         83.55
Logistic Regression Smote          83.36         80.70
GridSearchCV Logistic Regression   83.13         82.89        {'l1_ratio': 0.25, 'max_iter': 10000, 'penalty': 'l1', 'solver': 'saga', 'tol': 0.01}
GridSearchCV Bagging Classifier    84.44         80.70        {'base_estimator__max_depth': 4, 'base_estimator__min_samples_leaf': 15, 'base_estimator__min_samples_split': 45, 'n_estimators': 175}
1.7 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score
for each model. Final Model: Compare the models and write inference which
model is best/optimized. (7 marks)
Using the best models generated (see section 1.6.4 All Model Scores), detailed performance parameters are computed and compared. The performance parameters are the confusion matrix, the classification report, the AUC score and the ROC curve.
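As a sketch, the pattern used to produce these metrics for one fitted model (logistic regression shown here) looks like this; the same pattern is repeated for every model compared below.

import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

pred = log_reg.predict(X_test_scaled)
prob = log_reg.predict_proba(X_test_scaled)[:, 1]  # probability of class 1 (Labour)

print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print("AUC:", roc_auc_score(y_test, prob))

fpr, tpr, _ = roc_curve(y_test, prob)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()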
Using the confusion matrix, the metrics accuracy, precision, recall and specificity are derived.
1.7.1.1.1 Accuracy
Accuracy (ACC) is the number of all correct predictions divided by the total number of predictions on the dataset. The best accuracy is 1.0, whereas the worst is 0.0.
$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Formula 1-2 Confusion Matrix – Accuracy
Accuracy is not the best metric to check, especially if the dataset is imbalanced; in such cases it does not give a correct understanding of performance. To mitigate this, we use the additional metrics of precision and recall.
1.7.1.1.2 Precision
Precision (PREC) is calculated as the number of correct positive predictions divided by the total number of positive predictions. It tells us how many of the predicted positive cases are actually positive.
It is also called positive predictive value (PPV). The best precision is 1.0, whereas the worst is 0.0.
Precision is a useful metric in cases where False Positives are a higher concern than False Negatives (e.g. in e-commerce recommendations, wrong results could lead to customer churn).
$$precision = \frac{TP}{TP + FP}$$
Formula 1-3 Confusion Matrix – Precision
1.7.1.1.3 Recall/Sensitivity
Recall is calculated as the number of correct positive predictions divided by the total number of
positives i.e. the actual positive cases we were able to predict correctly with our model.
$$recall = \frac{TP}{TP + FN}$$
Formula 1-4 Confusion Matrix – Recall
It is also referred to as the true positive rate (TPR). The best recall is 1.0, whereas the worst is 0.0. Recall is a useful metric in cases where False Negatives are a higher concern than False Positives (e.g. in medical diagnosis, raising a false alarm may be safer than missing a case).
1.7.1.1.4 Specificity
Specificity is calculated as the number of correct negative predictions divided by the total number
of negatives.
$$specificity = \frac{TN}{TN + FP}$$
Formula 1-5 Confusion Matrix – Specificity
It is also referred to as the true negative rate (TNR). The best specificity is 1.0, whereas the worst is 0.0.
1.7.1.1.5 F1 Score
Recall and precision tend to trade off against each other. The best way to capture both is to use a combination of the two, which gives us the F1-score metric. The F1 score is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0.
$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$$
Formula 1-6 Confusion Matrix – F1 Score
The interpretability of the F1-score is poor on its own. Using it in combination with the other evaluation metrics gives us a complete picture of the result.
1.7.1.1.6 Classification Report
This report displays the precision, recall, F1 and support scores for the created model. It is generated by the classification_report function of the sklearn.metrics module. A sample report is shown below:
precision recall f1-score support
0 0.78 0.91 0.84 300
1 0.71 0.47 0.57 600
accuracy 0.77 900
macro avg 0.75 0.69 0.70 900
weighted avg 0.76 0.77 0.75 900
Table 1-42 Sample Classification Report
2. We also check the other scores and hope to see them fairly balanced. If a party is going to use this model to predict its chances of victory in particular areas, recall is the better metric to track. Recall lets us know how accurately the model is able to identify the party that a voter will choose.
4. The AUC is checked to see if the value is high. The curve shape is also checked to see if it extends up to the top-left corner.
• 0 indicates Conservative
• 1 indicates Labour
Figure 1-27 Log. Reg. Training Data Confusion Matrix
Figure 1-28 Log. Reg. Test Data Confusion Matrix
Figure 1-29 Log. Reg. Training Data ROC-AUC Curve
Figure 1-30 Log. Reg. Test Data ROC-AUC Curve
Training data:
• Accuracy is fairly high for the training data prediction.
• Recall and F1-score are high only for Labour as compared to Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy is fairly high for the test data prediction.
• Recall and F1-score are much better for both Labour and Conservative as compared to the training data scores.
• The AUC score indicates that the model is good.
Table 1-45 Log. Reg. Metrics – Train and Test Data
Figure 1-31 LDA Training Data Confusion Matrix
Figure 1-32 LDA Test Data Confusion Matrix
Figure 1-33 LDA Training Data ROC-AUC Curve
Figure 1-34 LDA Test Data ROC-AUC Curve
Training data:
• Accuracy is fairly high for the training data prediction.
• Recall and F1-score are high only for Labour as compared to Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy is fairly high for the test data prediction.
• Recall and F1-score are much better for both Labour and Conservative as compared to the training data scores.
• The AUC score indicates that the model is good.
Table 1-48 LDA Metrics – Train and Test Data
Figure 1-35 KNN Training Data Confusion Matrix
Figure 1-36 KNN Test Data Confusion Matrix
Figure 1-37 KNN Training Data ROC-AUC Curve
Figure 1-38 KNN Test Data ROC-AUC Curve
Training data:
• Accuracy is high for the training data prediction.
• Recall and F1-score are high for both Labour and Conservative.
• The AUC score indicates that the model is very good.
Test data:
• Accuracy has improved for the test data prediction.
• Recall and F1-score are much better for both Labour and Conservative as compared to the training data scores.
• The AUC score indicates that the model is very good.
Table 1-51 KNN Metrics – Train and Test Data
Figure 1-39 Naïve Bayes Training Data Confusion Matrix
Figure 1-40 Naïve Bayes Test Data Confusion Matrix
Figure 1-41 Naïve Bayes Training Data ROC-AUC Curve
Figure 1-42 Naïve Bayes Test Data ROC-AUC Curve
Training data:
• Accuracy is high for the training data prediction.
• Recall and F1-score are high only for Labour as compared to Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy is high for the test data prediction.
• Recall and F1-score are better only for Conservative and have slightly reduced for Labour as compared to the training data scores.
• The AUC score indicates that the model is good.
Table 1-54 Naïve Bayes Metrics – Train and Test Data
Figure 1-43 Bagging Classification Training Data Confusion Matrix
Figure 1-44 Bagging Classification Test Data Confusion Matrix
Figure 1-45 Bagging Classification Training Data ROC-AUC Curve
Figure 1-46 Bagging Classification Test Data ROC-AUC Curve
Training data:
• Accuracy is fairly high for the training data prediction.
• Recall and F1-score are high for Labour and decent for Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy has reduced for the test data prediction.
• Recall and F1-score are lower for both Labour and Conservative as compared to the training data scores.
• The AUC score indicates that the model is good.
Table 1-57 Bagging Classification Metrics – Train and Test Data
Figure 1-47 Random Forest Bagging Training Data Confusion Matrix
Figure 1-48 Random Forest Bagging Test Data Confusion Matrix
Figure 1-49 Random Forest Bagging Training Data ROC-AUC Curve
Figure 1-50 Random Forest Bagging Test Data ROC-AUC Curve
Training data:
• Accuracy is high for the training data prediction.
• Recall and F1-score are high only for Labour as compared to Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy has reduced for the test data but is still fairly high.
• Recall and F1-score have reduced for both Labour and Conservative as compared to the training data scores.
• The AUC score indicates that the model is good.
Table 1-60 Random Forest Bagging Metrics – Train and Test Data
Figure 1-51 Ada Boost Training Data Confusion Matrix
Figure 1-52 Ada Boost Test Data Confusion Matrix
Figure 1-53 Ada Boost Training Data ROC-AUC Curve
Figure 1-54 Ada Boost Test Data ROC-AUC Curve
Training data:
• Accuracy is very high for the training data prediction.
• Recall and F1-score are high for both Labour and Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy is high for the test data prediction.
• Recall and F1-score have reduced for both Labour and Conservative as compared to the training data scores, but are still high.
• The AUC score indicates that the model is good.
Table 1-63 Ada Boost Metrics – Train and Test Data
Figure 1-55 Gradient Boost Training Data Confusion Matrix
Figure 1-56 Gradient Boost Test Data Confusion Matrix
Figure 1-57 Gradient Boost Training Data ROC-AUC Curve
Figure 1-58 Gradient Boost Test Data ROC-AUC Curve
Training data:
• Accuracy is very high for the training data prediction.
• Recall and F1-score are high for both Labour and Conservative.
• The AUC score indicates that the model is good.
Test data:
• Accuracy is high for the test data prediction.
• Recall and F1-score have reduced for both Labour and Conservative as compared to the training data scores, but are still high.
• The AUC score indicates that the model is good.
Table 1-66 Gradient Boost Metrics – Train and Test Data
The ROC curves for the training data show the best scores for the Ada Boost and Gradient Boost models.
The ROC curves for the test data are almost the same for all models; the best scores are seen for the Gradient Boost and KNN models.
1. Data on positive government actions in the constituencies where the Conservative party won should be gathered to understand what appealed to the citizens there. Propagating this positive work can help turn the tide in the Conservative party's favour in future elections.
In similar fashion, the Labour party should collect data in the same constituencies to identify its shortcomings and gauge general public opinion.
2. The high skepticism towards Europe should be understood and handled. This may be a key factor in ensuring victory in the next elections.
3. The Labour party has a high number of supporters with a political knowledge score of 0. Addressing this issue may cause a shift in public opinion.
2 Problem 2 Statement
In this particular project, we are going to work on the inaugural corpus from the NLTK in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned
documents. – 3 Marks
2.1.1 Number of Characters
The number of characters in each speech is computed with the raw function of the NLTK inaugural corpus. This function returns the whole speech as a single string.
Applying the len function to this output gives us the number of characters in each speech. The speech samples returned by the raw function, along with the character counts, have been tabulated below.
The same data has been plotted for easy visual comparison.
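A sketch of these counts (and the word and sentence counts discussed next) using the NLTK corpus readers:

import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")

# raw() returns each speech as a single string; words() and sents() tokenize it.
for fileid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(fileid,
          len(inaugural.raw(fileid)), "characters,",
          len(inaugural.words(fileid)), "words,",
          len(inaugural.sents(fileid)), "sentences")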
As with the character count, the 1973-Nixon speech holds the highest word count, more than 500 words above the other speeches.
The 1941-Roosevelt and 1961-Kennedy speeches are very close in word count.
The 1973-Nixon and 1941-Roosevelt sentence counts are very close to each other. The 1961-Kennedy speech has fewer sentences than the other two speeches.
2.2 Remove all the stop words from all three speeches. – 3 Marks
2.2.1 Lower Case Words
Before any stemming or stop-word removal is undertaken, all the words in each speech are converted to lower case. This helps in correct identification, as “The” is not the same as “the”. Removing case dependency by lower-casing all words makes the speech analysis more streamlined.
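A minimal sketch of this cleanup for one speech; dropping the basic punctuation marks alongside the stop words is an assumption based on the before/after samples shown later.

import nltk
from nltk.corpus import inaugural, stopwords

nltk.download("stopwords")
nltk.download("punkt")

# Lower-case, tokenize, then drop English stop words and basic punctuation.
stop_words = set(stopwords.words("english")) | {",", ".", ";", ":", "?", "!"}
tokens = nltk.word_tokenize(inaugural.raw("1941-Roosevelt.txt").lower())
cleaned = [w for w in tokens if w not in stop_words]
print(len(tokens), "->", len(cleaned))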
Conclusion on Stemming
In this case, the words of the speeches will be used to create word clouds. Taking this into
account, we are not going to use the stemmed data output in further operations.
Speech Count Before Stop Words Clearance Count After Stop Words Clearance
1941-Roosevelt 1536 657
1961-Kennedy 1546 722
1973-Nixon 2028 853
Table 2-4 Speech Output Post Cleanup
74
Machine Learning - Project
The differences in the words before and after stop-word cleanup are shown below. The words and punctuation marks that appear only in the “Before” sample have been removed by the cleanup operation.
1941-Roosevelt – Before and After Stop Words Cleanup
Before:
'on', 'each', 'national', 'day', 'of', 'inauguration', 'since', '1789', ',', 'the', 'people', 'have',
'renewed', 'their', 'sense', 'of', 'dedication', 'to', 'the', 'united', 'states', '.', 'in',
'washington', "'", 's', 'day', 'the', 'task', 'of', 'the', 'people', 'was', 'to', 'create', 'and',
'weld', 'together', 'a', 'nation', '.',
After:
'national', 'day', 'inauguration', 'since', '1789', 'people', 'renewed', 'sense', 'dedication',
'united', 'states', 'washington', 'day', 'task', 'people', 'create', 'weld', 'together', 'nation',
1961-Kennedy – After Stop Words Cleanup
After:
'vice', 'president', 'johnson', 'mr', 'speaker', 'mr', 'chief', 'justice', 'president', 'eisenhower',
'vice', 'president', 'nixon', 'president', 'truman', 'reverend', 'clergy', 'fellow', 'citizens',
'observe', 'today', 'victory', 'party', 'celebration', 'freedom', '--', 'symbolizing', 'end',
2.3 Which word occurs the most number of times in his inaugural address for
each president? Mention the top three words. (after removing the stop
words) – 3 Marks
2.3.1 Most Frequent Words in Cleaned Speech
We see that the words “--” and “let” have a high frequency. These do not give any useful information, so we add them to the stop-words list and clean the data again.
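A sketch of the frequency check using NLTK's FreqDist, with the stop-word set extended as described above:

import nltk
from nltk.corpus import inaugural, stopwords
from nltk.probability import FreqDist

# Extend the stop words with the uninformative tokens, then re-clean and count.
stop_words = set(stopwords.words("english")) | {",", ".", ";", ":", "?", "!", "--", "let"}
tokens = nltk.word_tokenize(inaugural.raw("1941-Roosevelt.txt").lower())
cleaned = [w for w in tokens if w not in stop_words]
print(FreqDist(cleaned).most_common(3))  # top three words for this speech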
Word         Frequency
“let”        22
“america”    21
Table 2-5 Words with Top Frequency
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop words) – 3 Marks
Word clouds are cloud-like depictions of words in which the more frequently a specific word appears in the source text, the bigger and bolder it appears in the word cloud.
The three speeches, cleaned of stop words, have their word clouds constructed, and these images are shown below. The words appearing in the word clouds are marked in the speech descriptions.
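A sketch for one speech using the third-party wordcloud package; the styling options are assumptions.

import nltk
import matplotlib.pyplot as plt
from nltk.corpus import inaugural, stopwords
from wordcloud import WordCloud

# Re-clean one speech and render its word cloud; bigger words are more frequent.
stop_words = set(stopwords.words("english")) | {"--", "let"}
tokens = nltk.word_tokenize(inaugural.raw("1941-Roosevelt.txt").lower())
cleaned = [w for w in tokens if w not in stop_words and w.isalpha()]

wc = WordCloud(background_color="white", width=800, height=400)
wc.generate(" ".join(cleaned))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()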