Machine Learning Project
Name: Rasmita Mallick
Date: 5 September 2021
Table of Contents
1 Problem 1
a) Problem 1.1
b) Problem 1.2
c) Problem 1.3
d) Problem 1.4
e) Problem 1.5
f) Problem 1.6
g) Problem 1.7
h) Problem 1.8
2 Problem 2
a) Problem 2.1
b) Problem 2.2
c) Problem 2.3
d) Problem 2.4
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyse the recent
elections. The survey was conducted on 1525 voters with 9 variables. You have to build a
model to predict which party a voter will vote for on the basis of the given information, in
order to create an exit poll that will help in predicting the overall win and the seats covered
by a particular party.
Data Dictionary:
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it.
Solution-
As a first step, we imported all the necessary libraries and then read the Excel file
(Election_Data.xlsx) into Python for further analysis.
After loading the data, we inspected it using head() and then dropped the unnamed index
column.
The data consists of 1525 rows and 9 attributes: vote, age, economic.cond.national,
economic.cond.household, Blair, Hague, Europe, political.knowledge and gender.
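A minimal sketch of this loading step is shown below; the column name "Unnamed: 0" for the dropped index column is an assumption, since the report only mentions an "unnamed column".

```python
# Sketch of the data-loading step described above.
import pandas as pd

df = pd.read_excel("Election_Data.xlsx")
print(df.head())

# Drop the auto-generated index column; "Unnamed: 0" is the usual name,
# assumed here since the report only says "unnamed column".
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

print(df.shape)   # expected: (1525, 9)
df.info()         # data types and non-null counts
```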
From the above output we can see that there are 7 integer columns and 2 object columns,
and there are no null values in the given data.
There is a clear class imbalance in the target variable 'vote': the Labour class constitutes
about 70% of the votes, whereas the Conservative class has only about 30%.
Describe for numerical and categorical columns: the describe() method computes summary
statistics such as count, mean, standard deviation and percentiles for the numerical columns
of a Series or DataFrame.
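The following sketch shows how these checks can be reproduced with pandas; the exact calls used in the report are not shown, so this is an illustrative equivalent.

```python
# Null-value check and summary statistics for numerical and categorical columns.
print(df.isnull().sum())                        # no nulls expected
print(df.describe().T)                          # numerical columns
print(df.describe(include="object").T)          # categorical columns (vote, gender)
print(df["vote"].value_counts(normalize=True))  # class balance of the target
```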
Then we checked for duplicate rows in the data using the duplicated() function and found
that there are 8 duplicate rows.
We dropped all duplicated rows using drop_duplicates(), and we can now see that there are
no duplicates left in the data.
We also checked for skewness in the data: only Hague and age are positively skewed; the
rest are negatively skewed.
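A short sketch of the duplicate and skewness checks, assuming the standard pandas calls:

```python
# Duplicate handling and skewness check.
print(df.duplicated().sum())          # reported as 8 duplicate rows
df = df.drop_duplicates()
print(df.duplicated().sum())          # should now be 0

# Skewness per numeric column: positive = right tail, negative = left tail.
print(df.skew(numeric_only=True))
```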
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers.
Solution-
Exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods.
Univariate Analysis
We used distplots and boxplots for univariate analysis. They help in understanding the
distribution of the data and in visualizing outliers as well as the positions of the quartiles.
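A sketch of these univariate plots using seaborn is given below; histplot is used here as the current equivalent of the deprecated distplot mentioned in the report.

```python
# One distribution plot and one boxplot per numeric column.
import matplotlib.pyplot as plt
import seaborn as sns

for col in df.select_dtypes(include="number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])   # distribution
    sns.boxplot(x=df[col], ax=axes[1])            # outliers and quartiles
    fig.suptitle(col)
    plt.show()
```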
From the above graphs we can see that there are outliers in the economic.cond.national and
economic.cond.household boxplots.
From the distributions we can see that most of the voters rate the national economic
condition as moderate.
Blair has been given higher ratings by voters compared to Hague.
None of the attributes follows a normal distribution.
The distribution of Europe is left skewed, which shows that most of the participants are
aware of the parties' positions on European integration and lean towards a 'Eurosceptic' sentiment.
Bivariate Analysis-
Bivariate analysis is performed to find the relationship between two variables, typically between
each variable in the dataset and the target variable of interest.
From the bar plot below we can see that the distribution of male and female voters is close to
equal in the Labour class, whereas in the Conservative class there are more male voters than
female ones.
From the plot below we can see that most of the voters/participants rate the national economic
condition as 3 or 4. We can also observe that the majority of voters fall in the age group of
30 to 70.
We can see that none of the variables are highly correlated.
We plotted a heatmap, in which each value is represented by a shade of the same colour; the
darker shades usually represent higher values than the lighter shades. We can see below that
most of the features are not highly correlated, so there is no strong multicollinearity among
the variables.
The ratings of household economic condition and national economic condition have the highest
correlation in the whole data, i.e. 0.35.
Blair and national economic condition have a correlation of 0.33.
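A minimal sketch of the correlation heatmap:

```python
# Correlation matrix of the numeric columns, plotted as an annotated heatmap.
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues")
plt.show()
```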
From the above box plots, we can see that there are outliers present in economic condition
household and national economic condition. As we know, many machine learning algorithms are
sensitive to outliers; hence, we need to treat them before further analysis.
These outlier values need to be treated, and there are several ways of doing so, for example by
dropping the outlier rows or by capping the outlier values using the IQR.
Here, we used the IQR method to treat the outliers in the given data. From the plot below we can
see that the outliers in the data have been treated.
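A sketch of IQR-based capping, one common way to implement this treatment; the report does not show the exact code, so the 1.5*IQR fences below are an assumption.

```python
# Cap values outside the 1.5*IQR fences at the fences themselves.
def treat_outliers_iqr(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ["economic.cond.national", "economic.cond.household"]:
    df[col] = treat_outliers_iqr(df[col])
```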
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or
not? Data Split: Split the data into train and test (70:30).
Solution- Before splitting the data, we created dummy variables for the categorical
variables.
We then scaled the continuous variables using the min-max technique, since they are measured
on different scales and would otherwise carry different weight.
The concept of standardization/scaling comes into the picture when continuous independent
variables are measured at different scales, which means these variables do not contribute
equally to the analysis. When the ranges of values are very different across columns, we need
to bring them to a common level, after which the data can be used for further analysis.
We then copied all predictor variables into the X dataframe and the target into the y dataframe,
and split X and y into training and test sets in a 70:30 ratio.
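The sketch below reproduces this preparation step; the random_state value is an arbitrary assumption.

```python
# Dummy encoding, min-max scaling and the 70:30 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = pd.get_dummies(df.drop(columns="vote"), drop_first=True)  # encodes gender
y = df["vote"]

# Bring all predictors onto a common 0-1 range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
```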
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
Solution-
Logistic regression is a linear model for classification rather than regression. It is also known
as logit regression. In this model, the probabilities describing the possible outcomes of a
single trial are modelled using a logistic function.
Note: Regularization is applied by default, which is common in machine learning but not in
statistics. Another advantage of regularization is that it improves numerical stability. No
regularization amounts to setting C to a very high value.
Predicting on Training and Test dataset and Getting the Predicted Classes and Probs.
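A sketch of this step, using scikit-learn's default (regularised) logistic regression:

```python
# Fit logistic regression, then get predicted classes and probabilities.
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=10000)
logit.fit(X_train, y_train)

train_pred, test_pred = logit.predict(X_train), logit.predict(X_test)
train_prob, test_prob = logit.predict_proba(X_train), logit.predict_proba(X_test)
print(logit.score(X_train, y_train), logit.score(X_test, y_test))  # accuracies
```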
Performance metrics on test data
Confusion matrix on training data
Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is an algorithm for classification predictive modelling
problems. LDA relies on statistical properties of the data, calculated for each class. For a
single input variable (x), these are the mean and the variance of the variable for each class.
For multiple variables, they are the same properties calculated over the multivariate Gaussian,
namely the means and the covariance matrix.
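A sketch of fitting LDA and producing the metrics discussed in the following figures:

```python
# Fit LDA and evaluate with a classification report and ROC-AUC.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, roc_auc_score

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

print(classification_report(y_test, lda.predict(X_test)))
# Column 1 of predict_proba corresponds to the alphabetically greater class
# label ("Labour" here), which is what roc_auc_score expects for binary data.
print(roc_auc_score(y_test, lda.predict_proba(X_test)[:, 1]))
```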
AUC and ROC on test data
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
Solution-
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’
theorem with the “naive” assumption of conditional independence between every pair of
features given the value of the class variable. Bayes’ theorem states the following
relationship, given class variable y and dependent feature vector x1 through xn.
We imported GaussianNB from the sklearn.naive_bayes package and fit the model on the training
dataset.
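A minimal sketch of that step:

```python
# Fit Gaussian Naive Bayes and check accuracy on train and test sets.
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_train, y_train), nb.score(X_test, y_test))
```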
We then computed the predictions and the model's accuracy score.
Confusion matrix on training data
K-Nearest Neighbors
Source: scikit-learn
We imported KNeighborsClassifier from sklearn.neighbors and fit the model on the training
dataset.
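A minimal sketch; the number of neighbours used in the report is not stated, so the scikit-learn default of 5 is assumed.

```python
# Fit a KNN classifier and check accuracy on train and test sets.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # default neighbour count, assumed
knn.fit(X_train, y_train)
print(knn.score(X_train, y_train), knn.score(X_test, y_test))
```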
AUC and ROC on Test data
Confusion matrix on test data
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting.
Solution-
Random Forest
In random forests, each tree in the ensemble is built from a sample drawn with replacement
(i.e., a bootstrap sample) from the training set.
Furthermore, when splitting each node during the construction of a tree, the best split is found
either from all input features or from a random subset of size max_features.
The purpose of these two sources of randomness is to decrease the variance of the forest
estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit.
The injected randomness in forests yields decision trees with somewhat decoupled prediction
errors, and by taking an average of those predictions some errors can cancel out. Random
forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a
slight increase in bias. In practice the variance reduction is often significant, hence yielding an
overall better model.
Source: scikit-learn
We imported RandomForestClassifier from sklearn.ensemble and fit the model on the training
data set.
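A minimal sketch; hyperparameters are left at their defaults because the report does not state them.

```python
# Fit a random forest and check accuracy on train and test sets.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1)  # random_state is an assumed value
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))
```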
Performance Metrics: Random Forest
Here we can see that the model performs very well on the training data set but not on the
test data set: accuracy and F1 score drop on the test data.
Confusion matrix on training data
Bagging Classifier
A Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on random
subsets of the original dataset, and then aggregates their individual predictions (either by voting
or by averaging) to form a final prediction.
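Since the report applies Random Forest for bagging, a sketch could wrap a random forest inside BaggingClassifier as below; n_estimators and random_state are assumed values, and scikit-learn >= 1.2 is assumed (older versions use the base_estimator argument).

```python
# Bagging with a random forest as the base estimator.
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bagging = BaggingClassifier(
    estimator=RandomForestClassifier(random_state=1),  # base learner
    n_estimators=50,                                   # assumed value
    random_state=1)
bagging.fit(X_train, y_train)
print(bagging.score(X_train, y_train), bagging.score(X_test, y_test))
```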
AUC & ROC on training data
Confusion matrix on test data
Ada Boosting
The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced
in 1995 by Freund and Schapire.
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only
slightly better than random guessing, such as small decision trees) on repeatedly modified
versions of the data. The predictions from all of them are then combined through a weighted
majority vote (or sum) to produce the final prediction.
The number of weak learners is controlled by the parameter n_estimators. The learning_rate
parameter controls the contribution of the weak learners in the final combination. By default,
weak learners are decision stumps. Different weak learners can be specified through the
base_estimator parameter. The main parameters to tune to obtain good results are
n_estimators and the complexity of the base estimators (e.g., its depth max_depth or
minimum required number of samples to consider a split min_samples_split).
Source: scikit-learn
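A minimal AdaBoost sketch; n_estimators is an assumed value and the base learner is the default decision stump.

```python
# Fit AdaBoost and check accuracy on train and test sets.
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100, random_state=1)  # assumed values
ada.fit(X_train, y_train)
print(ada.score(X_train, y_train), ada.score(X_test, y_test))
```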
Performance Metrics: Ada Boosting
Confusion matrix on Training data
Gradient Boosting
Gradient boosting is a machine learning technique for regression and classification problems,
which produces a prediction model in the form of an ensemble of weak prediction models,
typically decision trees.
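A minimal gradient boosting sketch with default hyperparameters:

```python
# Fit gradient boosting and check accuracy on train and test sets.
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=1)  # defaults, assumed
gbc.fit(X_train, y_train)
print(gbc.score(X_train, y_train), gbc.score(X_test, y_test))
```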
AUC & ROC on test data
Confusion matrix on test data
Model Tuning
We imported GridSearchCV from sklearn.model_selection and used it to tune and fit each model.
Logistic Regression - Grid search
We fit the logistic regression model using grid search and found the best parameters and the
best estimator.
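A sketch of grid search for logistic regression; the parameter grid below is illustrative, not the exact grid used in the report.

```python
# Exhaustive grid search with 5-fold cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10],          # assumed, illustrative grid
              "solver": ["lbfgs", "liblinear"]}

grid = GridSearchCV(LogisticRegression(max_iter=10000),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.best_estimator_.score(X_test, y_test))
```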
AUC and ROC on training data
AUC and ROC on test data
We then fit the model using grid search and found the best parameters and the best estimator.
Confusion and Classification matrix on the training data
AUC & ROC on Test Data
We built the Gaussian Naive Bayes model using grid search and fit it on the data.
Confusion and Classification matrix on the training data
AUC & ROC on test Data
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized.
Solution-
Both Logit and LDA have performed well on the training and test data.
This also implies that the models are neither under- nor over-fitted in either scenario.
The accuracy score of both models is 0.83, and the AUC score of both models is 0.89.
For both models, the F1 score and recall are better for the Labour class than for the
Conservative class.
Model | Class | Accuracy (Train/Test) | Precision (Train/Test) | F1 Score (Train/Test) | Recall (Train/Test) | AUC (Train/Test)
Logit | Conservative | 0.83 / 0.83 | 0.75 / 0.76 | 0.69 / 0.74 | 0.64 / 0.73 | 0.89 / 0.89
Logit | Labour | 0.83 / 0.83 | 0.86 / 0.86 | 0.89 / 0.87 | 0.91 / 0.88 | 0.89 / 0.89
LDA | Conservative | 0.83 / 0.83 | 0.74 / 0.76 | 0.69 / 0.74 | 0.65 / 0.73 | 0.89 / 0.89
LDA | Labour | 0.83 / 0.83 | 0.86 / 0.86 | 0.89 / 0.87 | 0.91 / 0.88 | 0.89 / 0.89
In the Naïve Bayes model, accuracy on the training set is 0.83 and on the test set 0.82. The
scores are very close, which indicates that the model performs well on both the train and test
sets and there is no over- or under-fitting.
The AUC score is the same on both the train and test sets, i.e. 0.90, for the NB model.
In the KNN model, accuracy on the training set is 0.86 and on the test set 0.83. The scores are
very close, which indicates that the model performs well on both sets and there is no over- or
under-fitting.
The AUC score is the same on both the train and test sets, i.e. 0.93, for the KNN model.
For both models, the F1 score, precision and recall are better for the Labour class than for the
Conservative class.
If we compare the two models, we can see that KNN performs better than Naïve Bayes.
In the plain Random Forest, the scores for precision, recall, accuracy, F1 and AUC are all 100%
on the training data set, whereas the scores on the test set are poor. This is a clear example
of overfitting.
We then used Random Forest with the bagging technique; the gap between the training and test
scores improves slightly.
In adaptive boosting, the accuracy score is 0.85 on the train set and 0.81 on the test set, and
the AUC score is 0.92 for both the train and test sets.
In gradient boosting, the accuracy score is 0.89 on the train set and 0.83 on the test set, and
the AUC score is 0.95 for both the train and test sets.
For all four models, the F1 score, precision and recall are better for the Labour class than for
the Conservative class.
Model | Class | Accuracy (Train/Test) | Precision (Train/Test) | F1 Score (Train/Test) | Recall (Train/Test) | AUC (Train/Test)
RF | Conservative | 1 / 0.83 | 1 / 0.78 | 1 / 0.73 | 1 / 0.69 | 1 / 1
RF | Labour | 1 / 0.83 | 1 / 0.85 | 1 / 0.88 | 1 / 0.9 | 1 / 1
Bagging | Conservative | 0.97 / 0.83 | 0.98 / 0.78 | 0.94 / 0.73 | 0.9 / 0.68 | 1 / 1
Bagging | Labour | 0.97 / 0.83 | 0.96 / 0.85 | 0.98 / 0.88 | 0.99 / 0.9 | 1 / 1
Ada Boosting | Conservative | 0.85 / 0.81 | 0.76 / 0.75 | 0.73 / 0.71 | 0.7 / 0.67 | 0.92 / 0.92
Ada Boosting | Labour | 0.85 / 0.81 | 0.88 / 0.84 | 0.9 / 0.86 | 0.91 / 0.88 | 0.92 / 0.92
Gradient Boosting | Conservative | 0.89 / 0.83 | 0.84 / 0.79 | 0.81 / 0.73 | 0.78 / 0.68 | 0.95 / 0.95
Gradient Boosting | Labour | 0.89 / 0.83 | 0.91 / 0.85 | 0.93 / 0.88 | 0.94 / 0.91 | 0.95 / 0.95
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the
domain of the hyperparameters into a discrete grid. Then, we try every combination of values
of this grid, calculating some performance metrics using cross-validation.
In LDA grid search, the accuracy score is 0.83 and the AUC is 0.89 for both the train and test sets.
In Logit grid search, the accuracy score is 0.84 on the training data and 0.83 on the test data,
and the AUC is 0.89 for both sets. This shows that the model performs well and there is no
overfitting.
In NB grid search, the accuracy score is 0.83 on the training data and 0.82 on the test data, and
the AUC is 0.90 for both sets.
In KNN grid search, the accuracy score is 0.84 and the AUC is 0.90 for both the train and test sets.
Even after model tuning, we do not see a drastic change in the scores.
Model | Class | Accuracy (Train/Test) | Precision (Train/Test) | F1 Score (Train/Test) | Recall (Train/Test) | AUC (Train/Test)
LDA Grid Search | Conservative | 0.83 / 0.83 | 0.74 / 0.76 | 0.69 / 0.75 | 0.65 / 0.74 | 0.89 / 0.89
LDA Grid Search | Labour | 0.83 / 0.83 | 0.87 / 0.87 | 0.88 / 0.88 | 0.9 / 0.88 | 0.89 / 0.89
Logit Grid Search | Conservative | 0.84 / 0.83 | 0.76 / 0.76 | 0.69 / 0.73 | 0.63 / 0.71 | 0.89 / 0.89
Logit Grid Search | Labour | 0.84 / 0.83 | 0.86 / 0.86 | 0.89 / 0.87 | 0.92 / 0.88 | 0.89 / 0.89
KNN Grid Search | Conservative | 0.84 / 0.84 | 0.76 / 0.8 | 0.71 / 0.74 | 0.66 / 0.69 | 0.9 / 0.9
KNN Grid Search | Labour | 0.84 / 0.84 | 0.87 / 0.85 | 0.89 / 0.88 | 0.92 / 0.91 | 0.9 / 0.9
NB Grid Search | Conservative | 0.83 / 0.82 | 0.72 / 0.74 | 0.71 / 0.73 | 0.69 / 0.73 | 0.9 / 0.9
NB Grid Search | Labour | 0.83 / 0.82 | 0.88 / 0.87 | 0.88 / 0.87 | 0.89 / 0.87 | 0.9 / 0.9
Solution-
All models that performed well on the training data set have also performed well on the
test set.
All the models except Random Forest and Bagging are good to be used for future predictions.
The tuned models do not perform better than the basic ones; so, for predictions, the top two
models can be KNN and Gradient Boosting.
In this case the accuracy score plays an important role, as for this use case both true positive
and true negative predictions are essential to predict the exit poll.
If we compare all the models, the accuracy scores of the KNN and Gradient Boosting models are
good on both the train and test sets, so these can be used for future predictions.
There is no overfitting except in the case of Random Forest and Bagging.
1.8 Based on these predictions, what are the insights?
Solution-
Labour party supporters are present across all age groups, and because of this there can be a
positive perception about a strong economic condition, which can attract more voters in the future.
Many factors, such as household economic conditions, national economic conditions,
'Eurosceptic' sentiment and political knowledge, impact the decision of voters.
During bivariate analysis we observed that the overall number of female voters is higher than
that of male voters. Parties can try to attract more male supporters, which can increase their
vote banks.
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk
package in Python. We will be looking at the inaugural speeches of the following Presidents
of the United States of America: Roosevelt, Kennedy and Nixon.
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Solution-
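A sketch of these counts using the nltk inaugural corpus; the three file ids below (1941 Roosevelt, 1961 Kennedy, 1973 Nixon) are assumed, since the speech list is not reproduced above.

```python
# Characters, words and sentences per speech from the inaugural corpus.
import nltk
nltk.download("inaugural")
from nltk.corpus import inaugural

speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]  # assumed ids
for fid in speeches:
    raw = inaugural.raw(fid)
    print(fid,
          "characters:", len(raw),
          "words:", len(inaugural.words(fid)),
          "sentences:", len(inaugural.sents(fid)))
```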
2.2 Remove all the stopwords from all three speeches.
Solution- First we imported all the required packages and then cleaned the speeches.
Cleaning helps to get rid of unhelpful parts of the data, or noise, by converting all characters to
lowercase, removing punctuation marks and removing stop words.
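A sketch of this cleaning (lowercasing, punctuation removal via isalpha(), stopword removal); the file id is assumed as in the previous snippet.

```python
# Clean one speech: keep alphabetic tokens, lowercase them, drop stopwords.
import nltk
nltk.download("stopwords")
from nltk.corpus import inaugural, stopwords

stop_words = set(stopwords.words("english"))

def clean_speech(fileid):
    words = [w.lower() for w in inaugural.words(fileid) if w.isalpha()]
    return [w for w in words if w not in stop_words]

roosevelt_clean = clean_speech("1941-Roosevelt.txt")  # assumed file id
```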
Removal of Punctuation
Roosevelt Speech
Kennedy Speech
Nixon Speech
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stopwords).
Solution-
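A sketch of how the top three words can be obtained, reusing clean_speech() from the previous snippet (file ids assumed as before):

```python
# Three most frequent words per cleaned speech.
from collections import Counter

for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    print(fid, Counter(clean_speech(fid)).most_common(3))
```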
2.4 Plot the word cloud of each of the speeches of the variable. (After removing the
stopwords)
Solution-
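A sketch of the word clouds using the third-party wordcloud package, again reusing clean_speech() and the assumed file ids:

```python
# One word cloud per cleaned speech.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for fid in ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]:
    text = " ".join(clean_speech(fid))
    wc = WordCloud(width=800, height=400, background_color="white").generate(text)
    plt.figure(figsize=(8, 4))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(fid)
    plt.show()
```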
Word Cloud for Kennedy speech (after cleaning)!!
Word Cloud for Nixon speech (after cleaning)!!
----------------------------------------------------------THE END-----------------------------------------------------