
Machine Learning Project: Name-Rasmita Mallick Date - 5 September 2021




Name- Rasmita Mallick

Date- 5th September 2021

Table of Contents

SL. No  Heading

1       Problem 1
        a) Problem 1.1
        b) Problem 1.2
        c) Problem 1.3
        d) Problem 1.4
        e) Problem 1.5
        f) Problem 1.6
        g) Problem 1.7
        h) Problem 1.8

2       Problem 2
        a) Problem 2.1
        b) Problem 2.2
        c) Problem 2.3
        d) Problem 2.4

Problem 1:

You are hired by one of the leading news channels CNBE who wants to analyse recent
elections. This survey was conducted on 1525 voters with 9 variables. You have to build a
model, to predict which party a voter will vote for on the basis of the given information, to
create an exit poll that will help in predicting overall win and seats covered by a particular party.

Data Dictionary:

Vote                      Party choice: Conservative or Labour.
Age                       In years.
Economic.cond.national    Assessment of current national economic conditions, 1 to 5.
Economic.cond.household   Assessment of current household economic conditions, 1 to 5.
Blair                     Assessment of the Labour leader, 1 to 5.
Hague                     Assessment of the Conservative leader, 1 to 5.
Europe                    An 11-point scale that measures respondents' attitudes toward European
                          integration. High scores represent ‘Eurosceptic’ sentiment.
Political.knowledge       Knowledge of parties' positions on European integration, 0 to 3.
Gender                    Female or Male.

1.1. Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it.


• As a first step, we imported all the necessary libraries and read the Excel file
(Election_Data.xlsx) into Python for further analysis.
• After loading the data, we inspected it using head(). We then dropped the unnamed column.

• The data consists of 1525 rows and 9 attributes: vote, age, economic.cond.national,
economic.cond.household, Blair, Hague, Europe, political.knowledge and gender.

• From the output we can see that there are 7 integer columns and 2 object columns.
• There are no null values in the given data.

• Unique values for the categorical variables.

• We can see a clear class imbalance in the target variable ‘vote’: the Labour class makes up
70% of the votes, whereas the Conservative class has only 30%.

• We made separate lists for the categorical columns and the numerical columns.

• describe() for numerical and categorical columns: the describe() method calculates summary
statistics such as percentiles, mean and standard deviation for the numerical values of a
Series or DataFrame.

• We then checked for duplicate rows using the duplicated() function and found 8 duplicate
rows in the data.

• We dropped all duplicated rows from the data. After this step there are no duplicates left in
the data.

• We also checked for skewness in the data. Only Hague and age are positively skewed; the rest
are negatively skewed.
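
The checks listed above can be sketched with pandas as follows. This is only an illustration: the small DataFrame is made up and stands in for the survey data that the report reads from Election_Data.xlsx, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up mini-sample using a few of the survey's column names.
# The values are for illustration only, not taken from the real data.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 43, 24],
    "Europe": [2, 5, 3, 2, 4],
    "gender": ["female", "male", "male", "female", "female"],
})

null_counts = df.isnull().sum()               # missing values per column
n_duplicates = int(df.duplicated().sum())     # rows identical to an earlier row
deduped = df.drop_duplicates()                # keep the first copy of each row
class_share = deduped["vote"].value_counts(normalize=True)  # class balance
skewness = deduped.select_dtypes(include="number").skew()   # per numeric column

print(null_counts, n_duplicates, class_share, skewness, sep="\n")
```

On the real data, the same calls would be applied to the DataFrame returned by pd.read_excel.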

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for outliers.


Exploratory data analysis (EDA) is an approach to analysing data sets to summarize their main
characteristics, often using statistical graphics and other data visualization methods.

Univariate Analysis

• We used distplot and boxplot for the univariate analysis. They help in understanding the
distribution of the data and in visualizing outliers as well as quartile positions.

• From the graphs we can see that there are outliers in the economic.cond.national and
economic.cond.household boxplots.
• The distributions show that most of the voters rate the national economic condition as
moderate.
• Voters have given Blair a higher rating than Hague.
• None of the attributes is normally distributed.
• The Europe distribution is left-skewed, which shows that most of the participants are aware of
the parties’ positions on European integration and hold ‘Eurosceptic’ sentiment.

Bivariate Analysis-

This is performed to find the relationship between each variable in the dataset and the target variable
of interest, or between any two variables.

• From the bar plot we can see that the numbers of male and female voters are close to equal in
the Labour class, whereas in the Conservative class there are more male voters than female
voters.

• From the plot we can see that most of the voters/participants rate the national economic
condition as 3 or 4. We can also observe that the largest number of voters are in the age
group of 30 to 70.

• We can see that none of the variables are highly correlated.

• We plotted a heatmap. It represents each value to be plotted with a shade of the same colour;
usually the darker shades of the chart represent higher values than the lighter shades. We can
see that most of the features are not highly correlated.
• There is no strong multicollinearity among the variables.
• The ratings of household economic condition and national economic condition have the highest
correlation in the whole data, i.e. 0.35.
• Blair and national economic condition have a correlation of 0.33.

• From the box plots, we can see that there are outliers in household economic condition and
national economic condition. Many machine learning algorithms are sensitive to outliers.
Hence, we need to treat them before further analysis.

These outlier values need to be treated. There are several ways of treating them, such as
dropping the outlier values or replacing them using the IQR.

• Here, we used the IQR method to treat the outliers in the given data. From the plot we can see
that the outliers in the data have been treated.
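
One common reading of "treating outliers using the IQR" is to cap values at the boxplot whiskers, Q1 - 1.5*IQR and Q3 + 1.5*IQR. The report does not show its exact code, so the function below is a sketch under that assumption; the function name and the ratings are made up for illustration.

```python
from statistics import quantiles

def cap_outliers_iqr(values):
    """Clip each value to the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up 1-to-5 ratings: the single rating of 1 lies below the lower whisker.
ratings = [1, 3, 3, 3, 4, 4, 4, 4, 5]
capped = cap_outliers_iqr(ratings)
print(capped)
```

Capping keeps every row in the data, whereas dropping the outlier rows would shrink the sample.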

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or
not? Data Split: Split the data into train and test (70:30).

Solution- Here, before splitting the data, we created dummy variables for the categorical variables.

We scaled the variables using min-max scaling, as the continuous variables have different weightage.

The concept of standardization/scaling comes into the picture when continuous independent
variables are measured on different scales. It means these variables do not contribute
equally to the analysis. When the range of values is very different in each column, we
need to scale them to a common level. The values are brought to a common level and then
we can use the data for further analysis.
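
Min-max scaling maps each value x to (x - min) / (max - min), so every scaled column lies between 0 and 1. A minimal sketch in plain Python, with a hypothetical function name and made-up ages:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [24, 35, 36, 43, 93]      # made-up ages in years
scaled = min_max_scale(ages)
print(scaled)
```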

• We then copied all the predictor variables into the X dataframe and the target into y.
• We split X and y into training and test sets in a 70:30 ratio.
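
The encoding and splitting steps might look like the sketch below, assuming pandas and scikit-learn. The DataFrame is made up, and drop_first=True and random_state=1 are assumptions, not settings confirmed by the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows with two string columns to encode.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 5,
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
    "gender": ["female", "male"] * 5,
})

# One dummy column per categorical variable (the first level is dropped).
encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

X = encoded.drop(columns="vote_Labour")   # predictors
y = encoded["vote_Labour"]                # target: True when the vote is Labour

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)
```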

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).


Apply Logistic Regression

Logistic regression is a linear model for classification rather than regression. It is also known
as logit regression. In this model, the probabilities describing the possible outcomes of a
single trial are modelled using a logistic function.

Note: Regularization is applied by default, which is common in machine learning but not in
statistics. Another advantage of regularization is that it improves numerical stability. No
regularization amounts to setting C to a very high value.
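
In scikit-learn, C is the inverse of the regularization strength, so a very large C approximates an unregularized fit. The sketch below uses synthetic data from make_classification rather than the election data; the sample size, C values and random_state are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Default-strength L2 regularization versus an almost unregularized model.
regularized = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)
nearly_unregularized = LogisticRegression(C=1e6, max_iter=1000).fit(X_train, y_train)

test_classes = regularized.predict(X_test)       # predicted class labels
test_probs = regularized.predict_proba(X_test)   # one probability per class
train_acc = regularized.score(X_train, y_train)
test_acc = regularized.score(X_test, y_test)
print(train_acc, test_acc)
```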

• Predicting on the training and test datasets and getting the predicted classes and probabilities.

Performance matrix on training data

Performance matrix on test data

AUC and ROC on training data

AUC and ROC on test data

Confusion matrix on training data

Confusion matrix on test data

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is an algorithm for classification predictive modeling problems.

An LDA model consists of statistical properties of the data, calculated for each class. For a single
input variable (x) this is the mean and the variance of the variable for each class. For multiple
variables, this is the same properties calculated over the multivariate Gaussian, namely the
means and the covariance matrix.
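
For a single input variable, these statistics are enough to classify a new point. The sketch below assumes the usual LDA simplification of one pooled variance shared by all classes and scores each class k with x*mean_k/var - mean_k^2/(2*var) + log(prior_k). The function names and data are made up for illustration.

```python
from math import log

def fit_lda_1d(xs, labels):
    """Estimate per-class means and priors plus a pooled within-class variance."""
    classes = sorted(set(labels))
    n = len(xs)
    means, priors, squared_dev = {}, {}, 0.0
    for c in classes:
        group = [x for x, lab in zip(xs, labels) if lab == c]
        means[c] = sum(group) / len(group)
        priors[c] = len(group) / n
        squared_dev += sum((x - means[c]) ** 2 for x in group)
    variance = squared_dev / (n - len(classes))
    return means, priors, variance

def predict_lda_1d(x, means, priors, variance):
    """Return the class with the largest linear discriminant score."""
    def score(c):
        return (x * means[c] / variance
                - means[c] ** 2 / (2 * variance)
                + log(priors[c]))
    return max(means, key=score)

xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]        # made-up feature values
labels = ["A", "A", "A", "B", "B", "B"]    # made-up class labels
means, priors, variance = fit_lda_1d(xs, labels)
print(predict_lda_1d(2.5, means, priors, variance))
```

With equal priors, the decision boundary falls halfway between the two class means.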

Performance matrix on training data

Performance matrix on test data

AUC and ROC on training data

AUC and ROC on test data

Confusion matrix on training data

Confusion matrix on test data

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.


Naïve Bayes Model

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’
theorem with the “naive” assumption of conditional independence between every pair of
features given the value of the class variable. Bayes’ theorem states the following
relationship, given class variable y and dependent feature vector x1 through xn.
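
In this notation, the standard statement of Bayes' theorem and the form it takes under the naive independence assumption are:

```latex
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The second line holds because the denominator does not depend on y, and the independence assumption lets the joint likelihood factor into one term per feature.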

We imported GaussianNB from sklearn.naive_bayes and fit the model to the training data.

Performance matrix on training data

Performance matrix on test data

We predicted the model score.

Training data accuracy score is 0.834 and test data accuracy score is 0.822.

There is not much difference between the training and test scores, which indicates that the model
is performing well on both sets.

AUC and ROC on training data

AUC and ROC on test data

Confusion matrix on training data

Confusion matrix on test data

K-Nearest Neighbors

Neighbors-based classification is a type of instance-based learning or non-generalizing
learning: it does not attempt to construct a general internal model, but simply stores instances
of the training data. Classification is computed from a simple majority vote of the nearest
neighbors of each point: a query point is assigned the data class which has the most
representatives within the nearest neighbors of the point.

Source: scikit-learn
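
The majority-vote rule can be written out in a few lines of plain Python. This is a sketch of the idea only, not scikit-learn's implementation; the function name, points and labels are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters of stored training instances.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["Labour", "Labour", "Labour",
          "Conservative", "Conservative", "Conservative"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

Nothing is fitted in advance: all the work happens at prediction time, which is why the method is called instance-based.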

We imported KNeighborsClassifier from sklearn.neighbors and fit the model to the training data.

Performance matrix on training data

Performance matrix on test data

We predicted the model score.

Training data accuracy score is 0.85 and test data accuracy score is 0.82.

There is not much difference between the training and test scores, which indicates that the model
is performing well on both sets.

AUC and ROC on Training data

AUC and ROC on Test data

Confusion matrix on training data

Confusion matrix on test data

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.


Random Forest

In random forests, each tree in the ensemble is built from a sample drawn with replacement
(i.e., a bootstrap sample) from the training set.

Furthermore, when splitting each node during the construction of a tree, the best split is found
either from all input features or a random subset of size max_features.

The purpose of these two sources of randomness is to decrease the variance of the forest
estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit.
The injected randomness in forests yields decision trees with somewhat decoupled prediction
errors. By taking an average of those predictions, some errors can cancel out. Random
forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a
slight increase in bias. In practice the variance reduction is often significant, hence yielding an
overall better model.

The scikit-learn implementation combines classifiers by averaging their probabilistic predictions.

Source: scikit-learn
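
A minimal sketch of fitting a random forest and comparing train and test accuracy, assuming scikit-learn. The data is synthetic with deliberately noisy labels, and the hyperparameter values are arbitrary rather than those used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y injects label noise so a perfect test score is unlikely.
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Each tree sees a bootstrap sample and a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
probs = forest.predict_proba(X_test)   # class probabilities averaged over trees
print(train_acc, test_acc)
```

Fully grown trees tend to memorize the training set, so a gap between the two scores is a useful overfitting check.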

We imported RandomForestClassifier from sklearn.ensemble and fit the model to the training
data set.

Performance Metrics: Random Forest

Here we can see that the model performs very well on the training data set but not as well on
the test data set: accuracy and F1 score drop on the test data.

AUC & ROC on training data

AUC & ROC on test data

Confusion matrix on training data

Confusion matrix on test data

Bagging Classifier

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random
subsets of the original dataset and then aggregates their individual predictions (either by voting
or by averaging) to form a final prediction.
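
A minimal sketch with scikit-learn's BaggingClassifier on synthetic data. The shallow decision tree used as the base classifier and all parameter values are assumptions for illustration; the report's own bagging setup may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Each of the 50 base trees is trained on a random 80% sample of the training rows.
bagging = BaggingClassifier(
    DecisionTreeClassifier(max_depth=4),
    n_estimators=50,
    max_samples=0.8,
    random_state=1,
)
bagging.fit(X_train, y_train)

train_acc = bagging.score(X_train, y_train)
test_acc = bagging.score(X_test, y_test)
print(train_acc, test_acc)
```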

Performance Matrix: Bagging

AUC & ROC on training data

AUC & ROC on test data

Confusion matrix on training data

Confusion matrix on test data

Ada Boosting

The module sklearn.ensemble includes the popular boosting algorithm AdaBoost, introduced
in 1995 by Freund and Schapire

The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only
slightly better than random guessing, such as small decision trees) on repeatedly modified
versions of the data. The predictions from all of them are then combined through a weighted
majority vote (or sum) to produce the final prediction.

The number of weak learners is controlled by the parameter n_estimators. The learning_rate
parameter controls the contribution of the weak learners in the final combination. By default,
weak learners are decision stumps. Different weak learners can be specified through the
base_estimator parameter. The main parameters to tune to obtain good results are
n_estimators and the complexity of the base estimators (e.g., its depth max_depth or
minimum required number of samples to consider a split min_samples_split).

Source: scikit-learn
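
A minimal AdaBoost sketch on synthetic data, showing n_estimators and learning_rate. The values are arbitrary and the default weak learner (a decision stump) is kept. Note that in scikit-learn 1.2 and later the base_estimator parameter is named estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Up to 100 weak learners; learning_rate shrinks each learner's contribution.
booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
booster.fit(X_train, y_train)

# Test accuracy after each boosting round, useful for choosing n_estimators.
staged_test_acc = list(booster.staged_score(X_test, y_test))
print(len(staged_test_acc), staged_test_acc[-1])
```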

Performance Metrics: Ada Boosting

AUC & ROC on training data

AUC & ROC on test data

Confusion matrix on Training data

Confusion matrix on test data

Gradient Boosting

Gradient boosting is a machine learning technique for regression and classification problems,
which produces a prediction model in the form of an ensemble of weak prediction models,
typically decision trees.

Performance Metrics: Gradient Boosting

AUC & ROC on training data

AUC & ROC on test data

Confusion matrix on training data

Confusion matrix on test data

Model Tuning

LDA- Grid Search

We imported GridSearchCV from sklearn.model_selection and fit the model to the data.

Performance Matrix on train and test dataset

AUC and ROC Curve on training and test dataset

Logistic Regression - Grid search

We fit the model to the data using grid search and found the best parameters and estimator.
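
A sketch of this step with GridSearchCV, using synthetic data. The parameter grid, cv=5 and the accuracy scoring are assumptions for illustration, since the report does not list the grid it searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: 4 values of C times 2 solvers gives 8 candidates.
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

best_params = search.best_params_      # best combination found by cross-validation
best_model = search.best_estimator_    # refitted on the whole training set
test_probs = best_model.predict_proba(X_test)
print(best_params, best_model.score(X_test, y_test))
```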

Getting the probabilities on the test set

Confusion and Classification matrix on the training data

AUC and ROC on training data

Confusion and Classification matrix on the test data

AUC and ROC on test data

KNN - Grid search

We fit the model to the data using grid search and found the best parameters and estimator.

Confusion and Classification matrix on the training data

Confusion and Classification matrix on the test data

AUC & ROC on training Data

AUC & ROC on Test Data

Gaussian Naive Bayes - Grid Search

We built the Gaussian Naive Bayes model using grid search and fit it to the data.

Confusion and Classification matrix on the training data

Confusion and Classification matrix on the test data

AUC & ROC on training Data

AUC & ROC on test Data

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is best.


Logit and LDA Performance analysis

• Both Logit and LDA have performed well on the training and test data.
• This also implies that the models are neither underfitted nor overfitted in either scenario.
• The accuracy score of both models is 0.83.
• The AUC score of both models is 0.89.
• For both models, the F1 score and recall are better for the Labour party than for the
Conservative party.
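
The per-class accuracy, precision, recall and F1 figures in this section all derive from confusion-matrix counts. A minimal sketch of the arithmetic, with made-up counts and a hypothetical function name:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)    # share of predicted positives that are correct
    recall = tp / (tp + fn)       # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts, not taken from the report's confusion matrices.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

Swapping which class counts as "positive" gives the figures for the other class, which is why precision, recall and F1 differ between Conservative and Labour while accuracy stays the same.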

                                  Accuracy      Precision     F1 Score      Recall        AUC
Model               Class         Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
Logit               Conservative  0.83   0.83   0.75   0.76   0.69   0.74   0.64   0.73   0.89   0.89
                    Labour        0.83   0.83   0.86   0.86   0.89   0.87   0.91   0.88   0.89   0.89
LDA                 Conservative  0.83   0.83   0.74   0.76   0.69   0.74   0.65   0.73   0.89   0.89
                    Labour        0.83   0.83   0.86   0.86   0.89   0.87   0.91   0.88   0.89   0.89

Naïve Bayes and KNN Performance analysis

• In the Naïve Bayes model, accuracy is 0.83 on the training set and 0.82 on the test set. The
scores are very close, which indicates that the model performs well on both sets and that there
is no overfitting or underfitting.
• The AUC score for the NB model is the same on the train and test sets, i.e. 0.90.
• In the KNN model, accuracy is 0.86 on the training set and 0.83 on the test set. The scores are
very close, which indicates that the model performs well on both sets and that there is no
overfitting or underfitting.
• The AUC score for the KNN model is the same on the train and test sets, i.e. 0.93.
• For both models, the F1 score, precision and recall are better for the Labour party than for
the Conservative party.
• Comparing the two models, we can see that KNN performs better than Naïve Bayes.
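
AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name, labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                 # made-up true classes
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]     # made-up predicted probabilities
auc = roc_auc(labels, scores)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly. Unlike accuracy, this value does not depend on the classification threshold.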

                                  Accuracy      Precision     F1 Score      Recall        AUC
Model               Class         Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
NB                  Conservative  0.83   0.82   0.72   0.74   0.71   0.73   0.69   0.73   0.9    0.9
                    Labour        0.83   0.82   0.88   0.87   0.88   0.87   0.89   0.87   0.9    0.9
KNN                 Conservative  0.86   0.83   0.77   0.76   0.75   0.73   0.72   0.71   0.93   0.93
                    Labour        0.86   0.83   0.89   0.86   0.9    0.87   0.91   0.89   0.93   0.93

Random Forest, Bagging and boosting performance analysis

• In the simple Random Forest, the scores are 100% for precision, recall, accuracy, F1 and AUC
on the training data set. However, the scores are poorer on the test set. This is a clear
example of overfitting.
• We then used Random Forest with the Bagging technique. The difference between the training
and test sets improves slightly.
• In adaptive boosting, the accuracy score is 0.85 on the train set and 0.81 on the test set.
• The AUC score for Ada boosting is 0.92 on both the train and test sets.
• In gradient boosting, the accuracy score is 0.89 on the train set and 0.83 on the test set.
• The AUC score for gradient boosting is 0.95 on both the train and test sets.
• For all 4 models, the F1 score, precision and recall are better for the Labour party than
for the Conservative party.

                                  Accuracy      Precision     F1 Score      Recall        AUC
Model               Class         Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
RF                  Conservative  1      0.83   1      0.78   1      0.73   1      0.69   1      1
                    Labour        1      0.83   1      0.85   1      0.88   1      0.9    1      1
Bagging             Conservative  0.97   0.83   0.98   0.78   0.94   0.73   0.9    0.68   1      1
                    Labour        0.97   0.83   0.96   0.85   0.98   0.88   0.99   0.9    1      1
Ada Boosting        Conservative  0.85   0.81   0.76   0.75   0.73   0.71   0.7    0.67   0.92   0.92
                    Labour        0.85   0.81   0.88   0.84   0.9    0.86   0.91   0.88   0.92   0.92
Gradient Boosting   Conservative  0.89   0.83   0.84   0.79   0.81   0.73   0.78   0.68   0.95   0.95
                    Labour        0.89   0.83   0.91   0.85   0.93   0.88   0.94   0.91   0.95   0.95

Model Tuning for LDA, Logit, KNN and Naïve Bayes

• Grid search is the simplest algorithm for hyperparameter tuning. We divide the domain of the
hyperparameters into a discrete grid, then try every combination of values on this grid,
calculating performance metrics using cross-validation.
• In the LDA grid search, the accuracy score is 0.83 and the AUC is 0.89 on both the train and
test sets.
• In the Logit grid search, the accuracy score is 0.84 on the training data and 0.83 on the test
data, and the AUC is 0.89 on both sets. This shows that the model is performing well and that
there is no overfitting.
• In the NB grid search, the accuracy score is 0.83 on the training data and 0.82 on the test
data, and the AUC is 0.90 on both sets.
• In the KNN grid search, the accuracy score is 0.84 and the AUC is 0.90 on both the train and
test sets.
• Even after model tuning, we do not see a drastic change in the scores.

                                  Accuracy      Precision     F1 Score      Recall        AUC
Model               Class         Train  Test   Train  Test   Train  Test   Train  Test   Train  Test
LDA - Grid Search   Conservative  0.83   0.83   0.74   0.76   0.69   0.75   0.65   0.74   0.89   0.89
                    Labour        0.83   0.83   0.87   0.87   0.88   0.88   0.9    0.88   0.89   0.89
Logit - Grid Search Conservative  0.84   0.83   0.76   0.76   0.69   0.73   0.63   0.71   0.89   0.89
                    Labour        0.84   0.83   0.86   0.86   0.89   0.87   0.92   0.88   0.89   0.89
KNN - Grid Search   Conservative  0.84   0.84   0.76   0.8    0.71   0.74   0.66   0.69   0.9    0.9
                    Labour        0.84   0.84   0.87   0.85   0.89   0.88   0.92   0.91   0.9    0.9
NB - Grid Search    Conservative  0.83   0.82   0.72   0.74   0.71   0.73   0.69   0.73   0.9    0.9
                    Labour        0.83   0.82   0.88   0.87   0.88   0.87   0.89   0.87   0.9    0.9

1.8 Based on these predictions, what are the insights?


• All models that performed well on the training data set have also performed well on the
test set.
• All the models except Random Forest and Bagging are good to be used for future predictions.
• The tuned models offer no clear improvement over the basic ones. So, the top 2 models for
prediction can be KNN and Gradient Boosting.
• Accuracy plays an important role in this case because, for this use case, both true positive
and true negative predictions are essential for predicting the election poll.
• Comparing all the models, the accuracy score of the KNN and Gradient Boosting models is good
on both the train and test sets, so they can be used for future predictions.
• There is no overfitting except for Random Forest and Bagging.

• Labour party supporters are found across all age groups, and because of this there can be a
positive perception of a strong economic condition. This can attract more voters in future.
• Many factors, such as household economic conditions, national economic conditions,
‘Eurosceptic’ sentiment and political knowledge, will impact the decision of voters.
• During the bivariate analysis we observed that the overall number of female voters is higher
than that of male voters. Parties can try to attract male supporters, which can increase their
vote banks.

Problem 2:

In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United
States of America:

1. President Franklin D. Roosevelt in 1941

2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.


Number of Characters in Roosevelt document is: 7571

Number of Characters in Kennedy document is: 7618
Number of Characters in Nixon document is: 9991

Number of words in Roosevelt document is: 1360

Number of words in Kennedy document is: 1390
Number of words in Nixon document is: 1819

Number of sentences in Roosevelt document is 67.

Number of sentences in Kennedy document is 52.

Number of sentences in Nixon document is 68.

2.2 Remove all the stopwords from all three speeches.

Solution- First we imported all the required packages and then cleaned the speeches.

Cleaning and Removing Noise

Cleaning helps to get rid of unhelpful parts of the data, or noise, by converting all characters
to lowercase, removing punctuation marks, and removing stop words and typos.

• Converting all the words to lower case

We can see the texts are converted into lower case.

• Removal of punctuation

• Removal of stop words

• Total word count after removing stop words
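
The three cleaning steps can be sketched in plain Python as follows. The stop-word set here is a tiny made-up stand-in; the report most likely used NLTK's English stop-word list, which is much larger and would remove more words than this example does.

```python
import string

# Tiny illustrative stop-word set, not NLTK's actual list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "we", "our", "that"}

def clean(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

sentence = "We observe today not a victory of party, but a celebration of freedom."
tokens = clean(sentence)
print(tokens, len(tokens))
```

The length of the returned list is the word count after stop-word removal.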

Roosevelt Speech

Kennedy Speech

Nixon Speech

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (After removing the stopwords).


Roosevelt Speech top 3 words-

Kennedy Speech top 3 words-

Nixon Speech top 3 words-
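
Once a speech has been reduced to a list of cleaned tokens, the three most frequent words can be found with collections.Counter. The token list below is made up for illustration and is not taken from any of the three speeches.

```python
from collections import Counter

# Made-up cleaned tokens standing in for one speech after stop-word removal.
tokens = ["people", "nation", "people", "world", "nation",
          "people", "hope", "world", "nation", "people"]

top_three = Counter(tokens).most_common(3)   # list of (word, count) pairs
print(top_three)
```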

2.4 Plot the word cloud of each of the speeches of the variable. (After removing the stopwords).


• Word cloud for Roosevelt speech (after cleaning)

• Word cloud for Kennedy speech (after cleaning)

• Word cloud for Nixon speech (after cleaning)

----------------------------------------------------------THE END-----------------------------------------------------
