Statistics
Chapter 4
Naive Bayes Algorithm
Objectives
• What is the working principle of the Naïve Bayes algorithm?
• What is the mathematics behind it?
• Advantages, disadvantages and applications of the Naïve Bayes algorithm
• Implementation of a spam mail classifier
• Encoding of categorical data
• Evaluation metrics of classification models
• How to export an ML model as a file
Probability
What is probability?
Probability is the extent to which an event is likely to occur, measured by the ratio of the favorable
cases to the whole number of cases possible.
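$$\text{Probability that event } E \text{ occurs} = \frac{\text{Number of events}}{\text{Total number of samples}}$$

$$P(E) = \frac{N(E)}{N(S)}$$

$$P(\text{loves soda}) = \frac{N(\text{loves soda})}{\text{Total number of family members}} = \frac{7}{14} = 0.5$$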
Naive Bayes Algorithm • Bemnet Girma
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
Example: What is the probability of meeting someone who loves both candy and soda, given that he/she loves candy?
$$P(\text{loves candy and soda} \mid \text{loves soda}) = \frac{P(\text{loves candy} \cap \text{loves soda})}{P(\text{loves soda})} = \frac{2/14}{7/14} = \frac{2}{7} \approx 0.2857$$
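The calculation above can be checked with a few lines of Python. This is a minimal sketch: the helper name `conditional_probability` is hypothetical, and the counts (14 family members, 7 who love soda, 2 who love both) are the ones used in the formula above.

```python
from fractions import Fraction


def conditional_probability(n_both: int, n_given: int, n_total: int) -> Fraction:
    """Return P(A | B) = P(A and B) / P(B), computed from raw counts."""
    p_both = Fraction(n_both, n_total)    # P(A and B)
    p_given = Fraction(n_given, n_total)  # P(B)
    return p_both / p_given


# 14 family members, 7 love soda, 2 love both candy and soda.
p = conditional_probability(n_both=2, n_given=7, n_total=14)
print(p, round(float(p), 4))
```

Using `Fraction` keeps the result exact (2/7) instead of a rounded decimal.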
Can we solve the above problem without knowing P(loves candy ∩ loves soda), if we already know the probability of meeting someone who loves both candy and soda given that he/she loves soda?
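$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(B \cap A)}{P(A)}$$

Since $P(A \cap B) = P(B \cap A)$, we have $P(A \mid B) \, P(B) = P(B \mid A) \, P(A)$, and therefore

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$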
“Probability of an event occurring based on prior knowledge of conditions that might be related to the event.”
Formula:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
Where:
P(A|B) | Posterior Probability | is the conditional probability of event A occurring given event B.
Definition of posterior: from the Latin word “posterus”, meaning coming after.
Example: You pick a random card and already know that it is a diamond. What is the probability that the card is a queen?
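$$P(\text{diamond}) = \frac{N(\text{diamond})}{\text{Total number of cards}} = \frac{13}{52} = \frac{1}{4}$$

$$P(\text{queen}) = \frac{N(\text{queen})}{\text{Total number of cards}} = \frac{4}{52} = \frac{1}{13}$$

$$P(\text{diamond} \cap \text{queen}) = \frac{N(\text{queen of diamonds})}{\text{Total number of cards}} = \frac{1}{52}$$

$$P(\text{diamond} \mid \text{queen}) = \frac{P(\text{diamond} \cap \text{queen})}{P(\text{queen})} = \frac{1/52}{1/13} = \frac{1}{4}$$

$$P(\text{queen} \mid \text{diamond}) = \frac{P(\text{diamond} \mid \text{queen}) \, P(\text{queen})}{P(\text{diamond})} = \frac{\frac{1}{4} \cdot \frac{1}{13}}{\frac{1}{4}} = \frac{1}{13}$$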
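The card example can also be verified by enumerating a standard 52-card deck. The following is a small sketch rather than part of the original notebook; the helper names (`prob`, `is_queen`, `is_diamond`) are hypothetical.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event) -> Fraction:
    """Probability of an event as favorable cards / total cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_queen(card) -> bool:
    return card[0] == "Q"


def is_diamond(card) -> bool:
    return card[1] == "diamonds"


p_diamond = prob(is_diamond)
p_queen = prob(is_queen)
p_both = prob(lambda card: is_queen(card) and is_diamond(card))

# Conditional probability, then Bayes' theorem.
p_diamond_given_queen = p_both / p_queen
p_queen_given_diamond = p_diamond_given_queen * p_queen / p_diamond

print(p_diamond, p_queen, p_both, p_diamond_given_queen, p_queen_given_diamond)
```

Bayes' theorem gives the same value as dividing the intersection by P(diamond) directly, which is a useful sanity check.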
2. Multinomial Naive Bayes: It is used when we have discrete data (e.g. movie ratings ranging from 1 to 5, where each rating has a certain frequency). In text learning, we use the count of each word to predict the class or label.
3. Gaussian Naive Bayes: Because it assumes that the features follow a normal distribution, Gaussian Naive Bayes is used when all our features are continuous. For example, the features of the Iris dataset are sepal width, petal width, sepal length and petal length. These widths and lengths can take any value within a range, so we cannot represent the features in terms of their occurrences. The data is continuous, hence we use Gaussian Naive Bayes here.
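As an illustration of the Gaussian variant, here is a minimal sketch that fits scikit-learn's `GaussianNB` on the Iris dataset mentioned above. The split size and `random_state` are arbitrary choices for this example, not values taken from the chapter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Four continuous features: sepal length, sepal width, petal length, petal width.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = GaussianNB()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```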
In [ ]: print("Hello World")
Pandas
Pandas is an open-source library that provides high-performance data manipulation in Python. It is used for data analysis in Python and was developed by Wes McKinney in 2008.
Definition of pandas: The name is derived from the term “panel data”.
Data analysis requires a lot of processing, such as reading, restructuring, cleaning or merging data. Different tools are available for fast data processing, such as NumPy, SciPy, Cython and Pandas. We prefer Pandas because working with it is fast, simple and often more expressive than working with the other tools.
To read and process our dataset we need to install and import the pandas library. pip is the package manager for Python packages; we can use it to install packages or libraries that do not come with Python.
Definition of pip: Pip Installs Packages
In [1]: # install pandas (make sure your internet connection is working)
        pip install pandas
Let us import the pandas library as pd, so that whenever we want to use pandas we can simply write pd instead of pandas.
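A minimal sketch of the import, followed by a version check. The version string printed on your machine depends on the pandas release you have installed.

```python
import pandas as pd

# Print the installed pandas version to confirm the import worked.
pandas_version = pd.__version__
print(pandas_version)
```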
Out[2]: '1.3.4'
Now let us import the dataset that we are going to use for training and testing our classifier model.
Download the dataset from here:
https://github.com/bemnetdev/Mini-Machine-Learning/blob/main/Naive%20Bayes/spam.csv
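Loading a CSV file with this layout can be sketched with `pd.read_csv`. To keep the example runnable without the download, it reads a tiny in-memory stand-in with the same two columns (Category, Message); the rows are illustrative only. With the real file you would pass its local path, for example `pd.read_csv("spam.csv")`, assuming it was saved under that name in the working directory.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file, same two-column layout.
sample_csv = io.StringIO(
    "Category,Message\n"
    'ham,"Sorry, I\'ll call later"\n'
    'spam,"You have won a free prize, reply now"\n'
    "ham,See you at class\n"
)

df = pd.read_csv(sample_csv)

# Show a few random rows, as a quick look at the data.
print(df.sample(2, random_state=0))
```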
Out[4]:
Category Message
13 ham I've been searching for the right words to tha...
2145 spam FreeMsg: Hey - I'm Buffy. 25 and love to satis...
2043 ham Me not waking up until 4 in the afternoon, sup
We can explore the general structure of our dataset using the following code:
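A minimal sketch of the pandas calls that produce this kind of summary: `info()`, `shape` and `describe()`. The small DataFrame is a hypothetical stand-in for the spam dataset, so its numbers differ from the outputs shown in this chapter.

```python
import pandas as pd

# Hypothetical stand-in with the same columns as the spam dataset.
df = pd.DataFrame(
    {
        "Category": ["ham", "spam", "ham", "ham"],
        "Message": [
            "Sorry, I'll call later",
            "You have won a free prize, reply now",
            "Sorry, I'll call later",
            "See you at class",
        ],
    }
)

df.info()                # column names, non-null counts and dtypes
print(df.shape)          # (number of rows, number of columns)
summary = df.describe()  # count, unique, top and freq for text columns
print(summary)
```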
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Category 5572 non-null object
1 Message 5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB
Out[6]: (5572, 2)
Out[8]:
Category Message
count 5572 5572
unique 2 5157
top ham Sorry, I’ll call later
freq 4825 30
We encode categorical data numerically because math is generally done using numbers. A big part of encoding is converting text to numbers. Machine learning algorithms likewise cannot process data that is not numerical.
Degree
0 Masters
1 Bachelors
2 Bachelors
3 Masters
4 Phd
Degree
0 2
1 1
2 1
3 2
4 3
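The two Degree tables above (before and after encoding) can be reproduced with an explicit mapping. This is one possible sketch, assuming the ordering Bachelors = 1, Masters = 2, Phd = 3 that the encoded table implies; the variable names are hypothetical.

```python
import pandas as pd

df_degree = pd.DataFrame(
    {"Degree": ["Masters", "Bachelors", "Bachelors", "Masters", "Phd"]}
)

# Explicit order-preserving mapping, inferred from the encoded table.
degree_order = {"Bachelors": 1, "Masters": 2, "Phd": 3}
df_degree["Degree"] = df_degree["Degree"].map(degree_order)

print(df_degree)
```

An explicit dictionary keeps the numeric order meaningful, which matters for ordinal categories such as degrees.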
OneHot Encoding
One-hot encoding is a process of converting categorical variables into binary (0 or 1) columns, one per category, so that they can be provided to machine learning algorithms to improve predictions. It is recommended for nominal categorical data.
Example: green, yellow, red, pink, black
Based on the above example, create a dataframe like the one below using a Color column.
Color
0 Green
1 Yellow
2 Red
3 Pink
4 Black
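One way to one-hot encode the Color column is `pandas.get_dummies`. This is a sketch under that assumption (the chapter's own cell is not shown); `dtype=int` is used so the columns contain 0/1 rather than booleans.

```python
import pandas as pd

df_color = pd.DataFrame({"Color": ["Green", "Yellow", "Red", "Pink", "Black"]})

# One new 0/1 column per distinct color.
one_hot = pd.get_dummies(df_color, columns=["Color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, in the column that matches its original color.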
Out[14]:
Dummy variables
Dummy variables are variables containing the values 0 or 1, representing the absence or presence of a categorical value, such as ham = 1, spam = 0.
The dummy variable trap arises when the dummy columns are perfectly collinear, that is, when one of them can be predicted from the others. The solution to the dummy variable trap is to drop one of the dummy variables: if there are m categories, use m - 1 dummy variables in the model.
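Dropping one dummy column can be sketched with the `drop_first=True` option of `pandas.get_dummies`. The three rows below are hypothetical; with two categories (ham, spam), only the `Category_spam` column remains.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Category": ["ham", "spam", "ham"],
        "Message": ["See you at class", "Win a free prize now", "Call me later"],
    }
)

# m = 2 categories, so drop_first=True keeps m - 1 = 1 dummy column.
df = pd.get_dummies(df, columns=["Category"], drop_first=True, dtype=int)

print(df)
```

Here `Category_spam` is 1 for spam and 0 for ham, because "ham" comes first alphabetically and is the dropped category.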
Out[15]:
Category_spam Message
Train-Test Split
Now, to train and evaluate our classifier, we need to randomly split our dataframe into a training set and a test set. The training set will be 80% of the whole dataset and the test set will be the remaining 20%. The model is trained using the training set and then evaluated using the test set. This is called the holdout method. To split our dataset we need to install a Python library called “scikit-learn”.
Scikit-learn (sklearn) is a widely used and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modelling, including classification, regression and clustering.
Here our feature (input) is the Message column and our label (output) is Category_spam. The following figure illustrates how our dataset will be split.
Label (Output) y    Feature (Input) X
0                   Message 1          \
1                   Message 2           |
0                   Message 3           |  80% of the dataset (training set): X_train, y_train
.                   .                   |
.                   .                   |
.                   .                  /
0                   Message 5571       \  20% of the dataset (test set): X_test, y_test
1                   Message 5572       /
In [17]: # Split the dataframe into X_train, X_test, y_train and y_test
         from sklearn.model_selection import train_test_split
         X_train, X_test, y_train, y_test = train_test_split(
             df.Message, df.Category_spam, test_size=0.2, random_state=10)
         print("Size of X_train = ", X_train.shape[0])
         print("Size of X_test = ", X_test.shape[0])
         print("Size of y_train = ", y_train.shape[0])
         print("Size of y_test = ", y_test.shape[0])
However, the train-test split above uses a purely random sampling method. This is generally fine if your dataset is large enough (especially relative to the number of attributes) and balanced with respect to the class labels, but if it is not, you run the risk of introducing a significant sampling bias.
What is sampling bias?
Sampling bias is a bias that occurs when some members of the intended population have a higher or lower probability of being selected than others. This results in a biased sample of the population, in which not all individuals were equally likely to have been selected, although a bias in the sample may not always invalidate the data gathered.
When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick 1,000 people randomly from a phone book. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 females and 487 males. This is called stratified sampling:
Out[18]: 0 0.865829
1 0.134171
Name: Category_spam, dtype: float64
Out[19]: 0 0.865937
1 0.134063
Name: Category_spam, dtype: float64
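In scikit-learn, stratified sampling can be requested through the `stratify` argument of `train_test_split`. The sketch below uses hypothetical, imbalanced labels (870 ham, 130 spam) rather than the real dataset, and compares the class proportions in the two resulting sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 870 ham (0) and 130 spam (1).
messages = pd.Series([f"message {i}" for i in range(1000)])
labels = pd.Series([0] * 870 + [1] * 130, name="Category_spam")

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=10, stratify=labels
)

# Class proportions should be nearly identical in both sets.
train_ratio = y_train.value_counts(normalize=True)
test_ratio = y_test.value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```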
Previously, we encoded our label (output) data as the integers 0 and 1. The text in the Message column also needs to be encoded (converted to integers), which we do using CountVectorizer.
CountVectorizer
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text. This is helpful when we have several such texts and wish to convert each of them into a vector of word counts for further text analysis. Let us consider a few sample texts from a document (each as a list element):
document = [ "Dear friend are you free now",
"Free gift you can win now",
"Hello friend you have class"]
             are  can  class  dear  free  friend  gift  have  hello  now  win  you
document[0]    1    0      0     1     1       1     0     0      0    1    0    1
document[1]    0    1      0     0     1       0     1     0      0    1    1    1
document[2]    0    0      1     0     0       1     0     1      1    0    0    1
Key Observations:
1. There are 12 unique words in the document, represented as columns of the table.
2. There are 3 text samples in the document, each represented as rows of the table.
3. Every cell contains a number, that represents the count of the word in that particular text.
4. All words have been converted to lowercase.
5. The words in columns have been arranged alphabetically.
Inside CountVectorizer, these words are not stored as strings. Rather, each is given a particular index value. In this case, ‘are’ would have index 0, ‘can’ would have index 1, ‘class’ would have index 2 and so on. The actual representation is shown in the table below. CountVectorizer returns it as a sparse matrix, a format that stores only the non-zero entries.
0 1 2 3 4 5 6 7 8 9 10 11
1 0 0 1 1 1 0 0 0 1 0 1
0 1 0 0 1 0 1 0 0 1 1 1
0 0 1 0 0 1 0 1 1 0 0 1
Out[21]: MultinomialNB()
Out[23]: 0.9932690150325331
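A minimal sketch of how a `MultinomialNB` classifier can be fitted on CountVectorizer output and then used for prediction. The four training messages and their labels are hypothetical, so any score from this toy example is unrelated to the accuracy value shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham.
train_messages = [
    "Free gift you can win now",
    "Win a free prize now",
    "Dear friend are you free now",
    "Hello friend you have class",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_messages)

model = MultinomialNB()
model.fit(X_train_counts, train_labels)

# New text must go through the SAME fitted vectorizer (transform, not fit_transform).
new_counts = vectorizer.transform(["win a free gift"])
prediction = model.predict(new_counts)
print(prediction)
```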
Types of Cross-Validation
Cross-validation has the following types:
1. Leave-One-Out Cross-Validation (LOOCV)
2. K-Fold Cross-Validation
Advantages of LOOCV?
An advantage of this method is that we make use of all data points, and hence it has low bias.
Disadvantages of LOOCV?
The major drawback of this method is that it leads to higher variance in the test results, as we are testing against one data point at a time. If that data point is an outlier, the variance can be even higher. Another drawback is that it takes a lot of execution time, as it iterates as many times as there are data points.
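Leave-one-out cross-validation can be sketched with scikit-learn's `LeaveOneOut` splitter. The Iris dataset and `GaussianNB` are stand-ins chosen to keep the example small; they are assumptions of this sketch, not the chapter's spam model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# One fold per data point: train on n - 1 samples, test on the remaining one.
loo_scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())

print("Number of iterations:", len(loo_scores))
print("Mean accuracy:", loo_scores.mean())
```

Each individual score is either 0 or 1, because every fold tests a single data point; only the mean is informative.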
We can take all the accuracies and compute their mean. The mean of all 5 accuracies is the estimated accuracy of your model.
By doing so, your accuracy estimate will not fluctuate as much as with a single train-test split. Moreover, you will get your model's minimum and maximum accuracy.
Does Cross-Validation improve accuracy?
K-fold cross-validation is all about estimating the accuracy, not improving it. If you increase the value of K, the estimate of your accuracy becomes more reliable, but the accuracy of the model itself does not improve.
It helps you choose the most suitable machine learning algorithm for your problem. You can estimate the performance of each machine learning algorithm and then, based on its accuracy, choose the best one for your problem.
How to choose the value of K?
The value of K should be chosen carefully. A poorly chosen K will result in high variance or high bias.
The value of K depends on the size of the data and on how much computational cost your system can afford. The higher the value of K, the more rounds or folds you need to perform.
So, before selecting the value of K, look at your data size and your system's computational power.
K = 10: This value has been found through experimentation. K = 10 usually results in an estimate with low bias and moderate variance. So if you are struggling to choose the value of K for your dataset, you can choose K = 10. This value is very common in the field of machine learning.
K = n: The value of K is n, where n is the size of the dataset. That means each record in the dataset is used once to test the model. This is exactly the Leave-One-Out approach.
5 ≤ K ≤ 10: There is no formal rule, but K is usually chosen to be 5 or 10.
NB: The time complexity of K-fold cross-validation is O(Kn), where n is the sample size and K is the (constant) number of folds.
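K-fold cross-validation with K = 10 can be sketched with `KFold` and `cross_val_score`. As a stand-in for the spam model, this example uses `GaussianNB` on the Iris dataset; the `random_state` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# 10 folds; shuffle because the Iris rows are ordered by class.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
print("Min / max accuracy:", scores.min(), scores.max())
```

The mean is the accuracy estimate, while the minimum and maximum show how much the result varies from fold to fold.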
Confusion Matrix
The confusion matrix is a matrix used to determine the performance of classification models for a given set of data. It can only be determined if the true values for the test data are known. The matrix itself is easy to understand, but the related terminology may be confusing. Since it shows the errors in the model's performance in the form of a matrix, it is also known as an error matrix. Some features of the confusion matrix are given below:
• For a binary classifier, the matrix is a 2×2 table; for a 3-class classifier, it is a 3×3 table, and so on.
• The matrix is divided into two dimensions, predicted values and actual values, along with the total number of predictions.
• Predicted values are the values predicted by the model, and actual values are the true values for the given observations. It looks like the table below:
• True Positive: The model has predicted Yes, and the actual value was also Yes.
• True Negative: The model has predicted No, and the actual value was also No.
• False Negative: The model has predicted No, but the actual value was Yes. It is also called a Type-II error.
• False Positive: The model has predicted Yes, but the actual value was No. It is also called a Type-I error.
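The four counts can be read from scikit-learn's `confusion_matrix`. The labels below are hypothetical (1 = Yes, 0 = No). For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]], so `ravel()` returns the counts in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = Yes, 0 = No).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```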
Out[28]: 0.9932690150325331
2. Misclassification rate: It is also termed the error rate, and it defines how often the model gives wrong predictions. The error rate is calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier. The formula is given below:

$$\text{Error Rate} = \frac{FP + FN}{TP + FP + FN + TN}$$
Out[29]: 0.006730984967466935
In a spam detection scenario, a False Positive means the mail is normal but is predicted as spam. We need to minimize the number of False Positives (Type-I errors), because the user might miss an important mail that was predicted as spam. Whenever FP matters more, we need to evaluate the precision of our model.
3. Precision / Positive Predictive Value: It can be defined as: out of all the instances that the model predicted as positive, how many were actually positive. It can be calculated using the formula below:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Out[29]: 0.9863013698630136
Suppose we are predicting whether a patient has cancer or not. Here a False Negative means the patient has cancer but is predicted as not having cancer. We need to minimize the number of False Negatives (Type-II errors), because a patient who is predicted as cancer-free might be lost to cancer. Whenever FN matters more, we need to evaluate the recall of our model.
4. Recall / Sensitivity / True Positive Rate: It is defined as: out of all the actual positive instances, how many our model predicted correctly. The recall should be as high as possible.

$$\text{Recall (TPR)} = \frac{TP}{TP + FN}$$
Out[30]: 0.9632107023411371
Whenever both FP and FN are highly important, we need to evaluate both the precision and the recall of our model using their harmonic mean, called the F1-score.
5. F-measure: If one model has low precision and high recall and another has the opposite, it is difficult to compare them. For this purpose we can use the F-score, which evaluates recall and precision at the same time. For a given arithmetic mean of precision and recall, the F-score is highest when the two are equal. It can be calculated using the formula below:

$$F_1\text{-score} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
In [31]: # training f1-score
         from sklearn.metrics import f1_score
         f1_score(y_train, y_train_pred)
Out[31]: 0.9746192893401014
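The formulas for error rate, precision, recall and F1-score can be checked against scikit-learn's metric functions. The labels below are hypothetical (1 = spam, 0 = ham) and are not taken from the spam dataset.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical actual and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed by hand from the four counts.
error_rate = (fp + fn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * recall * precision / (recall + precision)

# The same metrics from scikit-learn, for comparison.
print(error_rate, 1 - accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```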
Classification Report
A classification report is an overall performance evaluation summary in machine learning. It shows the precision, recall, F1-score and support for each target class, along with the overall accuracy.
You can refer to the following article to learn more about the classification report:
https://towardsdatascience.com/understanding-data-science-classification-metrics-in-scikit-learn-in-python-3bc336865019
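A short sketch of generating such a report with scikit-learn's `classification_report`. The labels are hypothetical, and the `target_names` values ("ham", "spam") are an assumption that label 0 means ham and label 1 means spam.

```python
from sklearn.metrics import classification_report

# Hypothetical actual and predicted labels (0 = ham, 1 = spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

report = classification_report(y_true, y_pred, target_names=["ham", "spam"])
print(report)
```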
Out[36]: 0.9883408071748879
Out[37]: 0.011659192825112075
Out[38]: 0.9657534246575342
Out[39]: 0.9463087248322147
Out[40]: 0.9559322033898304
There are other, more advanced performance evaluation metrics for classification models. Since they are beyond our scope, we do not cover them here. Please go through the following article to learn more.
ROC Curve: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Out[43]: MultinomialNB()
Out[44]: ['spam_detector_v2']
Out[45]: MultinomialNB()
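Exporting a trained model to a file and loading it back can be sketched with `joblib`. The toy training data and the file name `spam_detector_demo.joblib` are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham.
messages = ["Free gift you can win now", "Hello friend you have class"]
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_detector_demo.joblib")
    saved_files = joblib.dump(model, model_path)  # returns a list of file names
    restored = joblib.load(model_path)

# The restored model should make the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

To classify new messages later, the fitted CountVectorizer has to be saved as well, since the model only understands the word indices it was trained on.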
Key takeaways
• Probability is the chance that a given event will occur.
• Conditional probability is the probability of an event (A), given that another event (B) has already occurred.
• The Naïve Bayes classifier is an ML classifier based on probabilistic logic, namely Bayes' theorem.
• Bernoulli NB, Multinomial NB and Gaussian NB are types of Naïve Bayes classifier.
• To train and evaluate our classifier we need to split our dataframe into a training set and a test set.
• Sampling bias occurs when some members of the intended population are more or less likely to be selected than others.
• CountVectorizer is a tool to transform a given text into a vector based on the frequency of each word.
• Precision is the ratio of true positives to all instances that the model predicted as positive.
• Recall is the ratio of true positives to all actual positive instances.
• F1-score helps us to evaluate the recall and precision at the same time.
• ROC curve, Lift and Gain chart are other advanced performance metrics for classification models.
• A classification report summarizes performance (accuracy, precision, recall, F1, support) for each class.
• We can simply export our model using pickle or joblib and then use it later.