Naive Bayes Algorithm
Bayes’ Theorem
Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred. Bayes' theorem is stated mathematically as the following equation:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

With regard to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = \frac{P(X|y)\,P(y)}{P(X)}

where y is the class variable and X is a dependent feature vector (of size n):

X = (x_1, x_2, x_3, \ldots, x_n)
Applying the naive assumption that the features are mutually independent, both the likelihood and the evidence factorize into per-feature terms. Hence, we reach the result:

P(y|x_1, \ldots, x_n) = \frac{P(x_1|y)\,P(x_2|y)\cdots P(x_n|y)\,P(y)}{P(x_1)\,P(x_2)\cdots P(x_n)}

which can be expressed as:

P(y|x_1, \ldots, x_n) = \frac{P(y)\prod_{i=1}^{n} P(x_i|y)}{P(x_1)\,P(x_2)\cdots P(x_n)}
Now, as the denominator remains constant for a given input, we can remove that term:

P(y|x_1, \ldots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i|y)

To build a classifier, we find the probability of the given set of inputs for all possible values of the class variable y and pick the output with the maximum probability. This can be expressed mathematically as:

y = \underset{y}{\mathrm{argmax}}\; P(y)\prod_{i=1}^{n} P(x_i|y)
For example, consider a "play golf" dataset with the features Outlook, Temperature, Humidity and Wind, and suppose we want to classify a new instance:

today = (Sunny, Hot, Normal, False)

The probability of playing golf is given by:

P(Yes|today) = \frac{P(Sunny\,Outlook|Yes)\,P(Hot\,Temperature|Yes)\,P(Normal\,Humidity|Yes)\,P(No\,Wind|Yes)\,P(Yes)}{P(today)}

and the probability of not playing golf is given by:

P(No|today) = \frac{P(Sunny\,Outlook|No)\,P(Hot\,Temperature|No)\,P(Normal\,Humidity|No)\,P(No\,Wind|No)\,P(No)}{P(today)}
Since P(today) is common to both probabilities, we can ignore P(today) and find the proportional probabilities as:

P(Yes|today) \propto \frac{3}{9}\cdot\frac{2}{9}\cdot\frac{6}{9}\cdot\frac{6}{9}\cdot\frac{9}{14} \approx 0.02116

and

P(No|today) \propto \frac{3}{5}\cdot\frac{2}{5}\cdot\frac{1}{5}\cdot\frac{2}{5}\cdot\frac{5}{14} \approx 0.0068
Now, since

P(Yes|today) + P(No|today) = 1

these numbers can be converted into probabilities by making their sum equal to 1 (normalization):

P(Yes|today) = \frac{0.02116}{0.02116 + 0.0068} \approx 0.76

and

P(No|today) = \frac{0.0068}{0.02116 + 0.0068} \approx 0.24

Since P(Yes|today) > P(No|today), the prediction is that golf would be played, i.e. "Yes".
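The same arithmetic can be checked with a few lines of Python, reusing the proportional values computed above:

# Proportional (unnormalized) posteriors from the play-golf example above
p_yes = (3/9) * (2/9) * (6/9) * (6/9) * (9/14)   # ~0.02116
p_no = (3/5) * (2/5) * (1/5) * (2/5) * (5/14)    # ~0.0068

# Normalize so the two posteriors sum to 1
total = p_yes + p_no
print("P(Yes | today) ~", round(p_yes / total, 2))  # ~0.76
print("P(No  | today) ~", round(p_no / total, 2))   # ~0.24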
Gaussian Naive Bayes
The approach above works for discrete (categorical) features. When a feature is continuous, Gaussian Naive Bayes assumes that the values associated with each class follow a normal (Gaussian) distribution, so the conditional probability is given by:

P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\,\exp\!\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)
Python
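A minimal sketch of such a classifier, assuming scikit-learn's GaussianNB and the iris dataset with a 60/40 train/test split; it yields an accuracy close to the figure shown below:

# Gaussian Naive Bayes on the iris dataset (dataset and split are assumptions)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

X, y = load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Train the classifier and predict on the held-out test set
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)

print("Gaussian Naive Bayes model accuracy(in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)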
Output:
Gaussian Naive Bayes model accuracy(in %): 95.0
Multinomial Naive Bayes
Feature vectors represent the frequencies with which certain
events have been generated by a multinomial distribution. This is
the event model typically used for document classification.
Bernoulli Naive Bayes
In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing the inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e., whether a word occurs in a document or not) are used rather than term frequencies (i.e., how often a word occurs in the document).
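As an illustration, here is a minimal sketch of both event models on a tiny, made-up set of documents, using scikit-learn's CountVectorizer, MultinomialNB and BernoulliNB (the texts and labels are hypothetical):

# Toy document classification with multinomial and Bernoulli Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["free prize money now", "meeting schedule for monday",
        "win money free offer", "project status meeting notes"]   # hypothetical texts
labels = [1, 0, 1, 0]                                             # 1 = spam, 0 = not spam

# Multinomial model: features are word counts (term frequencies)
count_vec = CountVectorizer().fit(docs)
mnb = MultinomialNB().fit(count_vec.transform(docs), labels)

# Bernoulli model: features are binary word occurrences
binary_vec = CountVectorizer(binary=True).fit(docs)
bnb = BernoulliNB().fit(binary_vec.transform(docs), labels)

test = ["free money offer"]
print(mnb.predict(count_vec.transform(test)))   # expected: [1]
print(bnb.predict(binary_vec.transform(test)))  # expected: [1]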
Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features, the data is assumed to come from normal distributions.
Disadvantages of Naive Bayes Classifier
Assumes that features are independent, which may not
always hold in real-world data.
Can be influenced by irrelevant attributes.
May assign zero probability to unseen events, leading to
poor generalization.
Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-spam based on features.
Text Classification: Used in sentiment analysis,
document categorization, and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of
a disease based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals
for loan approval.
Weather Prediction: Classifies weather conditions
based on various factors.
As we reach the end of this article, here are some important points to ponder:
In spite of their apparently over-simplified assumptions,
naive Bayes classifiers have worked quite well in many
real-world situations, famously document classification
and spam filtering. They require a small amount of
training data to estimate the necessary parameters.
Naive Bayes learners and classifiers can be extremely fast
compared to more sophisticated methods. The decoupling
of the class conditional feature distributions means that
each distribution can be independently estimated as a
one dimensional distribution. This in turn helps to
alleviate problems stemming from the curse of
dimensionality.
Conclusion
In conclusion, Naive Bayes classifiers, despite their simplified
assumptions, prove effective in various applications, showcasing
notable performance in document classification and spam
filtering. Their efficiency, speed, and ability to work with limited
data make them valuable in real-world scenarios, compensating
for their naive independence assumption.
Frequently Asked Questions on Naive Bayes
Classifiers
What is Naive Bayes real example?
Naive Bayes is a simple probabilistic classifier based on Bayes’
theorem. It assumes that the features of a given data point are
independent of each other, which is often not the case in reality.
However, despite this simplifying assumption, Naive Bayes has
been shown to be surprisingly effective in a wide range of
applications.
Why is it called Naive Bayes?
Naive Bayes is called “naive” because it assumes that the
features of a data point are independent of each other. This
assumption is often not true in reality, but it does make the
algorithm much simpler to compute.
What is an example of a Bayes classifier?
A Bayes classifier is a type of classifier that uses Bayes’ theorem
to compute the probability of a given class for a given data point.
Naive Bayes is one of the most common types of Bayes
classifiers.
What is better than Naive Bayes?
There are several classifiers that are better than Naive Bayes in
some situations. For example, logistic regression is often more
accurate than Naive Bayes, especially when the features of a data
point are correlated with each other.
Can Naive Bayes probability be greater than 1?
No, the probability of an event cannot be greater than 1. The
probability of an event is a number between 0 and 1, where 0
indicates that the event is impossible and 1 indicates that the
event is certain.
In this article, we will discuss the mathematical intuition behind Naive Bayes classifiers, and we will also see how to implement them in Python.
This model is easy to build and is mostly used for large datasets. It is a probabilistic machine learning model that is used for classification problems. The core of the classifier depends on Bayes' theorem with an assumption of independence among the predictors: it is called "naive" because it treats the features as independent even when they may not be. In a real-world scenario, there is hardly any situation where the features are truly independent of one another.
Naive Bayes does seem to be a simple yet powerful algorithm. But why is it so popular? Since it is a probabilistic approach, predictions can be made very quickly, so it copes well with large datasets.
Before we dive deeper into this topic, we need to understand what conditional probability is and what Bayes' theorem says.
The Naive Bayes algorithm is a popular and simple classification algorithm. It is a simple but powerful method in machine learning for guessing the categories of things: imagine sorting emails into spam or inbox. Naive Bayes looks at each word (like a clue) and predicts how likely an email is to be spam based on past emails. It assumes these words aren't connected (not always true!), but it's fast and works well, making it a popular choice for many tasks.
Conditional Probability
Suppose I ask you to pick a card from the deck and find the probability of getting a king, given that the card drawn is a club. Observe carefully that here I have mentioned a condition: the card is a club. Now, while calculating the probability, my denominator will not be 52; instead, it will be 13, because there are only 13 clubs in a deck. Since we have only one king in clubs, the probability of getting a king given that the card is a club is 1/13.
Consider a random experiment of tossing 2 coins. The sample space here will be:

S = {HH, HT, TH, TT}

If a person is asked to find the probability of getting at least one tail, his answer would be 3/4 = 0.75.
Now suppose this same experiment is performed by another person, but now we give him the condition that both the coins should have heads. This means that if event A, "both the coins should have heads", has happened, then the elementary outcomes {HT, TH, TT} could not have happened. Hence, in this situation, the probability of getting a tail is 0.
From the above examples, we observe that the probability may change if some additional information is given to us. This is exactly the case while building any machine learning model: we need to find the output given some features.
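Formally, the conditional probability of A given B is the probability of both events divided by the probability of B; applying it to the card example above:

P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(King|Clubs) = \frac{P(King \cap Clubs)}{P(Clubs)} = \frac{1/52}{13/52} = \frac{1}{13}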
Bayes' Rule
Now we are prepared to state one of the most useful results in conditional probability: Bayes' rule. Bayes' theorem, named after Thomas Bayes and published in 1763, provides a means for calculating the probability of an event given some information:

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

Basically, we are trying to find the probability of event A, given that event B is true. Here P(A) is called the prior probability, which means it is the probability of the event before the evidence is seen, and P(A|B) is the posterior probability, i.e., the probability of the event after the evidence is seen.
Bayes’ rule provides us with the formula for the probability of Y given some
feature X. In real-world problems, we hardly find any case where there is only
one feature.
When the features are independent, we can extend Bayes' rule to what is called Naive Bayes, which assumes that the features are independent, meaning that changing the value of one feature does not influence the values of the other features. Naive Bayes can be used for various tasks like face recognition, weather prediction, medical diagnosis, text classification, and more.
When there are multiple X variables, we simplify the rule by assuming that the X's are independent, so:

P(y|x_1, \ldots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i|y)

Since the denominator is the same for every class, it is your choice whether you want to remove it or not; removing the denominator will help you save on computation. There are a whole lot of formulas mentioned here, but worry not, we will try to understand them with an example.
· All the variables are independent. That is, if the animal is a Dog, that doesn't mean its Size will also be Medium.
· All the predictors have an equal effect on the outcome. That is, the animal being a Dog does not have more importance in deciding whether we can pet it or not.
We should try to apply the Naive Bayes classifier formula to a dataset of this kind (animals described by their Type, Size and Color). We also need the class probabilities P(y), which are calculated from the frequency tables of the training data. Now, if we send in our test data, suppose test = (Cow, Medium, Black), we compute P(Yes|test) and P(No|test) in the same way as in the golf example, and we know that P(Yes|test) + P(No|test) = 1. We see here that P(Yes|test) > P(No|test), so the prediction that we can pet this animal is "Yes".
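Here is a minimal sketch in Python of the same by-hand computation, using a small, entirely hypothetical pet dataset with features Type, Size and Color (the rows and counts are made up for illustration):

from collections import Counter

# Hypothetical training data: (Type, Size, Color) -> Pet? (Yes/No)
rows = [
    ("Dog", "Medium", "Black", "Yes"), ("Dog", "Small", "White", "Yes"),
    ("Cow", "Big", "Brown", "No"), ("Cow", "Medium", "Black", "Yes"),
    ("Cat", "Small", "Black", "Yes"), ("Cow", "Big", "White", "No"),
]
classes = ["Yes", "No"]
class_counts = Counter(r[-1] for r in rows)

def conditional(feature_index, value, label):
    # P(feature = value | class = label) estimated from raw counts
    in_class = [r for r in rows if r[-1] == label]
    matches = sum(1 for r in in_class if r[feature_index] == value)
    return matches / len(in_class)

test = ("Cow", "Medium", "Black")
scores = {}
for label in classes:
    score = class_counts[label] / len(rows)    # P(y)
    for i, value in enumerate(test):           # product of P(x_i | y)
        score *= conditional(i, value, label)
    scores[label] = score

# Normalize so the posteriors sum to 1, then pick the larger one
total = sum(scores.values())
posteriors = {k: (v / total if total else 0.0) for k, v in scores.items()}
print(posteriors, "->", max(posteriors, key=posteriors.get))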
So far we have seen how to compute the probabilities when the predictors take discrete values. But what if they are continuous? For this, we need to make some more assumptions regarding the distribution of each feature. The different Naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of P(x_i|y).
Gaussian Naive Bayes is used when we assume that all the continuous variables follow a normal (Gaussian) distribution. The conditional probability changes here since we have continuous values now. The probability density function (PDF) of a normal distribution is given by:

P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\,\exp\!\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right)

We can use this formula to compute the likelihoods if our data is continuous.
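As a quick illustration, the PDF above can be evaluated directly from a feature value and the per-class mean and variance (the numbers below are made up):

import math

def gaussian_likelihood(x, mean, var):
    # P(x_i | y) for a normal distribution with the class's mean and variance
    return (1.0 / math.sqrt(2 * math.pi * var)) * math.exp(-((x - mean) ** 2) / (2 * var))

# Hypothetical numbers: feature value 66, class mean 73, class variance 6.2**2
print(gaussian_likelihood(66, 73, 6.2 ** 2))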
Endnotes
Naive Bayes algorithms are mostly used in face recognition, weather prediction, spam filtering, medical diagnosis, and text classification. You have already taken your first step towards mastering this algorithm; all that is left now is practice.
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.
Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects. Some popular unsupervised learning algorithms are listed below:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Linear Regression:
Linear regression models the relationship between a dependent variable y and an independent variable x with a straight line:

y = a_0 + a_1 x + \varepsilon

where a_0 is the intercept, a_1 is the linear regression coefficient (the slope of the line), and ε is the random error.
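A minimal sketch of fitting such a line with scikit-learn's LinearRegression (the data points are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data roughly following y = 2 + 3x plus noise
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

model = LinearRegression().fit(x, y)
print("a0 (intercept):", model.intercept_)
print("a1 (coefficient):", model.coef_[0])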
Logistic Regression:
o Logistic regression is one of the most popular machine learning algorithms that come under supervised learning techniques.
o It can be used for Classification as well as for Regression problems, but
mainly used for Classification problems.
o Logistic regression is used to predict the categorical dependent variable
with the help of independent variables.
o The output of a logistic regression model can only be between 0 and 1.
o Logistic regression can be used where the probability of one of two classes is required, such as whether it will rain today or not (0 or 1, true or false, etc.).
o Logistic regression is based on the concept of Maximum Likelihood
estimation. According to this estimation, the observed data should be
most probable.
o In logistic regression, we pass the weighted sum of inputs through an activation function that can map values to between 0 and 1. Such an activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve; a minimal sketch of this function is given below.
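A minimal sketch of the sigmoid function described above:

import math

def sigmoid(z):
    # Map the weighted sum of inputs z to a value between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

# The output approaches 0 for large negative z and 1 for large positive z
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))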
K-Means Clustering
K-means is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster, and continue until the assignments stop changing. (A minimal code sketch of these steps is given below.)
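As referenced above, here is a minimal NumPy sketch of these steps; it is an illustration under simplifying assumptions (toy data, Euclidean distance, a fixed iteration cap), not the scikit-learn implementation used later:

import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step-3: assign each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step-5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy 2-D data (two obvious blobs), K = 2
data = np.array([[1, 2], [1.5, 1.8], [1, 0.6], [8, 8], [9, 11], [8, 9.5]])
labels, centers = kmeans(data, k=2)
print(labels)
print(centers)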
Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take the number of clusters k, i.e., K = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting two points as the K points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point
or centroid. We will compute it by applying some mathematics that we
have studied to calculate the distance between two points. So, we will
draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o Next, we will reassign each datapoint to the new centroid. For this, we will
repeat the same process of finding a median line. The median will be like
below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and
the two final clusters will be as shown in the below image:
How to choose the value of "K number of clusters" in
K-means Clustering?
The performance of the K-means clustering algorithm depends upon the highly efficient clusters that it forms. But choosing the optimal number of
clusters is a big task. There are some different ways to find the optimal
number of clusters, but here we are discussing the most appropriate
method to find the number of clusters or value of K. The method is given
below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of WCSS
value. WCSS stands for Within Cluster Sum of Squares, which defines
the total variations within a cluster. The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = \sum_{P_i \in Cluster_1} distance(P_i, C_1)^2 + \sum_{P_i \in Cluster_2} distance(P_i, C_2)^2 + \sum_{P_i \in Cluster_3} distance(P_i, C_3)^2

Here, \sum_{P_i \in Cluster_1} distance(P_i, C_1)^2 is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use
any method such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on the given dataset for different K values (e.g., ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is taken as the best value of K.
Since the graph shows the sharp bend, which looks like an elbow, hence it
is known as the elbow method. The graph for the elbow method looks like
the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose
the number of clusters equal to the data points, then the value of WCSS becomes zero,
and that will be the endpoint of the plot.
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
Step-1: Data Pre-processing
Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
By executing the dataset-import line (a sketch is given below), we will get our dataset in the Spyder IDE.
Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features, also shown in the sketch below.
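A minimal sketch of this step; the file name Mall_Customers_data.csv and the column indices [3, 4] (annual income and spending score) are assumptions based on how the data is described later in this section:

# Importing the dataset (file name assumed)
dataset = pd.read_csv('Mall_Customers_data.csv')

# Matrix of features: the annual income and spending score columns (indices assumed)
x = dataset.iloc[:, [3, 4]].values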
Step-2: Finding the optimal number of clusters using the elbow method
As we know, the elbow method uses the WCSS concept to draw the plot by plotting the WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS for different k values ranging from 1 to 10. Below is the code for it:
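A minimal sketch consistent with the surrounding description (KMeans from sklearn.cluster, k from 1 to 10; the init and random_state values are assumptions):

# Finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # WCSS value for each number of clusters

for i in range(1, 11):  # range(1, 11) so that k = 10 is included
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted model

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()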
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.
After that, we have initialized the for loop to iterate over values of k ranging from 1 to 10; since range() in Python excludes the upper bound, it is written as 11 so that the value 10 is included.
The rest part of the code is similar as we did in earlier topics, as we have
fitted the model on a matrix of features and then plotted the graph
between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number
of clusters here will be 5.
Step-3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on
the dataset.
To train the model, we will use the same two lines of code as we have
used in the above section, but here instead of using i, we will use 5, as we
know there are 5 clusters that need to be formed. The code is given
below:
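A minimal sketch consistent with the description above (the init and random_state values are assumptions carried over from the elbow-method sketch):

# Training the K-means model with the chosen number of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)  # cluster label (0-4) for each customer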
The first line is the same as above for creating the object of KMeans class.
By executing the above lines of code, we will get the y_predict variable.
We can check it under the variable explorer option in the Spyder IDE.
We can now compare the values of y_predict with our original dataset.
Consider the below image:
From the above image, we can now relate that CustomerID 1 belongs to cluster 3 (as the index starts from 0, a predicted label of 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualizing the Clusters
The last step is to visualize the clusters. As we have 5 clusters for our
model, so we will visualize each cluster one by one.
To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib; a sketch is given below.
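A minimal sketch consistent with the description that follows; the specific colors and label strings are illustrative choices:

# Visualizing the five clusters and their centroids
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income')
mtp.ylabel('Spending Score')
mtp.legend()
mtp.show()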
In the above lines of code, we have written a scatter call for each cluster, ranging from 1 to 5. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects from the matrix of features the x values of the points whose predicted label is 0, and the second argument selects the corresponding y values; the same pattern is repeated for the other clusters.
Output:
The output image clearly shows the five different clusters with different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customer has a high income but low spending, so we
can categorize them as careful.
o Cluster3 shows the low income and also low spending so they can be
categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so
they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending so they
can be categorized as target, and these customers can be the most
profitable customers for the mall owner.