Performance Comparison and Implementation of Bayesian Variants for Network Intrusion Detection

1st Tosin Ige, Department of Computer Science, University of Texas at El Paso, Texas, USA
2nd Christopher Kiekintveld, Department of Computer Science, University of Texas at El Paso, Texas, USA

Abstract—Bayesian classifiers perform well when each of the features is completely independent of the others, which is not always valid in real-world applications. The aim of this study is to implement and compare the performance of each variant of the Bayesian classifier (Multinomial, Bernoulli, and Gaussian) on anomaly detection in network intrusion, and to investigate whether there is any association between each variant's assumption and its performance. Our investigation showed that each variant of the Bayesian algorithm blindly follows its assumption regardless of feature properties, and that the assumption is the single most important factor that influences its accuracy. Experimental results show that Bernoulli has an accuracy of 69.9% test (71% train), Multinomial has an accuracy of 31.2% test (31.2% train), while Gaussian has an accuracy of 81.69% test (82.84% train). Going deeper, we found that each variant's performance and accuracy is largely due to its assumption: the Gaussian classifier performed best on anomaly detection due to its assumption that features follow normal distributions, which are continuous, while the Multinomial classifier had a dismal performance as it simply assumes a discrete, multinomial distribution.

Keywords—anomaly detection, multinomial bayes, Bernoulli bayes, gaussian bayes, Bayesian classifier, intrusion detection

I. INTRODUCTION

Security is indispensable and crucial in the modern information technology framework [2], [4], [5], [16], [18], and so we have come to grapple with the fact that there is no perfect system: no matter how sophisticated or state-of-the-art a system may be, it can be attacked and compromised. With hackers constantly coming up with ever-changing, innovative, and highly sophisticated ways to compromise systems, focus has shifted to making state-of-the-art systems extremely complicated and tedious to compromise, since there cannot be a perfect system. Before any system can be compromised, there must be an intrusion for any damage to occur [1], [9], [12], [15]. It is one thing for a system to be intruded upon; it is another for the intrusion to be immediately detected and dealt with before any compromise is made. An intrusion that lasts for about fifteen (15) milliseconds before being dealt with by a combination of machine learning (to accurately detect the actual intrusion) and game theory (changing parameters and configurations to prevent further attack) gives an insight into system perfection.

Naïve Bayesian algorithms are some of the most important classifiers used for prediction. Bayesian classifiers are based on probability, with the general assumption that all features are independent of each other, which does not usually hold in the real world; this assumption accounts for why the Naïve Bayes algorithm performs poorly on certain classification tasks. This general assumption is in addition to the individual assumption of each variant of the Naïve Bayes classifier:

i. Multinomial Naïve Bayes
ii. Bernoulli Naïve Bayes
iii. Gaussian Naïve Bayes

Each variant has different assumptions, which impact its efficiency and accuracy on certain tasks. Virtually all existing comparisons and evaluations of Bayesian classifiers against other algorithms are made without acknowledging that each variant works based on a different assumption, which affects its efficiency and accuracy depending on the type of classification. Since each Bayesian variant performs differently, it is interesting to understand how each algorithm performs on an intrusion dataset, and to understand why some intrusions are not detected by a model until the system is compromised when the wrong Bayesian variant is adopted.

The main contributions of our research are stated below.

• We showed that the Gaussian Naïve Bayes algorithm performed best among the three variants of the Bayesian algorithm on anomaly detection in network intrusion, in terms of efficiency and accuracy on the KDD dataset, followed by Bernoulli with 69.9% test accuracy, while Multinomial had an abysmal performance with 31.2% accuracy.

• Our investigation also shows that each Bayesian algorithm works based on its assumption regardless of the data. Gaussian Naïve Bayes performed better on anomaly detection in network intrusion because of its assumption that features follow normal distributions, which are continuous. So, we are sure that the algorithm factored in all target categories.
II. RELATED WORK

a. MULTINOMIAL NAÏVE BAYES

In multinomial Naïve Bayes, features are assumed to come from a multinomial distribution [3], [6], [17], which gives the probability of observing counts across a few categories [13], [14]; this makes it very effective for prediction when the features are discrete and not continuous. The likelihood of a class is the product of the probabilities of the terms it contains. The probability of a document d being in class c is computed as

P(c | d) ∝ P(c) ∏_{1≤k≤n_d} P(t_k | c)    (1)

where P(t_k | c) is the conditional probability of term t_k, interpreted as a measure of the evidence contributed by t_k to the fact that c is the correct class, and P(c) is the prior probability of a document occurring in class c. If a document's terms (t_1, t_2, t_3, …, t_{n_d}) do not provide clear evidence for one class versus another, we choose the one that has the higher prior probability. The best class in Bayesian classification is the most likely, or maximum a posteriori (MAP), class:

c_map = argmax_{c∈C} P(c | d) = argmax_{c∈C} P(c) ∏_{1≤k≤n_d} P(t_k | c)    (2)

P(c) and P(t_k | c) are estimated from the training set, as we will see in a moment. Since conditional probabilities are multiplied together, one for each position 1 ≤ k ≤ n_d, this can result in floating-point underflow. It is therefore better to do the computation by adding logarithms of probabilities instead of multiplying the actual probabilities, using log(xy) = log(x) + log(y); the class with the highest log-probability score is still the most probable. Hence,

c_map = argmax_{c∈C} [log P(c) + Σ_{1≤k≤n_d} log P(t_k | c)]    (3)

is the actual maximization done in the implementation of Naïve Bayes: maximizing the log-posterior to get the correct class for the input. Multinomial Naïve Bayes is good for training a model when we have discrete variables and the distribution is multinomial in nature. The assumption that the distribution is multinomial, coupled with the additional assumption of independence among the features, makes multinomial Naïve Bayes a drawback when the two assumptions are not valid in the test or training data.
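To make the log-space maximization in Eq. (3) concrete, here is a minimal sketch in Python using scikit-learn's MultinomialNB (the toy count data is our own illustration, not the paper's dataset or code):

    # Minimal sketch: multinomial Naive Bayes over discrete count features.
    # The toy counts below are illustrative only.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Rows are samples; columns are non-negative counts of discrete events.
    X_train = np.array([[2, 1, 0],
                        [3, 0, 1],
                        [0, 4, 2],
                        [1, 3, 3]])
    y_train = np.array([0, 0, 1, 1])

    clf = MultinomialNB(alpha=1.0)  # Laplace smoothing of the count estimates
    clf.fit(X_train, y_train)

    # predict_log_proba works in log space, combining log P(c) with
    # sum_k log P(t_k | c) as in Eq. (3), so small probabilities never underflow.
    print(clf.predict_log_proba(np.array([[1, 0, 2]])))
    print(clf.predict(np.array([[1, 0, 2]])))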
b. GAUSSIAN NAÏVE BAYES

As a typical Bayesian classifier, Gaussian Naïve Bayes assumes the value of each feature to be completely independent of the others, and it assumes that each value of a continuous feature is distributed according to a Gaussian distribution [7], [8]. Hence, the probability of an individual feature is assumed to be

P(x_i | y) = (1 / √(2πσ_y²)) exp(−(x_i − μ_y)² / (2σ_y²))    (4)

where all parameters are independent of each other. One of the simplest ways to approach this is to assume that the data has a Gaussian distribution without any covariance. All we have to do is find the mean and standard deviation for each label to form our model, with the knowledge that our variable is normally and continuously distributed over −∞ < x < +∞ and that the total area under the model curve is 1.

Figure 1. Probability under Gaussian Distribution
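As a sketch of this fit-the-mean-and-variance view, the per-class parameters of Eq. (4) can be estimated and inspected directly in Python with scikit-learn's GaussianNB (the synthetic data is our own illustration, not the paper's code; scikit-learn ≥ 1.0 exposes the variances as var_):

    # Minimal sketch: GaussianNB fits one mean and one variance per feature
    # per class (Eq. 4). Synthetic continuous data, illustrative only.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))  # class 0
    X1 = rng.normal(loc=3.0, scale=1.0, size=(100, 2))  # class 1
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)

    clf = GaussianNB().fit(X, y)

    # theta_ and var_ are the per-class mu_y and sigma_y^2 of Eq. (4).
    print(clf.theta_)  # class means, shape (n_classes, n_features)
    print(clf.var_)    # class variances, shape (n_classes, n_features)
    print(clf.predict(np.array([[2.5, 2.8]])))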
c. BERNOULLI NAÏVE BAYES

Bernoulli Naïve Bayes [5], [10], [11] assumes the data is discrete and that the distribution is Bernoulli. Its main feature is the acceptance of binary values such as yes or no, true or false, 0 or 1, success or failure as input. Its assumption of a discrete Bernoulli distribution is

P(x) = P[X = x] = p^x (1 − p)^(1−x),  x ∈ {0, 1}    (5)

Since x can be either 0 or 1 but nothing else, this makes it suitable for binary classification, and its classification rule is

P(x_i | y) = P(i | y) x_i + (1 − P(i | y))(1 − x_i)    (6)

Its computation is based on binary occurrence information (Figure 2), and so it neglects the number of occurrences, or frequency; this makes Bernoulli unsuitable for certain tasks such as document classification. A single occurrence of the word "geography" in a physics textbook can cause the book to be classified as geography, since Bernoulli Naïve Bayes does not factor in the number of occurrences, unlike multinomial.

Figure 2. Bernoulli Algorithm Stepwise
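A minimal sketch of this occurrence-only behavior in Python with scikit-learn (our own illustration, not the paper's implementation): BernoulliNB binarizes its inputs, so a count of 5 carries exactly the same evidence as a count of 1, whereas MultinomialNB weights the evidence by the counts.

    # Minimal sketch: BernoulliNB only sees presence/absence (Eqs. 5-6).
    # Toy count features, illustrative only.
    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    X_train = np.array([[5, 0, 1],
                        [4, 1, 0],
                        [0, 3, 6],
                        [1, 2, 7]])
    y_train = np.array([0, 0, 1, 1])

    # binarize=0.5 maps every positive count to 1, discarding frequency.
    bern = BernoulliNB(binarize=0.5).fit(X_train, y_train)
    mult = MultinomialNB().fit(X_train, y_train)

    x1 = np.array([[1, 0, 1]])   # single occurrences
    x5 = np.array([[5, 0, 5]])   # repeated occurrences
    print(bern.predict_log_proba(x1))  # identical to the next line:
    print(bern.predict_log_proba(x5))  # counts are ignored after binarization
    print(mult.predict_log_proba(x1))  # differs from the next line:
    print(mult.predict_log_proba(x5))  # multinomial evidence scales with counts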
III. RESEARCH METHODOLOGY

a. FEATURE SELECTION

Feature selection, also referred to as attribute selection, is the process of extracting the most relevant features from the dataset so that the classifier trains a model with better overall performance. Since the presence of a large number of irrelevant features in a dataset increases both the training time and the risk of overfitting, an effective feature selection method is a necessity. We use the Chi-square test for the categorical features in the KDD dataset. The Chi-square calculation between each feature and the target helps determine whether there is any association between two categorical variables in the dataset and whether such an association will influence the prediction; this ultimately helps in selecting the desired number of features with the best Chi-square test scores. The Chi-square test is a technique to determine the relationship between categorical variables: the chi-square value is calculated between each feature and the target variable, and the desired number of features with the best chi-square values is selected. There are three main types of Chi-square tests: the test of goodness of fit, the test of independence, and the test for homogeneity; all three rely on the same formula to compute a test statistic. Since unsupervised feature selection techniques tend to ignore the target variable, as in methods that remove redundant variables using correlation, we chose supervised feature selection techniques, which use the target variable and can remove irrelevant variables from the dataset.

Figure 3. Chi-square Feature Selection Test result
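A minimal sketch of this supervised selection step in Python with scikit-learn (our own illustration; the 79 features and six label categories match the paper's dataset, but the synthetic values and the cutoff k=20 are assumptions):

    # Minimal sketch: supervised chi-square feature selection.
    # chi2 requires non-negative features; k=20 is an assumed cutoff.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    rng = np.random.default_rng(0)
    X = rng.integers(0, 10, size=(1000, 79)).astype(float)  # 79 features
    y = rng.integers(0, 6, size=1000)                       # 6 label categories

    selector = SelectKBest(score_func=chi2, k=20)
    X_selected = selector.fit_transform(X, y)

    # A chi-square score is computed between each feature and the target;
    # only the k best-scoring features are kept for training.
    print(selector.scores_[:5])
    print(X_selected.shape)  # (1000, 20)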
b. EXPERIMENTAL SETUP

We did a separate implementation for each Bayesian variant (Gaussian, Multinomial, and Bernoulli) using the KDD dataset obtained from Kaggle. It has shape (692703, 79). The dataset could not be used directly without preprocessing due to the presence of several nil values, missing values, and negative values in it. Our preprocessing of the dataset involved the following steps:

1. Visualizing the list of categorical variables in the dataset
2. Viewing the frequency counts and distribution of each categorical variable
3. Checking for missing values in the categorical variables
4. Checking for missing values in, and visualizing, the numerical variables

Altogether, there were thousands of NaN values, null values, and negative values, so we adopted a statistical approach based on the standard deviation and mean of each column. There are 79 columns in the dataset, so it is wise to treat each column as a separate entity when addressing missing values; for each individual column, a combination of that column's standard deviation and mean was used to fill the missing and null values. There are six categories in the label which need encoding, so one-hot encoding was used to encode the categorical variable, as sketched below. One of the most important tasks in the data preprocessing was feature selection (Figure 3): since our dataset has 79 features, only the relevant features are needed to get an optimal result. A combination of the Chi-square feature selection test and the confusion matrix was used to select only the relevant features in the dataset to train our model, while the irrelevant features were ignored. Our implementation gave separate results and observations for each variant of Naïve Bayes.
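A minimal sketch of this per-column imputation and label encoding in Python with pandas (our own illustration: the column names are hypothetical stand-ins, and since the paper does not specify the exact mean/standard-deviation combination used, the fill rules here are assumptions):

    # Minimal sketch: per-column statistical imputation plus one-hot encoding.
    # Column names and values are hypothetical, not the actual KDD columns.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "flow_duration": [1.2, np.nan, 3.4, -1.0, 2.2],
        "packet_count": [10.0, 20.0, np.nan, 40.0, 50.0],
        "label": ["BENIGN", "DoS Hulk", "BENIGN", "Heartbleed", "DoS Hulk"],
    })

    for col in df.columns.drop("label"):
        mu, sigma = df[col].mean(), df[col].std()
        df[col] = df[col].fillna(mu)           # fill NaN/null with the column mean
        df.loc[df[col] < 0, col] = mu + sigma  # one assumed rule for negatives

    # One-hot encode the label (six categories in the real dataset,
    # three in this toy frame).
    encoded = pd.get_dummies(df["label"])
    print(df)
    print(encoded)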

Figure 4. Confusion matrix of each Bayes variant on the security dataset

It was important to visualize the test and train accuracy to ensure the classifier does not memorize the training data, as that would cause a very wide gap between train and test accuracy, which is overfitting; as seen in Figure 6, the closeness between the train and validation accuracy validates the implementation. Experimental results show that, among the variants of the Bayesian algorithm, Bernoulli has an accuracy of 69.9% test (71% train), Multinomial has an accuracy of 31.2% test (31.2% train), and Gaussian has an accuracy of 81.69% test (82.84% train) on the security dataset (Figure 5), (Figure 6). Multinomial gave an abysmal performance on the KDD dataset and was the worst of the three (Figure 4), (Figure 5), (Figure 6).

Figure 5. Box plot performance comparison

Figure 6. Variation in train and test performance across the Bayesian variants

Comparing the accuracy and performance of each variant would be incomplete without examining what might be responsible for the differences in the performance of each Bayesian variant. We went back and thoroughly re-analyzed each of the relevant features and the categorical label, and compared our observations to each variant's assumption, since each of the three variants of the Bayesian algorithm has a different assumption. Our label has six categories of attack ['BENIGN', 'DoS slowloris', 'DoS Slowhttptest', 'DoS Hulk', 'DoS GoldenEye', 'Heartbleed'], all encoded between 0 and 5. There was no relevant observation in the dataset except the label categories, which range from 0 to 5 and from which the predicted output is continuously selected. This follows a normal distribution and explains why the Gaussian variant of the Bayesian classifier works best for this implementation. The fact that the label distribution is not discrete but continuous clearly indicates why Multinomial Naïve Bayes has an abysmal performance. Bernoulli has a mixed performance in between because it takes the top two of the label categories and bases its prediction on them, as it assumes binary classification.
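As a sketch of this side-by-side comparison in Python with scikit-learn (our own illustration: the synthetic data and the 80/20 split are assumptions, so the accuracies will not reproduce the paper's numbers):

    # Minimal sketch: train the three Naive Bayes variants on one split and
    # compare train vs. test accuracy. Synthetic data, illustrative only.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(size=(5000, 20)))  # non-negative, for MultinomialNB
    y = rng.integers(0, 6, size=5000)        # six label categories, as in the paper

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    for name, clf in [("Gaussian", GaussianNB()),
                      ("Multinomial", MultinomialNB()),
                      ("Bernoulli", BernoulliNB())]:
        clf.fit(X_tr, y_tr)
        print(name,
              "train:", round(clf.score(X_tr, y_tr), 3),
              "test:", round(clf.score(X_te, y_te), 3))

A large gap between the two scores for any variant would indicate overfitting; the closeness reported in Figure 6 is what this check is meant to confirm.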
IV. CONCLUSION

We conclude that the performance of each variant of the Bayesian classifier is impacted by its assumption, and that each assumption is the single most important factor influencing its performance and accuracy, as we can see in the normal and continuous distribution of our label categories, which causes Gaussian Bayes to work best among the Bayesian variants on the security dataset. If the dataset is discrete, as in document classification, we can expect Multinomial to work best, and Bernoulli to work best for binary classification such as true/false, yes/no, and so on. We showed that the Gaussian Naïve Bayes algorithm performed best among all the Bayes-based algorithms on intrusion detection, and we also showed that each variant of the Bayes algorithm blindly follows its assumption, which is the single most important factor that influences its performance.
REFERENCES

[1] S. Shitharth, P. R. Kshirsagar, P. K. Balachandran, K. H. Alyoubi and A. O. Khadidos, "An Innovative Perceptual Pigeon Galvanized Optimization (PPGO) Based Likelihood Naïve Bayes (LNB) Classification Approach for Network Intrusion Detection System," in IEEE Access, vol. 10, pp. 46424-46441, 2022, doi: 10.1109/ACCESS.2022.3171660.

[2] A. Kelly and M. A. Johnson, "Investigating the Statistical Assumptions of Naïve Bayes Classifiers," 2021 55th Annual Conference on Information Sciences and Systems (CISS), Baltimore, MD, USA, 2021, pp. 1-6, doi: 10.1109/CISS50987.2021.9400215.

[3] V. Vijay and P. Verma, "Variants of Naïve Bayes Algorithm for Hate Speech Detection in Text Documents," 2023 International Conference on Artificial Intelligence and Smart Communication (AISC), Greater Noida, India, 2023, pp. 18-21, doi: 10.1109/AISC56616.2023.10085511.

[4] A. V. P, R. D and S. N. S. S, "Football Prediction System using Gaussian Naïve Bayes Algorithm," 2023 Second International Conference on Electronics and Renewable Systems (ICEARS), Tuticorin, India, 2023, pp. 1640-1643, doi: 10.1109/ICEARS56392.2023.10085510.

[5] G. Singh, B. Kumar, L. Gaur and A. Tyagi, "Comparison between Multinomial and Bernoulli Naïve Bayes for Text Classification," 2019 International Conference on Automation, Computational and Technology Management (ICACTM), London, UK, 2019, pp. 593-596, doi: 10.1109/ICACTM.2019.8776800.

[6] V. K. V and P. Samuel, "A Multinomial Naïve Bayes Classifier for identifying Actors and Use Cases from Software Requirement Specification documents," 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 2022, pp. 1-5, doi: 10.1109/CONIT55038.2022.9848290.

[7] A. A. Rafique, A. Jalal and A. Ahmed, "Scene Understanding and Recognition: Statistical Segmented Model using Geometrical Features and Gaussian Naïve Bayes," 2019 International Conference on Applied and Engineering Mathematics (ICAEM), Taxila, Pakistan, 2019, pp. 225-230, doi: 10.1109/ICAEM.2019.8853721.

[8] A. H. Jahromi and M. Taheri, "A non-parametric mixture of Gaussian naive Bayes classifiers based on local independent features," 2017 Artificial Intelligence and Signal Processing Conference (AISP), Shiraz, Iran, 2017, pp. 209-212, doi: 10.1109/AISP.2017.8324083.

[9] T. Comfort Olayinka, C. Christian Ugwu, O. Joseph Okhuoya, A. Olusọla Adetunmbi and O. Solomon Popoola, "Combating Network Intrusions using Machine Learning Techniques with Multilevel Feature Selection Method," 2022 IEEE Nigeria 4th International Conference on Disruptive Technologies for Sustainable Development (NIGERCON), Lagos, Nigeria, 2022, pp. 1-5, doi: 10.1109/NIGERCON54645.2022.9803098.

[10] S. Mukherjee and N. Sharma, "Intrusion Detection using Naive Bayes Classifier with Feature Reduction," Procedia Technology, pp. 119-128, 2012.

[11] A. Bhardwaj, S. S. Chandok, A. Bagnawar, S. Mishra and D. Uplaonkar, "Detection of Cyber Attacks: XSS, SQLI, Phishing Attacks and Detecting Intrusion Using Machine Learning Algorithms," 2022 IEEE Global Conference on Computing, Power and Communication Technologies (GlobConPT), New Delhi, India, 2022, pp. 1-6, doi: 10.1109/GlobConPT57482.2022.9938367.

[12] Y. Li, X. Wusheng and R. Qing, "Research on the Performance of Machine Learning Algorithms for Intrusion Detection System," CISAI 2020, Journal of Physics: Conference Series, vol. 1693, 012109, IOP Publishing, 2020.

[13] A. Yerlekar, N. Mungale and S. Wazalwar, "A multinomial technique for detecting fake news using the Naive Bayes Classifier," 2021 International Conference on Computational Intelligence and Computing Applications (ICCICA), Nagpur, India, 2021, pp. 1-5, doi: 10.1109/ICCICA52458.2021.9697244.

[14] A. Yerlekar, N. Mungale and S. Wazalwar, "A multinomial technique for detecting fake news using the Naive Bayes Classifier," 2021 International Conference on Computational Intelligence and Computing Applications (ICCICA), Nagpur, India, 2021, pp. 1-5, doi: 10.1109/ICCICA52458.2021.9697244.

[15] A. Kumar and S. Kumar, "Intrusion detection based on machine learning and statistical feature ranking techniques," 2023 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 2023, pp. 606-611, doi: 10.1109/Confluence56041.2023.10048802.

[16] T. Ige and S. Adewale, "Implementation of data mining on a secure cloud computing over a web API using supervised machine learning algorithm," International Journal of Advanced Computer Science and Applications, vol. 13, no. 5, pp. 1-4, 2022, doi: 10.14569/IJACSA.2022.0130501.

[17] T. Ige and S. Adewale, "AI powered anti-cyber bullying system using machine learning algorithm of multinomial naïve Bayes and optimized linear support vector machine," International Journal of Advanced Computer Science and Applications, vol. 13, no. 5, pp. 5-9, 2022, doi: 10.14569/IJACSA.2022.0130502.

[18] A. Okomayin, T. Ige and A. Kolade, "Data Mining in the Context of Legality, Privacy, and Ethics," International Journal of Research and Scientific Innovation (IJRSI), vol. 10, no. 7, pp. 10-15, July 2023, doi: 10.51244/IJRSI.2023.10702.
