Intrusion Detection System Using Machine Learning Techniques A Review

This document reviews research on using machine learning techniques for intrusion detection systems. It discusses different machine learning approaches like single classifiers, hybrid classifiers, and ensemble classifiers. It compares several papers based on the accuracy of results, commonly used classification algorithms, and datasets. The review aims to identify best practices and guide future work in this area.



A Review on Intrusion Detection System using Machine Learning Techniques

Conference Paper · February 2021

DOI: 10.1109/ICCCIS51004.2021.9397121


Proceedings of the International Conference on Smart Electronics and Communication (ICOSEC 2020)
IEEE Xplore Part Number: CFP20V90-ART; ISBN: 978-1-7281-5461-9

Intrusion Detection System using Machine Learning Techniques: A Review

Usman Shuaibu Musa, Megha Chhabra, Aniso Ali, Mandeep Kaur
Department of Computer Science & Engineering, Sharda University, Gr. Noida, UP, India
usmanmusa04@gmail.com, Megha.chhbr@gmail.com, Eng.anisafiqi@gmail.com, mandeep.kaur@sharda.ac.in

Abstract: The rapid growth in the use of computer networks raises the problem of maintaining network availability, integrity, and confidentiality. This requires network administrators to adopt various types of intrusion detection systems (IDS) that help monitor network traffic for unauthorized and malicious activities. Intrusion is the breach of security policy with malicious intent. An intrusion detection system therefore monitors traffic flowing on a network through computer systems to search for malicious activities and known threats, sending alerts when it finds them. The detection of malicious activities is of two types. The first is misuse or signature-based detection, in which the IDS collects information, analyzes it and then compares it to the attack signatures stored in a large database. The second is anomaly detection, which treats any action that deviates from normal behavior as malicious. This paper presents an overview of various works on building an efficient IDS using single, hybrid and ensemble machine learning (ML) classifiers, evaluated using seven different datasets. The results obtained by these works are discussed and compared, which gives a clear path and guide for future work.

Keywords: Intrusion Detection System, Machine Learning, security, Anomaly detection, Misuse detection, Classifiers, Ensemble, Hybrid.

I. INTRODUCTION

An Intrusion Detection System (IDS) is used to detect threats or malicious activities. The IDS acts as a network-level defense to secure a computer network. The intrusion or threat comes in the form of an anomaly in a network. Intruders take advantage of network vulnerabilities, such as weak security policies and software bugs like buffer overflows, to exploit network flaws, resulting in the violation of the network’s security. The intruders might be system users with fewer privileges who intend to gain more access authority, or hackers who are common internet users that intend to steal or damage sensitive information on the victim’s system [1]. Intrusion detection techniques may be signature based or anomaly based. Signature-based detection monitors packet flow in the network and compares it with the previously identified, configured signatures of known attacks, whereas the anomaly detection technique detects attacks by comparing defined legitimate user parameters with events that deviate from those parameters [2]. The IDS generates logs and alerts the network administrator after the occurrence of malicious activity in a network [24].

An intrusion detection system may be a host-based IDS (HIDS) or a network-based IDS (NIDS). Host-based intrusion detection systems are adopted by network administrators to monitor and analyze activities on a particular machine. HIDS often have the advantage over NIDS that information which is encrypted while travelling over the network can still be accessed on the host. Their disadvantage is that HIDS are very difficult to manage, as information has to be configured and managed for every host. Further, HIDS can be disabled by certain types of denial of service attack. NIDS, on the other hand, are software or hardware-based intrusion detection devices intelligently distributed within networks that passively monitor traffic flowing over the network through the devices on which they reside. NIDS have dual interfaces, one used for listening to network conversations and the other for control and reporting. NIDS have the advantage that monitoring a large network may require only a few well-placed NIDS, and a NIDS is mostly invisible to many attackers, thus it is secured against attacks. Their disadvantage is that NIDS find it difficult to detect an attack that strikes during a period of high traffic.

The earliest intrusion detection systems were mostly signature based, that is, the detection of malicious activities depends on pre-defined and configured signatures of known attacks. This is a major setback, as the database of known attack signatures of such intrusion detection systems needs to be constantly updated, since intruders find new ways to exploit networks on a frequent basis [8]. The emergence of machine learning made it possible to perform anomaly detection, which detects unknown attacks by comparing legitimate user parameters with events that deviate from such benign user activity. Over the years, several machine learning techniques have been adopted with the purpose of improving the detection rate, reducing false positives and increasing the predictive accuracy of IDS. This review examines how single, hybrid, and ensemble ML techniques have performed in intrusion detection systems [7].
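
As a minimal sketch of the two detection styles described in this section, the following Python fragment contrasts signature matching with a simple statistical anomaly test. The signature strings, the requests-per-minute feature and the three-standard-deviation threshold are illustrative assumptions and are not taken from any of the reviewed works.

```python
from statistics import mean, stdev

# Hypothetical signature database: substrings typical of known attacks.
KNOWN_SIGNATURES = {
    "sql_injection": "' OR 1=1",
    "path_traversal": "../../",
}


def signature_detect(payload):
    """Misuse detection: return the names of known signatures found in the payload."""
    return [name for name, pattern in KNOWN_SIGNATURES.items() if pattern in payload]


def fit_baseline(normal_values):
    """Learn a legitimate-user profile as the mean and standard deviation."""
    return mean(normal_values), stdev(normal_values)


def anomaly_detect(value, baseline, threshold=3.0):
    """Anomaly detection: flag values more than `threshold` deviations from normal."""
    mu, sigma = baseline
    return abs(value - mu) > threshold * sigma


# Requests per minute observed for a benign user (made-up numbers).
baseline = fit_baseline([20, 22, 19, 21, 23, 20, 18, 22])

print(signature_detect("GET /login?user=admin' OR 1=1"))  # ['sql_injection']
print(anomaly_detect(21, baseline))   # False: close to the learned profile
print(anomaly_detect(400, baseline))  # True: far outside the learned profile
```

The signature check can only recognise patterns already stored in its database, while the anomaly check can flag previously unseen behavior at the cost of possible false positives, which is the trade-off the reviewed works try to balance.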


The paper is organized into five sections. The first section covers the introduction of the paper content. The overview of the research papers is discussed in the second section, which mentions various ML techniques used for building IDS. The third section provides the comparison of papers based on the accuracy of results, frequently used classification algorithms and datasets. Finally, the discussion with considerations for future research work in ML-based intrusion detection systems and the conclusion are presented in the fourth and fifth sections respectively.

Fig 1: Block Diagram of Intrusion Detection System

II. RESEARCH PAPER OVERVIEW

A. Machine Learning (ML)
As a branch of Artificial Intelligence (AI), ML can be defined as a technique in which computers are trained to automatically learn and improve or optimize a performance criterion without being explicitly programmed, using past experience or example data. A machine learning model is trained on a set of data according to features of interest so that different classes can be predicted [22]. Broadly, machine learning is categorized into supervised, unsupervised and reinforcement learning algorithms.

B. Single Classifiers
Any classifier that is made up of only one classification algorithm is known as a single machine learning classifier. Several intrusion detection systems adopt single machine learning classification models. SVM, artificial neural network, decision tree, KNN and Naïve Bayes are all single machine learning classifiers and have been applied in different intrusion detection systems studied in this review.

C. Hybrid Classifiers
A hybrid classifier is a combination of two or more ML algorithms for the purpose of boosting the performance of the resulting aggregated classifier in an intrusion detection system. The reason for employing a hybrid approach in IDS is to improve efficiency, as hybrid systems have been reported to perform more efficiently than IDS based on a single machine learning classifier. The first level of a hybrid classifier can be either a supervised or an unsupervised ML algorithm [23].

D. Ensemble Classifier
An ensemble classifier is a group of multiple machine learning classifiers, often called weak learners, whose individual decisions are combined in some manner to provide better predictive performance as a consensus decision. Thus, an ensemble classifier provides improved performance by aggregating the results of weak learners. Several research works that adopted ensemble techniques show a great degree of accuracy and predictive performance. Methods for constructing ensembles include bagging and random forest, majority voting, randomness injection, feature-selection ensemble, and error-correcting output coding [10].

III. REVIEW AND COMPARISON OF RELATED WORKS

In the work done by Alkasassbeh and Almseidin [9], three classification techniques were used to address the issue of low accuracy often faced by IDS that adopt artificial neural networks with fuzzy clustering when dealing with low-frequency attacks. They improved the accuracy by splitting the heterogeneous set of training data into homogeneous subsets, thereby reducing the complexity of each training set. J48 trees, Multilayer Perceptron (MLP) and Bayes network algorithms were used in the proposed work, out of which J48 trees returned the best accuracy. One major drawback of their work is that feature selection was not applied to get rid of unrelated, redundant and unwanted features.

An intrusion detection system based on single machine learning classifiers was built by Bhavani et al. [17] using random forest and decision tree techniques on the NSL-KDD dataset. The random forest classifier returned an accuracy of 95.323%, the better of the two results. The problems of low detection rate and false positives were not solved by the proposed work [17]. Single machine learning algorithms were also used to detect network intrusion in the work of Ponthapalli et al. The algorithms used are decision tree, logistic regression, random forest and support vector machine [19]. The NSL-KDD dataset was used in the work. The research showed that the intrusion detection system performs best with the random forest classifier. They also found that the random forest classifier has the least execution time. The proposed work has the limitation of performing efficiently only with a single dataset.

An ensemble-based IDS was implemented by Marzia Z. and Chung-Horng L. [10], in which the results of multiple supervised and unsupervised machine learning algorithms were aggregated using a voting classifier. The work boosts the accuracy and performance of current intrusion detection systems. They adopted the Kyoto2006+ dataset, which is more promising than the widely used KDD Cup ’99 dataset, as the latter is relatively old. Their work attains a certain level of accuracy, but the recall is quite low in a few cases, which indicates a high false negative rate (FNR).

A real-time hybrid intrusion detection approach was proposed by Dutt I. et al. [11], in which the misuse approach was used to detect well-known attacks and the anomaly approach to detect novel attacks. In this work a high detection rate was achieved because patterns of intrusion that escaped misuse detection were identified as attacks by the anomaly detection technique. The model’s accuracy increased incrementally each day up to a value of 92.65% on the last day of the experiment. Also, as the model learns and trains the system each day, the rate of false
negative decreases sharply. The issue of slow detection rate persists when the model is applied to very large data.

A work by Verma et al. [12] shows that anomaly-based intrusion detection has room for improvement, especially in the false positive rate. Extreme gradient boosting (XGBoost) and adaptive boosting (AdaBoost) learning algorithms were applied to the NSL-KDD dataset. Though an accuracy of 84.253% was obtained, the performance needs to be improved by applying hybrid or ensemble machine learning classifiers.

Some of the previously proposed works have the limitation of not applying feature selection to the datasets they worked with to eliminate irrelevant, unwanted and redundant features. In the work proposed by Kazi Abu Taher et al. [13], different ML models with different ML algorithms were evaluated on the NSL-KDD dataset, and feature selection was applied using the wrapper method. The accuracy obtained was better than that of previous works that adopted the same dataset. A major drawback is that zero-day detection remains unsolved due to the high false positive rate of the model, and the work focused only on signature-based attacks, thereby leaving novel attacks undetected.

Some previous works on intrusion detection systems lack the ability to work efficiently on different datasets. Zhou et al. [14] presented a novel intrusion detection system that combines an ensemble classifier with feature selection, which provides improved efficiency and high-accuracy detection of intrusions. The work was carried out using three different datasets: the familiar NSL-KDD dataset and two recently published datasets, CIC-IDS2017 and AWID. For feature selection, a CFS-BA based approach was used. The ensemble-based approach increases the multiclass classification performance on unbalanced datasets. The model showed its highest accuracy on the CIC-IDS2017 dataset, giving an accuracy of 99.90%.

Ahmad Iqbal and Shabib Aftab [15] made use of both a feed-forward neural network and a pattern recognition neural network. In addition, they applied Bayesian regularization and scaled conjugate gradient training techniques to train the artificial neural network based IDS. Various performance metrics were used to evaluate the efficiency and capacity of the proposed work. Each of the two models was found to outperform the other on different performance measures for various attack detections. Overall, the feed-forward artificial neural network provided the better accuracy of 98.0742%. The work needs to be improved by testing the model on different datasets.

An ensemble-based approach that combines decision tree, Bayes classifier, RNN-LSTM and random forest was proposed by Vinoth Y. K. and Kamatchi K. [16]. This work contributed to handling imbalanced data by choosing the most effective features for training to detect intrusion and alert system administrators as to whether the observed behavior is normal or abnormal. Though the model achieves some degree of accuracy on NSL-KDD, an experimental trial on the most updated datasets needs to be carried out.

Maniriho et al. proposed a work on intrusion detection in which a single machine learning classifier (K-Nearest Neighbor) and an ensemble technique (Random Committee) were used on two different datasets, NSL-KDD and UNSW NB-15 [18]. Feature selection was applied in this work to generate and use only the most relevant feature subsets for the adopted datasets. The results showed that the ensemble classifier approach outperforms the single machine learning technique, with a misclassification gap of 1.19% and 1.62% on the NSL-KDD and UNSW NB-15 datasets respectively. The issues of large data size, high dimensionality, and standard performance of IDS techniques need to be addressed further in upcoming research.

A stacking ensemble approach using heterogeneous datasets was proposed by Rajagopal et al. The ensemble technique consists of logistic regression, K-Nearest Neighbor, random forest and support vector machine. The work made use of the most updated datasets, UNSW NB-15 and UGR ’16. UNSW NB-15 was captured in an emulated environment, while UGR ’16 was captured in a real network traffic environment [20]. The stacking ensemble approach boosted the prediction accuracy and detection speed of the IDS. The model returned its highest accuracy of 98.71% on UGR ’16. However, more experiments need to be done on different datasets that include the most recent attack categories.

Perez D. et al. proposed a hybrid network-based intrusion detection system using multiple hybrid machine learning techniques on the NSL-KDD dataset [8]. The supervised machine learning technique, Neural Network, was combined with the unsupervised technique, K-Means clustering, with feature selection. Another combination consisted of support vector machine (SVM) with K-Means clustering. The results clearly showed that such supervised and unsupervised techniques complement each other, which boosts the performance of the IDS. The combination of SVM and K-Means with feature selection returned the best accuracy. More hybrid models need to be built to improve the false positive rate.

A. Comparison Of Related Work
In this review, several papers published from 2015 to 2020 have been studied. Single, hybrid and ensemble classifiers have been widely used in the studied works on intrusion detection systems. Table 1 compares the algorithms adopted in the studied research articles, mainly in terms of accuracy.

Fig 2: Grouping of Research papers based on type of classifier used.

As can be clearly observed in Fig 3, the ensemble classifier returns the highest accuracy whenever it is employed over the years.
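
The consensus decision made by an ensemble classifier (Section II-D) can be illustrated with plain majority voting, one of the construction methods listed in [10]. The sketch below is only an illustration: the three rule-based weak learners, their thresholds and the feature names (count, failed_logins, error_rate) are hypothetical and do not reproduce any of the reviewed models.

```python
from collections import Counter


def rule_high_rate(conn):
    """Weak learner 1: a very high connection count looks like flooding."""
    return "attack" if conn["count"] > 100 else "normal"


def rule_failed_logins(conn):
    """Weak learner 2: repeated failed logins look like password guessing."""
    return "attack" if conn["failed_logins"] >= 3 else "normal"


def rule_error_rate(conn):
    """Weak learner 3: a high share of connection errors looks like probing."""
    return "attack" if conn["error_rate"] > 0.5 else "normal"


def majority_vote(classifiers, conn):
    """Combine the individual decisions into one consensus label."""
    votes = Counter(clf(conn) for clf in classifiers)
    return votes.most_common(1)[0][0]


ensemble = [rule_high_rate, rule_failed_logins, rule_error_rate]

flood = {"count": 150, "failed_logins": 0, "error_rate": 0.9}
benign = {"count": 5, "failed_logins": 4, "error_rate": 0.1}

print(majority_vote(ensemble, flood))   # attack (two of three learners agree)
print(majority_vote(ensemble, benign))  # normal (one learner is outvoted)
```

With an odd number of learners and two labels a tie cannot occur. In the reviewed works the weak learners are trained classifiers rather than fixed rules, and other combination schemes such as stacking or bagging replace the simple vote.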

B. Datasets Used In The Research Works
A dataset is a collection of instances. An instance is a single row of data. Each instance is made up of multiple features, often called attributes of a data instance. The most popular dataset in the studied works is NSL-KDD.

In total, seven different datasets have been adopted in those papers: KDD Cup ’99, NSL-KDD, Kyoto2006+, AWID, CIC-IDS2017, UNSW NB-15 and UGR ’16. The KDD Cup ’99 data was first used for the KDD Cup 99 competition. The dataset has a total of 41 features in each input pattern record, which represents a TCP connection. The features are both qualitative and quantitative in nature [21]. As the modified version of the KDD Cup ’99 dataset, NSL-KDD solves some of the problems of the original. It is made up of 41 features, of which 38 are numeric attributes and only three are nominal attributes. It contains 24 training attack types, with the test set having an additional 14 attack types. The training set has a total of 125,973 data points. 53.4% of the training data points are classified as normal connections while the rest (46.6%) are classified as attacks [12]. The Kyoto2006+ dataset was built from real traffic data collected using 348 honeypots at Kyoto University over three years. This dataset contains 24 features, of which 14 are the same as in the original KDD Cup ’99 dataset. The remaining 10 additional features, which include six information-related features, shed light on some challenges often faced when the KDD Cup ’99 dataset is employed [10].
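
Because NSL-KDD mixes numeric and nominal attributes, the nominal ones usually have to be converted to numbers before most classifiers can be trained. One common option, shown here as a hedged sketch, is one-hot encoding. The three feature names and the sample values are assumptions chosen for illustration; the reviewed papers do not state which encoding they used.

```python
def build_vocab(records, nominal_keys):
    """Collect the sorted set of values seen for each nominal feature."""
    return {key: sorted({rec[key] for rec in records}) for key in nominal_keys}


def encode(record, vocab):
    """Return a numeric vector: numeric features as-is, nominal ones one-hot encoded."""
    vector = []
    for key, value in record.items():
        if key in vocab:
            # One slot per known category, 1.0 only for the matching one.
            vector.extend(1.0 if value == known else 0.0 for known in vocab[key])
        else:
            vector.append(float(value))
    return vector


# Three made-up instances with two numeric features and one nominal feature.
records = [
    {"duration": 0, "protocol_type": "tcp", "src_bytes": 491},
    {"duration": 2, "protocol_type": "udp", "src_bytes": 146},
    {"duration": 0, "protocol_type": "icmp", "src_bytes": 1032},
]

vocab = build_vocab(records, ["protocol_type"])
print(vocab)                      # {'protocol_type': ['icmp', 'tcp', 'udp']}
print(encode(records[0], vocab))  # [0.0, 0.0, 1.0, 0.0, 491.0]
```

Each nominal attribute expands into as many columns as it has distinct values, so the encoded vector is wider than the raw record, which is one reason feature selection matters for these datasets.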
TABLE 1: Comparison of algorithms adopted in the studied research articles
TITLE | ALGORITHM | DATASET | RESULT (ACCURACY) | FINDING | DRAWBACK
IDS using bagging with partial decision tree base classifier [1] | 1) Genetic Algorithm (GA) based feature selection 2) Bagged Classifier with partial decision tree | NSL-KDD99 | Bagged Naïve Bayes=89.4882%; Naïve Bayes=89.6002%; PART=99.6991%; C4.5=99.6634%; Bagged C4.5=99.7158%; Bagged PART=99.7166% | Reduced high false alarm | High time was required to build the model
IDS based on combining cluster centers and nearest neighbors [2] | 1) k-Nearest Neighbor (k-NN) 2) Cluster Center and Nearest Neighbor (CANN) 3) Support Vector Machine | KDD Cup ’99 | CANN=99.76%; KNN=93.87%; SVM=80.65% | Feature representation was applied for normal connections and attacks | U2R and R2L attacks were not effectively detected by CANN
Comparison of classification techniques applied for network intrusion detection and classification [3] | 1) Best-First Tree (BFTree) 2) Naïve Bayes Decision Tree (NBTree) 3) J48 4) Random Forest Tree (RFT) 5) Multi-Layer Perceptron (MLP) 6) Naïve Bayes | NSL-KDD | BFTree=98.24%; NBTree=98.44%; J48=97.68%; RFT=98.34%; MLP=98.53%; NB=84.75% | Achieved reduction in false positive | There is need to evaluate the model on the most updated datasets
Random Forest Modeling for Network IDS [4] | Random forest (RF) based ensemble classifier | NSL-KDD | 99.67% | The model is efficient as it returns a low false alarm and high detection rate | A feature selection method like evolutionary computation needs to be applied to improve accuracy
Anomaly Detection Based on Profile Signature in Network using Machine Learning Techniques [5] | 1) Genetic Algorithm (GA) 2) Support Vector Machine (SVM) 3) Hybrid Model | KDD Cup ’99 | GA=84.0333%; SVM=94.8000%; Hybrid (GA+SVM)=98.333% | Low false positive rate | Trials on different datasets need to be done
Fast KNN Classifiers for Network Intrusion Detection System [6] | K-Nearest Neighbor (KNN) | NSL-KDD | 99.95% | High accuracy achieved | High computational time due to inability to apply feature selection
Machine Learning Based Network Intrusion Detection [7] | Equality-constrained-optimization-based Extreme learning machines (C-ELMs) | NSL-KDD | 98.82% | Improved detection rate and computational speed | The work needs to be carried out on different datasets
Intrusion detection in computer networks using hybrid machine learning techniques [8] | Hybrid model of supervised (Neural Network (NN), Support Vector Machine (SVM)) and unsupervised (K-Means) learning algorithms | NSL-KDD | SVM+K-Means=96.81%; NN+K-Means=95.55% | Combination of supervised and unsupervised machine learning algorithms complement each other in improving IDS performance | Similar approach needs to be applied on the most updated datasets
Machine Learning Methods for Network Intrusions [9] | 1) J48 Trees 2) Multilayer Perceptron (MLP) 3) Bayes Network | KDD Cup ’99 | J48=93.1083%; MLP=91.9017%; Bayes Network=90.7317% | Addressed the issue of accuracy in detecting low frequent attacks | Feature selection was not applied
Evaluation of Machine Learning Techniques for Network Intrusion Detection [10] | 1) K-Means 2) K-Nearest Neighbor (KNN) 3) Fuzzy C-Means (FCM) 4) Support Vector Machine (SVM) 5) Naïve Bayes (NB) 6) Radial Basis Function (RBF) 7) Ensemble comprising the six classifiers | Kyoto2006+ | RBF=97.54%; KNN=97.54%; Ensemble=96.72%; NB=96.72%; SVM=94.26%; FCM=83.60%; K-Means=83.60% | A more updated and promising dataset in Kyoto2006+ was used | Recall of the result is quite low
Real Time Hybrid Intrusion Detection System [11] | Hybrid approach that comprises 1) Frequency Episode Extraction 2) Chi-Square Analysis | KDD Cup ’99 | True Positive (TP)=92.65% | The hybrid approach used helped in achieving a high detection rate | The model showed slow detection rate when it was applied on a big size data
Network Intrusion Detection using Clustering and Gradient boosting [12] | 1) Extreme Gradient Boosting (XGBoost) 2) Adaptive Boosting (AdaBoost) | NSL-KDD | XGBoost with Clustering=84.253%; XGBoost without Clustering=80.238%; AdaBoost with Clustering=82.011%; AdaBoost without Clustering=80.731% | The work showed that anomaly detection has room for improving its false positive rate | The ensemble of the classifiers used needs to be evaluated on the most updated datasets that contain recent attacks
Network Intrusion Detection using Supervised Machine Learning Technique with feature selection [13] | 1) Artificial Neural Network (ANN) 2) Support Vector Machine (SVM) | NSL-KDD | ANN=94.02% | High accuracy was achieved due to the application of feature selection | Inability of the work to address the issue of zero day attack due to high false positive rate
Building an Efficient Intrusion Detection System [14] | 1) Correlation based feature selection (CFS-BA) 2) Ensemble approach that comprises C4.5, Random Forest (RF) and Forest by Penalizing Attributes (Forest PA) | 1) NSL-KDD 2) AWID 3) CIC-IDS2017 | Ensemble (NSL-KDD)=99.80%; Ensemble (AWID)=99.50%; Ensemble (CIC-IDS2017)=99.90% | The model was evaluated on three different datasets and returns with an improved efficiency and high detection rate | False positive was observed in CIC-IDS2017 dataset
A Feed-Forward ANN and Pattern Recognition ANN Model for Network Intrusion Detection [15] | 1) Feed forward Neural Network (FFANN) 2) Pattern Recognition Neural Network (PRANN) | NSL-KDD | FFANN=98.0792%; PRANN=96.6225% | The work showed that combining multiple classifiers complement each other in improving performance | The model needs to be evaluated on different datasets to improve its efficiency
Anomaly Based Network Intrusion Detection using Ensemble Machine Learning Technique [16] | 1) Decision Tree 2) Bayes Classifier 3) RNN-LSTM 4) Random Forest 5) Ensemble of the 4 classifiers | NSL-KDD | Ensemble=85.20% | The work handled imbalanced data and selected only required features, which greatly helped in reducing high false positive rate | Trial on the most updated datasets needs to be carried out
Network Intrusion Detection System using Random Forest and Decision Tree Machine Learning Techniques [17] | 1) Random Forest (RF) 2) Decision Tree (DT) | NSL-KDD | RF=95.323%; DT=81.868% | Easily implemented | Slow detection rate and high false positive
Detecting Intrusions in Computer Network Traffic with Machine Learning Approaches [18] | 1) Single Machine Learning Classifier (K-Nearest Neighbor (KNN)) 2) Ensemble Technique (Random Committee (RC)) | 1) NSL-KDD 2) UNSW NB-15 | NSL-KDD using KNN=98.727%; NSL-KDD using RC=99.696%; UNSW NB-15 using KNN=97.3346%; UNSW NB-15 using RC=98.955% | Ensemble approach generates better accuracy than single classifiers. The model was evaluated using two different datasets | Fails to address the problem of data high dimensionality
Implementation of Machine Learning Algorithms for Detection of Network Intrusion [19] | 1) Decision Tree (DT) 2) Logistic Regression (LR) 3) Random Forest (RF) 4) Support Vector Machine (SVM) | NSL-KDD | RF=73.784%; DT=72.303%; SVM=71.779%; LR=68.674% | Showed that working with random forest in building IDS saves execution time | The model performs efficiently only with single classifier
A Stacking Ensemble for NIDS using Heterogeneous Datasets [20] | Stacking Ensemble technique that comprises KNN, LR, RF and SVM | 1) UNSW NB-15 2) UGR ’16 | UNSW NB-15=94.00%; UGR ’16=98.71% | Boosted prediction accuracy and detection speed was observed | The work needs to be evaluated on multiple datasets
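
The accuracy, recall and false positive figures that the reviewed works report all derive from confusion-matrix counts. The following sketch, which assumes a binary attack/normal labelling and made-up predictions, shows how these measures relate to each other; it is not the evaluation code of any reviewed paper.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for a binary labelling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive the usual IDS measures (assumes both classes are present)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": tp / (tp + fn),                # also called detection rate
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (tp + fn),   # equals 1 - recall
    }


# Ten hypothetical connections: 4 real attacks, 6 normal.
y_true = ["attack"] * 4 + ["normal"] * 6
y_pred = ["attack", "attack", "attack", "normal",
          "normal", "normal", "normal", "normal", "attack", "normal"]

counts = confusion_counts(y_true, y_pred)
print(counts)            # (3, 1, 5, 1)
print(metrics(*counts))  # accuracy 0.8, recall 0.75, FNR 0.25, FPR 1/6
```

The example makes explicit why a low recall implies a high false negative rate, and why accuracy alone can hide a poor detection rate on imbalanced data.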
Other datasets include AWID, an acronym for Aegean Wi-Fi Intrusion Dataset, which consists of real traces of both normal and malicious data obtained from a real network environment. AWID became publicly available in 2015 as a collection of sets of Wi-Fi network data (Zhou et al, 2019). The CIC-IDS2017 dataset contains normal traffic and recent common attacks. It was established by the Canadian Institute for Cybersecurity (CIC) in 2017. It is one of the newest intrusion detection datasets. It is made up of 2,830,743 records distributed over 8 different files, and each record has 78 different labelled features [14]. The UNSW NB-15 dataset was established by a cyber-security research group at the Australian Centre for Cyber Security. The acronym UNSW in UNSW NB-15 stands for University of New South Wales. The dataset has a total of 47 features with two class labels [20].

Table 2 and Fig 4 show how the datasets have been adopted by the studied research articles over the years. NSL-KDD has been used a total of 14 times, which makes 58.33% of the total dataset usage. It was followed by the original KDD Cup 99 dataset, which was used 4 times. UNSW NB-15 was used twice, whereas each of Kyoto2006+, AWID, CIC-IDS2017 and UGR ’16 has been used once.

TABLE 2: Distribution of Dataset Usage over the Years
Year | KDD Cup 99 | NSL-KDD | CIC-IDS 2017 | UNSW NB-15 | UGR ’16
2015 | 1 | 1 | 0 | 0 | 0
2016 | 1 | 2 | 0 | 0 | 0
2017 | 0 | 3 | 0 | 0 | 0
2018 | 2 | 1 | 1 | 0 | 0
2019 | 0 | 3 | 0 | 1 | 0
2020 | 0 | 4 | 0 | 0 | 2

Fig 3: Comparison of Classifiers in terms of Accuracy

Fig 4: Distribution of dataset usage over the years

IV. DISCUSSION AND FUTURE WORK
As shown in Fig 3, ensemble and hybrid classifiers have better predictive accuracy and detection rate than single classifiers. For future research work, the following issues have been identified and need more consideration in order to improve the performance of intrusion detection systems:

Single machine learning classifiers perform better when they are combined in a specific manner, therefore hybrid and ensemble machine learning classification techniques need to be used more often.

Some classifiers perform better on specific datasets. In future research, more models need to be developed in such a way that they can perform efficiently on multiple datasets.

A few of the studied research articles did not apply feature selection before the classification stage, whereas others adopted feature selection approaches. Feature selection has to be considered in future research in order to get rid of irrelevant, unwanted and redundant features and so improve the efficiency and detection rate of IDS.

As discussed in Section III, NSL-KDD accounts for 58.33% of the dataset usage in the studied research articles. More recently updated datasets need to be used in future research in order to deal with the most recent malicious intrusions and attacks.

V. CONCLUSION
The emergence of machine learning presents new techniques for intrusion detection systems, in which various types of classifiers have been adopted by researchers and scholars in building intrusion detection system models. This paper presented various research papers, published from 2015 to 2020, on using machine learning classifiers in intrusion detection systems. Among the various models applied in the studied research papers, ensemble and hybrid classifiers have been able to surpass their single classifier counterparts and hence have the better predictive accuracy and detection rate.

REFERENCES
[1] D.P. Gaikwad and Ravindra C. Thool. (2015). Intrusion detection system using bagging with partial decision tree base classifier. Procedia Computer Science 49 (pp. 92-98). Elsevier.

[2] W.-C. Lin, Shih-Wen K., Chih-Fong T. (2015). Intrusion detection
system based on combining cluster centers and nearest neighbors.
Knowledge-Based Systems 78 (pp. 13-21). Elsevier.
[3] A.S.A. Aziz. (2016). Comparison of classification techniques applied
for network intrusion detection and classification. Journal of Applied
Logic 24. Elsevier, 109-118.
[4] Nabila Farnaaz and M.A Jabbar. (2016). Random Forest Modeling for
Network Intrusion Detection System. International Multi-conference
on information processing (IMCIP) 12 (pp. 213-217). Elsevier.
[5] Kayvan A. Saadiah Y. Amirali R. and Hazyanti S. (2016). Anomaly
Detection Based on Profile Signature in Network using Machine
Learning Techniques. IEEE TENSYMP. (pp. 71-76). IEEE.
[6] Bobba Brao and Kailasam Swathi. (2017). Fast KNN Classifiers for
Network Intrusion Detection System. Indian Journal of Science and
Technology. 10(14). Researchgate. (1-10).
[7] Chie-Hong L. Yann-Yean S. Yu-Chun Lin and Shie-Jue L. (2017).
Machine Learning Based Network Intrusion Detection. 2nd IEEE
International Conference on Computational Intelligence and
Applications. (pp. 79-83). IEEE.
[8] Deyban P. Miguel A. A, David P. A, and Eugenio S. (2017). Intrusion
detection in computer networks using hybrid machine learning
techniques. XLIII Latin American Computer Conference (CLEI). (pp.
1-10). IEEE.
[9] Alkasassbeh and Almseidin. (2018). Machine Learning Methods for
Network Intrusions. International Conference on Computing,
Communication (ICCCNT). Arxiv.
[10] Marzia Z. and Chung-Horng L. (2018). Evaluation of Machine
Learning Techniques for Network Intrusion Detection. IEEE. (pp. 1-5).
[11] Dutt I. et al. (2018). Real Time Hybrid Intrusion Detection System.
International Conference on Communication, Devices and
Networking (ICCDN). (pp. 885-894). Springer.
[12] Verma P, Shadab K, Shayan A. and Sunil B. (2018). Network
Intrusion Detection using Clustering and Gradient Boosting.
International Conference on Computing, Communication and
Networking Technologies (ICCCNT). (pp. 1-7). IEEE.
[13] Kazi A., Billal M. and Mahbubur R. (2019). Network Intrusion
Detection using Supervised Machine Learning Technique with feature
selection. International Conference on Robotics, Electrical and Signal
Processing Techniques (ICREST). (pp. 643-646). IEEE.
[14] Yuyang Z., Guang C., Shanqing J. and Mian D. (2019). Building an
Efficient Intrusion Detection System Based on Feature Selection and
Ensemble Classifier. Computer Networks. DOI:
https://doi.org/10.1016/j.comnet.2020.107247
[15] Iqbal and Aftab. (2019). A Feed-Forward ANN and Pattern
Recognition ANN Model for Network Intrusion Detection.
International Journal of Computer Network and Information Security,
4. Researchgate (19-25).
[16] Vinoth Y. and Kamatchi K. (2020). Anomaly Based Network
Intrusion Detection using Ensemble Machine Learning Technique.
International Journal of Research in Engineering, Science and
Management. IJRESM. (290-296).
[17] Bhavani T. T, Kameswara M. R and Manohar A. R. (2020). Network
Intrusion Detection System using Random Forest and Decision Tree
Machine Learning Techniques. International Conference on
Sustainable Technologies for Computational Intelligence (ICSTCI).
(pp. 637-643). Springer.
[18] Maniriho et al. (2020). Detecting Intrusions in Computer Network
Traffic with Machine Learning Approaches. International Journal of
Intelligent Engineering and Systems. INASS. (433-445).
[19] Ponthapalli R. et al. (2020). Implementation of Machine Learning
Algorithms for Detection of Network Intrusion. International Journal
of Computer Science Trends and Technology (IJCST). (163-169).
[20] Rajagopal S., Poornima P. K. and Katiganere S. H. (2020). A
Stacking Ensemble for Network Intrusion Detection using
Heterogeneous Datasets. Journal of Security and Communication
Networks. Hindawi. (1-9).
[21] Bolon-C.V. (2012). Feature Selection and Classification in Multiple
class datasets-An application to KDD Cup 99 dataset.
https://doi.org/10.1016/j.eswa.2010.11.028
[22] Farah N. H et al. (2015). Application of Machine Learning
Approaches in Intrusions Detection Systems. International Journal of
Advanced Research in Artificial Intelligence. IJARAI. (9-18).
[23] N. F. Haq et al. (2015). An Ensemble framework for anomaly
detection using hybridized feature selection approach (HFSA).
Intelligent System Conference. (pp. 989-995). IEEE.
[24] S. Thapa and A.D Mailewa (2020). The Role of Intrusion
Detection/Prevention Systems in Modern Computer Networks: A
Review. Conference: Midwest Instruction and Computing
Symposium (MICS). Wisconsin, USA. Volume: 53. (pp. 1-14).
