Abstract—In recent years, cyber-attacks have become a significant issue in cybersecurity. Researchers have worked on intrusion detection systems for the last few decades and numerous methodologies have been developed. Yet these methodologies will not be adequate for the intrusion detection systems of the coming days, so in light of advances in technology, the current systems have to be replaced with new ones. In this paper, ensemble learning strategies are examined for the intrusion detection system: the bagging and boosting methods Distributed Random Forest (DRF), Gradient Boosting Machine (GBM), and XGBoost are implemented using the Python library H2O. A Deep Neural Network (DNN) is likewise implemented using the H2O library, and our model beats the previous DNN result after applying the genetic algorithm for feature selection. Our results also outperform many classical machine learning models.

Keywords— cybersecurity, intrusion detection, machine learning, distributed random forest, gradient boosting machine, xgboost, deep neural network, ensemble methods, h2o

I. INTRODUCTION

In 2018, several cyber-attacks drew everyone's attention to the flaws of cyber security. These attacks make it clear that existing intrusion detection systems are not enough to prevent them. In the present era, people are ever closer to devices connected through the internet, and the threat of such attacks has grown accordingly. To recognize these attacks, researchers have been working for many years. Web traffic becomes abnormal at the time of a cyber-attack, a pattern that is rarely seen otherwise; researchers have treated such traffic as anomalous and have developed many anomaly detection methods. Initially, rule-based intrusion detection techniques were used; later, data mining, machine learning, and deep learning methods were used as well. In the literature, attacks are categorized into four groups, namely Denial of Service (DoS), User to Root (U2R), Probe (Probing), and Remote to Local (R2L). In machine learning, anomaly detection can be seen in two ways: web traffic can be classified as either normal or abnormal, which is a binary classification problem, or the four types of abnormal web traffic can be identified, which is a multiclass classification problem. In this paper, only binary classification is discussed. Several datasets for intrusion detection systems can be used as benchmarks, but models trained on these old datasets fail to detect current anomalous web traffic, so the data must be updated after a certain interval of time. Researchers have used many machine learning algorithms to create intrusion detection systems; classical algorithms such as the Support Vector Machine (SVM), Decision Tree (DT), and K-Nearest Neighbors (KNN) have been used so far, and feature selection techniques such as filter, wrapper, and embedded methods are used to improve these models. In this paper, some machine learning algorithms that have become very popular in recent years are used. Ensemble models are applied frequently to real-world problems and often outperform all other models. Many feature selection techniques have been used for better results; researchers have used bio-inspired algorithms to find the best combination of features for the intrusion detection system. Here, a special kind of metaheuristic optimization technique called the genetic algorithm is used for feature selection. This paper focuses on ensemble methods and neural networks, which in many cases work better than other machine learning models. In the literature, these models combined with the genetic algorithm have not been used for intrusion detection systems. Moreover, our results are better than previous deep neural network results after using the genetic algorithm for feature selection.

II. RELATED WORK

In earlier studies, machine learning models such as the Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), and Random Forest, hybrid algorithms, and, in deep learning, the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Deep Belief Network (DBN) have been implemented for the intrusion detection system. Many researchers implement similar models, so only a few are mentioned here. The Support Vector Machine (SVM) performs well for almost all researchers. Kotpalliwar and Wajgi [7] used an SVM on a portion of the KDD Cup 99 dataset and obtained 99.9% accuracy on the validation data. Even Saxena and
Proceedings of the Fourth International Conference on Trends in Electronics and Informatics (ICOEI 2020)
IEEE Xplore Part Number: CFP20J32-ART; ISBN: 978-1-7281-5518-0
Richariya [24] used an SVM with Particle Swarm Optimization (PSO) for feature selection and obtained quite good results. K-Nearest Neighbor (KNN) and K-means were implemented by Sharifi et al. [19], who reported around 90% accuracy. A decision tree was used by Ingre et al. [20], who applied Correlation Feature Selection (CFS) to extract 14 features. C4.5 was used by Rai and Devi [22] with 79.52% accuracy. Unsupervised deep learning models such as a Deep Belief Network (DBN) with Logistic Regression were used by Alrawashdeh and Purdy [21], who reported 97.9% accuracy on a portion of the KDD Cup 99 dataset. Yin et al. [23] achieved 83.28% accuracy on the validation dataset after implementing a Recurrent Neural Network (RNN) on the NSL-KDD dataset. Santhosh and Arghir-Nicolae [9] used the H2O library for the first time in intrusion detection to implement a deep neural network (DNN) and obtained 83% accuracy on the NSL-KDD dataset. Beyond the models mentioned above, many researchers have combined models to improve on previous results. Trying different machine learning algorithms alone is not sufficient to improve models; data pre-processing and feature engineering make a crucial difference in performance. Researchers have therefore tried many feature selection techniques, including not only the most common ones but also bio-inspired algorithms. Aghdam and Kabiri [6] proposed a feature selection method based on ant colony optimization for the intrusion detection system. Another bio-inspired algorithm, the Whale Optimization Algorithm (WOA), was applied by Sukriti and Batla on the same NSL-KDD dataset. Other algorithms, such as firefly, particle swarm optimization, Eigenvector Centrality, and the genetic algorithm, have also been used by researchers previously.

III. METHODOLOGY

A problem with the Decision Tree algorithm is that it is prone to overfitting. To overcome this, more than one tree can be used in an algorithm. Algorithms that use many trees include Distributed Random Forest (DRF), Gradient Boosting Machine (GBM), and XGBoost. Another algorithm, the Deep Neural Network (DNN), is well known for extracting hidden features from data. This work uses these algorithms and studies their performance.

A. Dataset

The very popular NSL-KDD dataset has been used for our experiment; it has served many researchers as a benchmark. The KDD Cup 1999 dataset, based on the DARPA 1998 dataset, is one of the most widely used benchmarks for intrusion detection, but it suffers from inherent redundant records. In 2009, the new NSL-KDD dataset was generated without the problems of the KDD Cup 1999 dataset. NSL-KDD comprises three parts, namely KDDTrain+, KDDTest+, and KDDTest-21. KDDTrain+ is the training dataset, whereas the remaining two are validation datasets; KDDTest-21 is more complex than KDDTest+. The dataset has a total of 41 features and a target outcome with the values normal and anomaly. Among all features, 3 are categorical whereas the remaining are continuous.

Figure 1. Flow chart for the proposed methodology

B. Data pre-processing

Before modeling, the very prominent step of data pre-processing is required. No missing values were found in the data, but some features were categorical; therefore the label and the categorical features were first label-encoded and then one-hot encoded, giving a new feature matrix with 122 features. Since the features are not on the same scale, feature scaling was also required. There are many ways of feature scaling, but standardization was used here; after standardization, each feature has mean 0 and standard deviation 1. The following formula is used for standardization:

z = (x − μ) / σ …. (1)

where x is the input value, μ the mean, and σ the standard deviation.

One of the biggest advantages of tree-based models is that they are not sensitive to missing values and outliers, so these are not discussed further.
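The pre-processing steps above (one-hot encoding of the categorical features, then standardization per Equation (1)) can be sketched in plain Python; the three-row toy table below is illustrative only, not the NSL-KDD data.

```python
from statistics import mean, pstdev

# Toy rows: (protocol_type, duration) -- stand-ins for NSL-KDD features.
rows = [("tcp", 2.0), ("udp", 8.0), ("tcp", 5.0)]

# One-hot encode the categorical column.
categories = sorted({r[0] for r in rows})          # ['tcp', 'udp']
encoded = [[1.0 if r[0] == c else 0.0 for c in categories] + [r[1]]
           for r in rows]

# Standardize each column: z = (x - mu) / sigma  (Equation 1).
def standardize(col):
    mu, sigma = mean(col), pstdev(col)
    return [(x - mu) / sigma if sigma else 0.0 for x in col]

columns = list(zip(*encoded))
scaled = list(zip(*[standardize(list(c)) for c in columns]))

# After scaling, every column has mean 0 and standard deviation 1.
duration = [row[-1] for row in scaled]
print(round(mean(duration), 6))   # 0.0
```

On the real data this yields the 122-column standardized matrix described above; a library scaler would typically replace the hand-written helper.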
C. Feature selection

One way to reduce overfitting is to remove low-importance features. Since there are 122 features, some can be removed so that model complexity is reduced and the chances of overfitting decrease as well. Among the many types of metaheuristic algorithms, the genetic algorithm is used here for feature selection; after applying it, only 43 features remain. A genetic algorithm is a search-based metaheuristic optimization technique inspired by nature and based on Charles Darwin's theory of natural evolution. It involves the terms initial population, fitness value, selection, crossover, offspring, and mutation. The initial population contains a fixed number of individuals, also known as chromosomes; chromosomes are sets of genes arranged in an array. So the initial population is a set of chromosomes, and a chromosome is a set of genes, where each gene holds a binary value, either 0 or 1.

Figure 2. The internal architecture of the genetic algorithm

In the genetic algorithm, possible solutions to a problem are taken first; these solutions are the chromosomes. A fitness function is defined so that the chromosomes are evaluated and each individual receives a fitness score. The fitness scores are used to rank the individuals, and the two fittest individuals are selected for the next step; these two are called parents. In the next step, a point is randomly chosen where genes are interchanged between chromosomes: if point 2 is chosen, the first and second genes of individual 1 exchange their binary values with those of individual 2. This step is known as crossover. The offspring are the two new individuals formed by exchanging genes between individual 1 and individual 2, and these new individuals 3 and 4 are added to the population. The final step is mutation, where binary values are flipped from 0 to 1 or vice versa to maintain diversity in the population. The process stops when new offspring are no different from the previous ones; the individuals remaining at the end are the solution.

Figure 3. Fitness value over different iterations

For feature selection in particular, a chromosome is simply a set of features in which 1 means the variable is taken and 0 means it is not. The population size was kept at 50, meaning there are 50 combinations of variables in total. Since the genetic algorithm is computationally expensive, only 10 iterations were used. In the plot of Figure 3, green shows the best fitness value and the dotted blue line the mean fitness value. The original author's fitness function has been used here as well: individual 1 holds the values 1 and 0, where 1 represents a variable's presence and 0 its absence. In the first iteration, individual 1 is used for modeling with a random forest; after that iteration, the ROC of the random forest divided by the total number of features gives the fitness value for individual 1, and similarly for the other individuals. Selection is thus based on the ROC value. The subsequent steps were then executed, yielding the best set of 43 features out of 122. The genetic algorithm for feature selection was implemented in the R language.

D. Ensemble Learning

Due to the complexity of data, classical machine learning techniques are not always sufficient for modeling; in many cases they overfit, depending on the data, so ensemble techniques can be the solution. Ensemble methods can be categorized into basic and advanced methods. Max voting, averaging, and weighted averaging are basic ensemble methods, in which different algorithms are trained on the data and their results are combined into a more powerful model. Advanced ensemble techniques are categorized into stacking, bagging, boosting, and blending, and these groups can be divided further. Here, bagging and boosting techniques are discussed for modeling.
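The basic max-voting method mentioned above can be sketched in a few lines; the model names and predictions below are hypothetical, standing in for trained DRF, GBM, and XGBoost classifiers.

```python
from collections import Counter

# Hypothetical per-model predictions for five test flows
# (0 = normal, 1 = anomaly).
predictions = {
    "drf":     [0, 1, 1, 0, 1],
    "gbm":     [0, 1, 0, 0, 1],
    "xgboost": [1, 1, 0, 0, 1],
}

def max_vote(preds):
    """Majority vote across models for each sample."""
    per_sample = zip(*preds.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]

print(max_vote(predictions))   # [0, 1, 0, 0, 1]
```

With an odd number of binary classifiers, as here, ties cannot occur; averaging and weighted averaging replace the vote count with a (weighted) mean of predicted probabilities.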
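Returning to the feature selection stage of Section III-C, the population–selection–crossover–mutation loop can be sketched in Python. The paper's implementation is in R and its fitness is the random forest ROC divided by the number of selected features; the synthetic fitness below is a stand-in so the sketch runs without the NSL-KDD data, and the constants (12 features, population 10, the USEFUL set) are illustrative only — the paper uses 122 features and a population of 50.

```python
import random

random.seed(0)
N_FEATURES, POP_SIZE, N_ITER = 12, 10, 10

# Stand-in fitness: a real run trains a random forest per chromosome and
# uses ROC / number-of-selected-features. Here a synthetic score rewards
# a fixed "useful" subset while penalizing large subsets.
USEFUL = {0, 3, 5, 7}

def fitness(chrom):
    selected = [i for i, g in enumerate(chrom) if g]
    if not selected:
        return 0.0
    score = len(USEFUL & set(selected)) / len(USEFUL)   # proxy for ROC
    return score / len(selected)                        # fewer features is better

def random_chrom():
    # A chromosome: one binary gene per feature (1 = feature taken).
    return [random.randint(0, 1) for _ in range(N_FEATURES)]

population = [random_chrom() for _ in range(POP_SIZE)]
for _ in range(N_ITER):
    population.sort(key=fitness, reverse=True)
    p1, p2 = population[0], population[1]          # selection: two fittest parents
    point = random.randrange(1, N_FEATURES)        # single-point crossover
    child1 = p1[:point] + p2[point:]
    child2 = p2[:point] + p1[point:]
    for child in (child1, child2):                 # mutation: flip one gene
        i = random.randrange(N_FEATURES)
        child[i] = 1 - child[i]
    population[-2:] = [child1, child2]             # offspring replace the weakest

best = max(population, key=fitness)
print("selected features:", [i for i, g in enumerate(best) if g])
```

The surviving 1-genes of the best chromosome index the selected features, mirroring how the paper arrives at 43 of 122 features.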