


Intrusion Detection Using Big Data and Deep Learning Techniques

Osama Faker
Cankaya University, Ankara, Turkey
osamafakre@gmail.com

Erdogan Dogdu
Cankaya University, Ankara, Turkey; Georgia State University (adjunct)
edogdu@cankaya.edu.tr

ABSTRACT

In this paper, big data and deep learning techniques are integrated to improve the performance of intrusion detection systems. Three classifiers are used to classify network traffic datasets: a Deep Feed-Forward Neural Network (DNN) and two ensemble techniques, Random Forest (RF) and Gradient Boosting Tree (GBT). To select the most relevant attributes from the datasets, we use a homogeneity metric to evaluate features. Two recently published datasets, UNSW-NB15 and CICIDS2017, are used to evaluate the proposed method, and 5-fold cross validation is used to evaluate the machine learning models. We implemented the method using the distributed computing environment Apache Spark, integrated with the Keras Deep Learning Library for the deep learning technique, while the ensemble techniques are implemented using the Apache Spark Machine Learning Library. The results show high accuracy with DNN for binary and multiclass classification on the UNSW-NB15 dataset, with accuracies of 99.16% for binary classification and 97.01% for multiclass classification. The GBT classifier achieved the best accuracy for binary classification with the CICIDS2017 dataset at 99.99%, while for multiclass classification DNN has the highest accuracy with 99.56%.

CCS CONCEPTS

• Security and privacy → Intrusion/anomaly detection and malware mitigation; • Computing methodologies → Supervised learning by classification.

KEYWORDS

Intrusion detection system, big data, machine learning, artificial neural networks, deep learning, ensemble techniques, feature selection.

ACM Reference Format:
Osama Faker and Erdogan Dogdu. 2019. Intrusion Detection Using Big Data and Deep Learning Techniques. In 2019 ACM Southeast Conference (ACMSE 2019), April 18-20, 2019, Kennesaw, GA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3299815.3314439

1 INTRODUCTION

Providing protection and privacy of big data is one of the most important challenges facing developers of security management systems, especially with the large expansion of the use of Internet networks and the rapid growth of the volume of data generated from several sources. This expansion and growth have given more space to hackers to launch their malicious attacks and to use more advanced techniques and tools for intrusion. On the other hand, researchers and developers of intrusion detection systems seek to increase the efficiency of malicious attack detection and the prediction of early attacks. Intrusion detection systems are among the most important systems used in cyber security. Intrusion refers to attempts to compromise the confidentiality, integrity, or availability of the security mechanisms of computer or network resources, or to bypass them. Intrusion detection systems (IDSs) are the hardware or software that monitors and analyzes data flowing through computers and networks to detect security breaches that threaten the confidentiality, integrity, or availability of a system's resources [9]. Intrusion detection systems use two basic methods to analyze events and detect attacks: misuse detection and anomaly detection. Misuse detection (or signature-based detection) is an analysis of system activities that searches for patterns of attacks identical or similar to previously known attack patterns stored in the intrusion detection system's database. Anomaly detection is the detection of unusual patterns of behavior in network traffic; it relies on building models that represent the normal behavior of users, hosts, or the network, so that patterns of behavior deviating from these models can be detected, often representing abnormal behavior. The anomaly detection approach is based on machine learning, artificial neural networks, and deep learning techniques, which have been widely used recently in the development of intrusion detection systems for mining and extracting knowledge through the training and testing of datasets [5].

Recently, big data has been used in intrusion detection. Big data is data that is difficult to store, manage, or manipulate using traditional techniques. The characteristics of big data include volume, variety, and velocity [35], and they represent a major challenge for intrusion detection systems [28]. Volume refers to the quantity of data: data generated from several different sources has exploded dramatically over recent years, which requires the monitoring and analysis of network traffic to be integrated with the management and processing of big data. The large volume of data is often associated with another challenge, variety, meaning different data sources and therefore various data types, including structured, semi-structured, and unstructured data. Moreover, variety refers to heterogeneous data. Large IT infrastructures can generate huge quantities of data from many resources,
such as application servers, networks, and workstations. Analyzing and monitoring heterogeneous data is a complex challenge and exacerbates the problems facing intrusion detection systems [36]. The huge change in the size and variety of data has also led to a change in the speed of data generation and streaming, referred to as velocity. Big data management and computing technologies have been created and developed over the last few years, including Hadoop [26], Apache Spark [33], Hive [31], and NoSQL [14]. Big data techniques have many advantages, such as speed in receiving, storing, and processing data of various types. This work proposes an assessment of the integration between big data management and computation and deep learning techniques, using Apache Spark and the Keras deep learning library. A deep neural network, random forest, and gradient boosted tree are used for classification, and the k-means clustering technique is used for feature selection by calculating the degree of homogeneity. These suggested approaches are applied on two recent datasets, namely CICIDS2017 and UNSW-NB15, both of which contain a set of common and updated attacks.

This paper is organized as follows: Section 2 presents the related work. Section 3 presents a brief description of the datasets and classification techniques used. Section 4 gives details of the proposed approach. Section 5 presents the results and evaluation, followed by Section 6 with a summary and discussion. Finally, the conclusion and future work are presented in Section 7.
2 RELATED WORKS

Sharafaldin et al. [24] produced a reliable CICIDS2017 dataset that contains benign traffic and seven common attack network flows. They examined the performance and accuracy of the selected features with seven common machine learning algorithms, namely K-Nearest Neighbor, Random Forest, ID3, Adaboost, Multilayer Perceptron, Naive Bayes, and Quadratic Discriminant Analysis. In addition, the new dataset was compared with publicly available datasets from 1998 to 2016 based on 11 criteria representing common errors in and criticisms of previous datasets. The comparison results show that the new dataset addresses all of these errors and criticisms.

Nour and Slay [19] proposed a statistical analysis to evaluate the UNSW-NB15 and KDD99 datasets, wherein UNSW-NB15 was divided into a test set and a training set and assessed from three aspects: statistical analysis, feature correlation, and complexity evaluation. The results show that UNSW-NB15 is considered a reliable dataset for evaluating existing and novel IDS methods. Coelho et al. [6] suggested the use of a homogeneity metric between labels and data clusters for semi-supervised feature selection. The results show that information retrieved from clusters can improve the estimation of feature relevance and feature selection tasks, especially when labeled data is scarce and unlabeled data is numerous. Vijayanand et al. [32] proposed a novel intrusion detection system with genetic-algorithm-based feature selection and multiple support vector machine classifiers. The proposed approach relies on selecting the informative features for each category of attack instead of the common features of every attack, and it was applied to the CICIDS2017 dataset. The experiments demonstrate the effectiveness of the novel approach, which achieved a high rate of accuracy in intrusion detection. Gupta and Kulariya [13] proposed a framework that combines feature selection algorithms and classification algorithms and their implementation in the big data computing environment of Apache Spark. Correlation-based feature selection and Chi-squared feature selection were used for feature selection, and logistic regression, support vector machines, random forest, gradient boosted decision trees, and Naive Bayes were used for classification. Dahiya et al. [7] suggested an intrusion detection system using Apache Spark to implement the proposed approach, relying on two feature reduction algorithms, Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA), and using seven well-known classification algorithms, namely Naive Bayes, REP Tree, Random Tree, Random Forest, Random Committee, Bagging, and Randomizable. Belouch et al. [3] evaluated the performance of a set of classification algorithms (SVM, Decision Tree, Naive Bayes, Random Forest) using Apache Spark on the complete UNSW-NB15 dataset with all 42 features and concluded that Random Forest had the best performance (97% accuracy). Primartha and Tama [20] compared the performance of IDSs by applying a random forest classifier with respect to two performance measures, namely accuracy and false alarm rate, using a 10-fold cross validation technique. Three IDS datasets (NSL-KDD, UNSW-NB15, and GPRS) were used in the experiment, and the results were compared against the Multilayer Perceptron (MLP), Decision Tree, and NB-Tree classifiers. The results of the study demonstrated the effectiveness of the proposed model based on the Random Forest classifier with parameter settings and a cross validation technique. Belouch et al. [2] presented a two-stage classifier based on the Reduced Error Pruning Tree (RepTree) algorithm. In the first stage the method decides whether the traffic is an attack, and in the second stage the type of attack is determined. In the evaluation results, this method gave better performance in speed and accuracy on the UNSW-NB15 and NSL-KDD datasets. Al-Zewairi et al. [1] proposed a deep learning (DL) model based on Artificial Neural Networks (ANN) using the back-propagation and stochastic gradient descent methods, wherein the model was evaluated as a binomial classifier for NIDSs on the UNSW-NB15 dataset. The DL model contained five hidden layers of ten neurons each, and the 10-fold cross validation technique was used. The results show that the model obtained high accuracy and a low alarm rate compared to earlier models.

3 DESCRIPTION OF DATASETS AND CLASSIFICATION TECHNIQUES

There are several datasets available to evaluate proposed techniques for developing and improving the performance of intrusion detection systems. According to most of the studies on intrusion detection systems, the KDD Cup 99 dataset (launched in 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) [29] and a refined version named NSL-KDD (https://www.unb.ca/cic/datasets/nsl.html) are used most. These datasets include four types of attacks: DOS/DDOS, Probing, U2R, and R2L. They are relatively old and cannot be relied upon to evaluate intrusion detection systems, because they do not contain new types of attacks and modern normal behaviors, especially given the great development in attack methods and the emergence of new attack types [8]. Therefore, we used two new intrusion detection datasets that appear in recent studies, namely UNSW-NB15 and CICIDS2017. In the remaining part of this section, we first present the datasets we
used, then the homogeneity metric used for feature selection, and finally the classification techniques used in this work (Deep Neural Network, Random Forest, Gradient Boosted Tree).
3.1 UNSW-NB15 Dataset

UNSW-NB15 (https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/) is one of the latest datasets created by the cyber-security research group at the Australian Centre for Cyber Security (ACCS) to evaluate IDSs, available to researchers since late 2015. The IXIA PerfectStorm testing platform (https://www.ixiacom.com/products/perfectstorm) was used to generate approximately 100 GB of normal and abnormal network traffic. 49 features were extracted from the raw pcap files using the Argus (https://qosient.com/argus/index.shtml) and Bro-IDS tools, split into five sets: flow features, basic features, content features, time features, and additional generated features. The dataset has a total of 2,540,047 records with 9 different types of recent and common attacks, namely Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. 321,283 records in the dataset are attack records, and the total number of normal records is 2,218,764; normal information packets represent 88% of the dataset, while attack information packets represent 12%.
3.2 CICIDS2017 Dataset

The CICIDS2017 dataset (http://www.unb.ca/cic/datasets/ids-2017.html) [17] was released in late 2017 by the Canadian Institute for Cybersecurity (CIC). The dataset contains benign traffic and the most current common attacks. The B-Profile system [18] is used to abstract the behavior of human interactions and generate benign, natural background traffic, and CICFlowMeter [24] is used to extract the features from the dataset. CICIDS2017 contains the most common attacks based on the 2016 McAfee report (DoS, DDoS, Web-based, Brute force, Infiltration, Heartbleed, Bot, and Scan), with more than 80 features extracted from the generated network traffic. To create the dataset, a complete network topology was used, including different operating systems (Windows, Ubuntu, Mac OS X) and network devices (modems, switches, firewalls, routers). The CICIDS2017 dataset contains 14 types of attacks, with 2,273,097 records of normal packet information (BENIGN, 80% of the dataset) and 557,646 records of attack packet information (20% of the dataset).
3.3 K-means Clustering and Homogeneity Metric

The K-means clustering algorithm is one of the most common unsupervised machine learning algorithms. It is defined as a method in which data are divided into K groups such that objects in each group share more similarity with each other than with objects in other groups [30]. K is the number of clusters and is determined by the user. After determining the number of clusters:

• the centroids are chosen randomly for each cluster;
• the distance between the centroid and the data points is measured by the Euclidean equation;
• the data points closest to the centroid are grouped;
• the mean distance between these data points is calculated and the mean is defined as a new centroid;
• the process is repeated until no data points move between the clusters.

Homogeneity, on the other hand, is a clustering metric that is used to determine the homogeneity of the data points in a single cluster: a cluster should contain only data points that are members of a single class, which means that the cluster entropy should be zero (entropy refers to randomness or unpredictability). Homogeneity is defined as [22]:

h = 1 - \frac{H(C|K)}{H(C)}

where H(C|K) is the conditional entropy of the classes given the cluster assignments, calculated as

H(C|K) = -\sum_{k=1}^{|K|} \sum_{c=1}^{|C|} \frac{n_{c,k}}{n} \log \frac{n_{c,k}}{n_k}

and H(C) is the entropy of the classes, given by

H(C) = -\sum_{c=1}^{|C|} \frac{n_c}{n} \log \frac{n_c}{n}

where n is the number of data points, n_c is the number of data points belonging to class c, n_k is the number of data points belonging to cluster k, and n_{c,k} is the number of data points from class c assigned to cluster k.
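For illustration, the homogeneity computation defined above can be written in a few lines of NumPy. The function below is our own minimal sketch, not the Spark code used in the experiments; scikit-learn's homogeneity_score computes the same quantity and is shown as a cross-check.

import numpy as np
from sklearn.metrics import homogeneity_score

def homogeneity(classes, clusters):
    # Compute h = 1 - H(C|K) / H(C) from class labels and cluster assignments.
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    n = classes.size
    # Entropy of the class distribution, H(C).
    _, n_c = np.unique(classes, return_counts=True)
    h_c = -np.sum((n_c / n) * np.log(n_c / n))
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous
    # Conditional entropy of classes given clusters, H(C|K).
    h_ck = 0.0
    for k in np.unique(clusters):
        members = classes[clusters == k]
        n_k = members.size
        _, n_ck = np.unique(members, return_counts=True)
        h_ck -= np.sum((n_ck / n) * np.log(n_ck / n_k))
    return 1.0 - h_ck / h_c

y_class = [0, 0, 1, 1]
y_cluster = [0, 0, 1, 2]
print(homogeneity(y_class, y_cluster))        # 1.0: each cluster holds one class
print(homogeneity_score(y_class, y_cluster))  # identical value from scikit-learn

Note that splitting a class across clusters does not hurt homogeneity; only mixing different classes inside one cluster does, which is exactly the property used for feature ranking in Section 4.2.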
3.4 Deep Neural Networks (DNN)

Neural networks [23] are a technique created to simulate the human brain for pattern recognition, and they are used in many learning tasks [27]. In general, a neural network consists of three kinds of layers: one layer for input, one for output, and at least one hidden layer between them. Input passes from the input layer through the hidden layers up to the output layer, through a set of neuron nodes in each layer, whether the relationship is linear or non-linear. The term DNN indicates that there is more than one hidden layer in the neural network. DNNs are widely used in supervised and unsupervised learning, and for classification and clustering.

Input data is passed to the hidden layers through a group of neurons in each layer, such that these neurons are connected to each other by weights, which represent the importance of an input value; neurons that are more valuable than others will have a greater impact on the next layer of neurons. Various types of Artificial Neural Networks (ANN) have been developed; the first and simplest, and still widely used, are Feed-Forward Neural Networks, which is the type proposed in this work as one of the machine learning techniques to be used. In this type of network, information is transmitted in parallel from the input layer directly through the hidden layers and then into the output layer, without cycles or loops. The proposed Deep Neural Network contains three hidden layers. Each layer is fully connected to the next layer in the network; the ReLU function is used in the hidden layers, and the sigmoid function is used in the output layer for binary classification, while the SoftMax function is used in the output layer for multiclass classification.
3.5 Random Forest (RF) and Gradient Boosted Tree (GBT)

Ensemble methods are used in machine learning to minimize noise, bias, and variance factors [34]. They aim to improve the stability and accuracy of machine learning algorithms. An ensemble method is a set of predictors that are integrated to obtain a final prediction, where several different predictions are combined to reach the target prediction. Ensemble techniques are classified as Boosting and Bagging.

Random Forest (Bagging) [4] is a supervised learning algorithm and one of the most used algorithms, as it is flexible and easy to use for classification and regression tasks. It depends on the ensemble method in that it creates a set of Decision Trees and combines them to obtain a prediction with higher accuracy and stability. RF relies on the Decision Tree algorithm to build trees: the more trees that are built, the greater the ability of RF to resist noise and increase the efficiency of classification. Furthermore, the ability of the RF algorithm to operate within distributed and parallel computing environments, processing data efficiently through the simple classifiers built by the Decision Tree, gives it the advantage of scalability and adaptation to large changes in data volume and variety [16]. The growth of a single Decision Tree within RF can be described in the following steps:

• Several random sub-sampling sets are created from the basic dataset with replacement.
• The features are selected with the same sampling approach, where a subset of the features is randomly selected from the set of overall features.
• The Decision Tree is trained on each sample.
• Decision Trees grow without pruning.
• All the Decision Trees' prediction results are integrated by simple majority voting.
Gradient Boosting Tree (Boosting) [11] is a supervised machine learning method that is widely used for classification and regression tasks. GBT is an ensemble method similar to RF; the difference lies in the creation of the predictors (Decision Trees). GBT is based on weak learners (high bias, low variance). A weak learner in the Decision Tree setting means a shallow tree: GBT starts with a shallow tree to build a predictor, then calculates the prediction errors and passes these errors to a second tree as its target. The second tree adapts the prediction model according to the errors of the first tree's model. The error is then calculated for the new predictor model and passed to a third tree, and so forth.
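This stage-wise error-correction idea can be illustrated with a toy sketch in which each shallow regression tree is fit to the residual errors of the ensemble built so far. This is an illustration of the principle only; the experiments in this paper use Spark's GBT implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel()

prediction = np.zeros_like(y)   # the ensemble starts from a zero model
learning_rate = 0.3
trees = []
for _ in range(100):
    residual = y - prediction                  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)  # a weak (shallow) learner
    tree.fit(X, residual)                      # the next tree targets the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("mean squared error:", np.mean((y - prediction) ** 2))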
4 PROPOSED APPROACH

The proposed method includes a set of steps that begins with the preprocessing of the datasets and then the selection of features, where K-means clustering (with the homogeneity metric) is used as an unsupervised feature selection technique to select the relevant features from the datasets and improve the performance of the classifiers. Five-fold cross validation is used to estimate and improve the performance of the machine learning models. A deep neural network and two ensemble techniques (RF and GBT) are used to extract the models using the relevant subsets of features.

Our proposed approach consists of three major phases: data preprocessing, feature ranking and selection, and creating and evaluating the learning models, as explained below. The evaluation code can be accessed from the project repository (https://github.com/OsamaFaker/INTRUSION-DETECTION-BIG-DATA).

4.1 Dataset Preprocessing

To provide more suitable data for the neural network classifier, the dataset is passed through a group of preprocessing operations, summarized below (a code sketch of these operations follows the list):

• Removing socket information: As the original dataset includes the IP addresses and port numbers of the source and destination hosts in the network, it is important to remove such information to provide unbiased detection; using it could result in training that over-fits to the socket information. It is more important to allow the classifier to learn from the characteristics of the packets themselves, so that any host producing similar packet information is detected regardless of its socket information.
• Removing white spaces: Some of the multi-class labels in the dataset include white spaces. Such white spaces result in spurious extra classes, as the actual value differs from the labels of other tuples in the same class.
• Label encoding: The multi-class labels in the dataset are provided as attack names, which are string values. Thus, it is important to encode these values into numerical values, so that the classifier can learn the class number to which each tuple belongs. This operation is executed on the multi-class labels only, as the binary labels are already in zero-one form.
• Data normalization: The numerical data in the dataset has widely different ranges, which poses a challenge to the classifier during training to compensate for these differences. Thus, it is important to normalize the values of each attribute so that the minimum value in each attribute is zero and the maximum is one. This provides more homogeneous values to the classifier while maintaining the relativity among the values of each attribute.
• Removing/replacing missing and infinity values: The CICIDS2017 dataset contains 2,867 tuples with missing or infinity values. This is addressed in two ways, producing two datasets. The first is a dataset without the missing or infinite values (Rem-CICIDS2017), where all missing and infinity values are removed. The second is a dataset with the infinite values replaced by the maximum value and the missing values replaced by the average values (Rep-CICIDS2017). Both datasets are used to evaluate the proposed method.
• Removing normal traffic: For multiclass classification, information packets that represent normal network traffic are ignored in both datasets, since they make up a large portion of the traffic, and only the attack information packets are used to evaluate the proposed method.
4.2 Feature Ranking and Selection Using the Homogeneity Metric

After the preprocessing phase, the K-means clustering algorithm is applied to the dataset for feature ranking. The technique used for feature ranking and selection is to take each attribute separately and use it to cluster the dataset. In binary classification (K=2), the data points are clustered into two groups, normal and anomalous; for multi-class classification, K equals the number of attack types in the dataset. Thereafter, the homogeneity score is calculated for the resulting clusters and used as a ranking score for the feature used for clustering. High ranking scores indicate that better classification can be conducted relying on those features, while lower scores indicate features that do not have a significant role in the classification. When the homogeneity rank is determined for each feature, the features are arranged in descending order from the highest rank to the lowest. The homogeneity rank value is between zero and one: zero refers to a lack of homogeneity among the data points of the feature in the clusters, and one refers to high homogeneity for the feature. Finding a unique feature that belongs to a single class is probably more effective than relying on common features in the classification process, especially with the increasing volume of heterogeneous data generated from several sources.
4.3 Applying Deep Neural Networks and Ensemble Techniques

The Deep Neural Network, Random Forest, and Gradient Boosting Tree classification algorithms are applied first on the full CICIDS2017 and UNSW-NB15 datasets, and then on the subsets of selected features from both datasets. The Deep Neural Network is a Feed-Forward Artificial Neural Network: it consists of multiple layers of nodes, and each layer is fully connected to the next layer in the network. 43 and 78 nodes are used in the input layer to represent the number of input features in the UNSW-NB15 and CICIDS2017 datasets, respectively. Three hidden layers with 128, 64, and 32 nodes, respectively, are used, with the ReLU activation function in the hidden layers. The SoftMax function is used in the output layer for multiclass classification, and the sigmoid function is used for binary classification. With backpropagation for learning the model, the number of training epochs is set to 1000 (an epoch is one pass over the full training set), and the batch size (the total number of training examples present in a single batch) is set to 1 million.
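For reference, the architecture just described can be expressed in Keras roughly as follows. This is a standalone sketch: the optimizer choice is illustrative, n_features and n_classes are placeholders, and the paper's experiments run the model through a distributed Keras setup on Spark.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_dnn(n_features, n_classes=2):
    # Three fully connected ReLU hidden layers (128/64/32), as described above.
    model = Sequential()
    model.add(Dense(128, activation="relu", input_shape=(n_features,)))
    model.add(Dense(64, activation="relu"))
    model.add(Dense(32, activation="relu"))
    if n_classes == 2:
        # Binary scenario: a sigmoid output unit.
        model.add(Dense(1, activation="sigmoid"))
        model.compile(optimizer="sgd", loss="binary_crossentropy",
                      metrics=["accuracy"])
    else:
        # Multiclass scenario: one SoftMax output unit per attack type.
        model.add(Dense(n_classes, activation="softmax"))
        model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    return model

model = build_dnn(n_features=43)  # 43 input features for UNSW-NB15
# model.fit(X_train, y_train, epochs=1000, batch_size=1_000_000)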
We tested the datasets in two scenarios: one for binary classification, in which the data points are classified as normal or attack, and a second in which all attack types are considered in a multi-class classification task.

The first scenario uses binary classification, as pointed out, in which the aim of the classification process is to identify two groups, predicting each packet's type as normal or attack traffic. The machine learning techniques are initially applied to the datasets with all features (full consideration). Then, we remove the feature with the lowest rank (the lowest homogeneity score) from the dataset, repeat the classification, and evaluate the accuracy metric. We keep removing the next lowest-ranking feature in each repetition and evaluating the classification result until only the last feature (the highest-scoring feature) remains. We applied this scenario on both datasets (UNSW-NB15 and CICIDS2017) using the algorithm explained above for all three classification algorithms. In the DNN, the output has two nodes and the sigmoid function is used; in RF, the number of trees is set to 100 and the depth of each tree is set to 4; in GBT, the log loss function is used and 100 iterations are applied.

The second scenario is multiclass classification, where the attack types are the classes. There are nine different attacks in the UNSW-NB15 dataset and fourteen different attacks in the CICIDS2017 dataset. Therefore, for deep learning, the number of nodes in the output layer is set to the number of attack types, and the SoftMax activation function is used.

To obtain high performance and reduce the bias of the machine learning techniques, k-fold cross-validation (5-fold in our experiments) is used to evaluate the machine learning models. The entire dataset is split randomly into 5 bins; each bin is then used once for testing, while the remaining bins are used for training in each iteration. The average accuracy of those evaluations is then used as the final accuracy value for each model.
The feature ranking and selection algorithm is summarized in Algorithm 1 below.

Algorithm 1: Feature ranking and selection

input : dataset D with features f_0, f_1, ..., f_{n-1} and an attack label (1..k) for each record
output: best accuracy and the set of features resulting in the best accuracy

for i <- 0 to n-1 do
    C <- Kmeans(D[f_i], k)          // K-means clustering with k = number of attack types
    HS{f_i} <- homogeneity_score(C)
end
HS' <- reverse_sort(HS)             // in descending order
D' <- D
for i <- n-1 to 1 do
    HS', D' <- remove the lowest scored feature f_i from HS' and D'
    accuracy{f_i} <- train_test_n_fold(D', 5)
end
index <- max(accuracy)
best_accuracy <- accuracy(index)
return best_accuracy, HS'(0, index)
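A compact single-machine reading of Algorithm 1, using scikit-learn in place of the Spark implementation, might look as follows; the classifier used inside the loop and all names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import homogeneity_score
from sklearn.model_selection import cross_val_score

def rank_and_select(X, y, n_clusters):
    # X: (n_samples, n_features) array; y: class labels.
    n_features = X.shape[1]
    # Score each feature alone: cluster on it, then measure cluster homogeneity.
    scores = []
    for i in range(n_features):
        clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[:, [i]])
        scores.append(homogeneity_score(y, clusters))
    order = np.argsort(scores)[::-1]          # feature indices, best first
    # Drop the worst-ranked feature one at a time, tracking 5-fold accuracy.
    results = []
    for kept in range(n_features, 0, -1):
        subset = order[:kept]
        acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                              X[:, subset], y, cv=5).mean()
        results.append((acc, subset))
    return max(results, key=lambda r: r[0])   # best accuracy and its feature set

# best_acc, best_features = rank_and_select(X, y, n_clusters=2)  # K=2 for binary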
5 RESULTS AND EVALUATION

This section presents the results of the evaluations for the two scenarios, namely the binary and multi-class classifications. The experiments are conducted on Amazon's Elastic MapReduce (EMR) cloud service with a PySpark setup. A cluster of ten nodes is used for these experiments, in which each node has a 2.4 GHz Intel Xeon E5-2676 v3 processor and 64 GB of memory. The random forest and GBT classifiers are implemented using Apache Spark's built-in machine learning library MLlib. The deep learning model is implemented using the distributed Keras library (https://keras.io).
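As a sketch of how the two ensemble classifiers can be wired up in PySpark MLlib with the parameter settings from Section 4.3 (the file path and column names are placeholders, and a simple random split stands in for the 5-fold cross validation actually used):

from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ids-classifiers").getOrCreate()
df = spark.read.csv("dataset.csv", header=True, inferSchema=True)

# Assemble the (already preprocessed) attribute columns into one feature vector.
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols,
                            outputCol="features").transform(df)
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

# Settings mirror Section 4.3: 100 trees of depth 4 for RF,
# 100 boosting iterations for GBT (Spark's GBT classifier is binary-only).
rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=100, maxDepth=4)
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=100)

rf_model = rf.fit(train)
gbt_model = gbt.fit(train)
rf_model.transform(test).select("label", "prediction").show(5)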
5.1 Binary Classification

In binary classification with the UNSW-NB15 dataset, the three classifiers achieved high accuracy rates, although the differences are marginal. Table 1 shows the accuracy results; both the full-dataset and the best feature selection results are reported, and the #ftrs column reports the number of features used in the best accuracy case. The DNN with three hidden layers is the best-performing classifier, with 99.19% accuracy, followed by the RF classifier with 98.86% accuracy. GBT provided an extremely low average prediction time per packet at 0.35 ns, compared to the DNN and RF classifiers' prediction times of 1.35 ns and 11.36 ns, respectively; this is due to the low number of features used in this case (26 features). Moreover, comparing the results of the proposed method (feature selection) with the results of the classifiers on the full dataset, there is a slight improvement in accuracy, while the prediction times improve considerably, especially with GBT, from 0.70 ns to 0.35 ns, since the number of features used drops from 49 to 26.
Table 1: Binary Classification Results with UNSW-NB15 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   99.16%      1.43                   99.19%      1.35                   41
RF    98.85%      12.13                  98.86%      11.36                  36
GBT   97.83%      0.70                   97.92%      0.35                   26
For the CICIDS2017 dataset with binary classification, we tried the two versions of the dataset: one with replacement of the missing and infinite values (Rep-CICIDS2017), and another with removal of those values (Rem-CICIDS2017). The best accuracy on Rep-CICIDS2017 is obtained with the GBT classifier at 99.97%, but the lowest prediction time is obtained by DNN with 0.05 ns (Table 2). RF, on the other hand, has the worst performance with 92.72% accuracy, using the lowest number of features (6) with the worst average prediction time (6.81 ns). There are also minimal improvements in accuracy with feature selection in comparison to the full dataset, meaning that eliminating some features helps to improve the accuracy as well as the prediction times (1.23 ns vs. 0.53 ns in the GBT case).
Table 2: Results of Binary Classification with Rep-CICIDS2017 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   97.72%      0.05                   97.73%      0.05                   59
RF    92.54%      7.71                   92.72%      6.78                   6
GBT   99.81%      1.23                   99.97%      0.53                   23
Table 3 shows the results on the Rem-CICIDS2017 dataset for binary classification. There are very small differences in the results compared to the Rep-CICIDS2017 dataset; however, GBT achieved the best performance with 99.99% accuracy.

Table 3: Binary Classification Results with Rem-CICIDS2017 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   97.71%      0.05                   97.72%      0.05                   59
RF    92.54%      7.68                   92.71%      6.81                   8
GBT   99.81%      1.24                   99.99%      0.49                   21

5.2 Multiclass Classification

For multiclass classification, the information packets that represent normal network traffic are deleted from both datasets and only the attack information packets are kept: 321,283 tuples representing packet information for 9 different types of attacks in the UNSW-NB15 dataset, and 557,646 tuples for 14 different types of attacks in the CICIDS2017 dataset. Here, we only tested the DNN and RF classifiers, since GBT does not support multiclass classification in Apache Spark.

The UNSW-NB15 dataset is tested with DNN and RF, and the results show that DNN has a much higher accuracy, 97.04%, in comparison to RF, with a much lower prediction time (4.71 ns) and a lower number (29) of features selected (Table 4). Accuracy is slightly better with the feature selection method.

Table 4: Results of Multiclass Classification with UNSW-NB15 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   97.01%      5.15                   97.04%      4.71                   29
RF    91.76%      24.85                  91.77%      23.63                  31

Using the attack packets of Rep-CICIDS2017, DNN achieved high accuracy (99.55%) in the classification of attack packets, with a very low average prediction time (0.64 ns) on the full dataset (Table 5). The DNN classifier obtained a slightly higher accuracy on the dataset with feature selection, again with a large improvement over the RF classifier (Table 5).
Table 5: Results of Multiclass Classification with Rep-CICIDS2017 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   99.55%      0.64                   99.57%      0.70                   75
RF    92.57%      7.40                   92.72%      6.22                   8

The same experiments are repeated on the Rem-CICIDS2017 dataset (Table 6). There are no significant differences in comparison to the results on the Rep-CICIDS2017 dataset, which means there is little difference between deleting and replacing the lost and infinite values. DNN again resulted in the highest accuracy (99.56%) in both the full-dataset and feature-selected cases. This is again much better than RF's accuracy levels (around 92%).
Table 6: Results of Multiclass Classification with Rem-CICIDS2017 Dataset

                  Full Dataset                  Selected Features
      Accuracy    Prediction Time (ns)   Accuracy    Prediction Time (ns)   #ftrs
DNN   99.56%      0.63                   99.56%      0.73                   67
RF    92.54%      6.98                   92.71%      6.04                   6
6 DISCUSSION

In binary classification with the UNSW-NB15 dataset, although the differences are marginal, the results show that the best accuracy is obtained by DNN at 99.16%. GBT provided an extremely low average prediction time per packet at 0.70 ns, compared to the other classifiers, which consumed 1.43 ns and 12.13 ns. When we compare the full-dataset results with the results of the feature selection approach, we see a slight improvement in both accuracy and prediction time. Using the CICIDS2017 dataset with the two methods that replace and remove the missing and infinity values, the best accuracy rate was provided by the GBT classifier at 99.81%; the DNN classifier achieved the lowest prediction time with 0.05 ns. For multi-class attack classification, the results show the superiority of DNN on both attack datasets, with 5.15 ns and 0.64 ns prediction times and 97.01% and 99.55% accuracy levels on the UNSW-NB15 and CICIDS2017 datasets, respectively. In contrast to earlier methods, the proposed approach achieved better accuracy through the use of the homogeneity metric in feature ranking and selection.

Apache Spark has greatly improved the training and prediction times of the three classifiers when compared to traditional techniques [10]. This improvement gives intrusion detection systems the ability to make decisions more efficiently in terms of blocking or allowing data to pass through a network. In addition, the integration between Apache Spark and the Keras Deep Learning Library has increased the capability of the deep learning algorithms to work more efficiently and more quickly. The slight improvement in classifier performance using the proposed feature selection approach suggests that this approach can be developed to deal more efficiently with heterogeneous data, which is one of the most significant challenges for intrusion detection systems.

As shown in Table 7, the accuracy of the classifiers used in this study is compared with previous studies on binary classification with the UNSW-NB15 dataset. Our method with feature selection has the best accuracy (99.19%), using the DNN classifier.

Table 7: Comparison of Binary Classification Accuracy with Earlier Studies Using the UNSW-NB15 Dataset

Study                        Classifier                        Acc (%)
Primartha and Tama [20]      Random Forest                     95.50
                             Multilayer Perceptron             83.50
Nour and Slay [19]           Naive Bayes                       79.50
                             Expectation-Maximization          77.20
                             Linear Regression                 83.00
Belouch et al. [2]           RepTree                           87.80
                             Naive Bayes                       80.04
                             Random Tree                       86.59
                             Decision Tree                     86.13
                             Artificial Neural Network         86.31
Al-Zewairi et al. [1]        Deep Learning                     98.99
Our Work                     Deep Neural Network               99.19
                             Random Forest                     98.86
                             Gradient Boosted Tree             97.92

Table 8 shows the comparison between the accuracy of this study and earlier studies using the UNSW-NB15 dataset for multi-class classification. Again, our feature selection method with the DNN classifier achieved the best accuracy (97.04%).

Table 8: Comparison of Multiclass Classification Accuracy with Earlier Studies Using the UNSW-NB15 Dataset

Study                           Classifier                     Acc (%)
Belouch et al. [2]              RepTree                        79.20
                                Random Tree                    76.21
                                Naive Bayes                    73.86
                                Artificial Neural Network      78.14
Gharaee and Hosseinvand [12]    Genetic + SVM                  93.25
Our Work                        Deep Neural Network            97.04
                                Random Forest                  91.77

Table 9 compares our binary classification results with feature selection on the CICIDS2017 dataset (with replacement) to earlier studies. Again, our method performs better than the earlier studies, with 99.97% accuracy using the GBT classifier.

Table 9: Comparison of Binary Classification Accuracy with Earlier Studies Using the CICIDS2017 Dataset

Study                        Classifier                          Acc (%)
Sharafaldin et al. [24]      K-Nearest Neighbors                 96.00
                             Random Forest                       98.00
                             ID3                                 98.00
                             Adaboost                            77.00
                             Multilayer Perceptron               77.00
                             Naive Bayes                         88.00
                             Quadratic Discriminant Analysis     97.00
Vijayanand et al. [32]       SVM + Genetic                       99.85
Resende and Drummond [21]    Genetic + Profiling                 92.85
Our Work                     Deep Neural Network                 97.73
                             Random Forest                       92.72
                             Gradient Boosted Tree               99.97
7 CONCLUSION AND FUTURE WORK

This paper presented a method to improve the performance of intrusion detection systems by integrating big data technologies and deep learning techniques. The UNSW-NB15 and CICIDS2017 datasets are used to evaluate the proposed approach. In our method, we used the homogeneity metric to rank and select the features. Deep Neural Network (DNN), Random Forest (RF), and Gradient Boosted Tree (GBT) classifiers are used to classify the attacks in binary and multiclass modes. All experiments are conducted on Apache Spark with the Keras Deep Learning Library; the RF and GBT classifiers are used from the Apache Spark Machine Learning Library. The results show high accuracy levels with DNN for binary and multiclass classification on the UNSW-NB15 dataset (99.19% and 97.04%, respectively) with very low prediction times. The GBT classifier achieved the best accuracy (99.99%) for binary classification using the CICIDS2017 dataset, and DNN achieved 99.57% accuracy for multiclass classification on the same dataset.

For future work, improving the performance of intrusion detection with feature selection using the homogeneity metric will be investigated with better feature selection schemes. We also did not report the performance of distributed Apache Spark processing with varying node counts in the cluster; that is also on our agenda for future work and analysis.

REFERENCES

[1] M. Al-Zewairi, S. Almajali, and A. Awajan. 2017. Experimental evaluation of a multi-layer feed-forward artificial neural network classifier for network intrusion detection system. In New Trends in Computing Sciences (ICTCS 2017), Amman, Jordan. IEEE.
[2] M. Belouch, S. El Hadaj, and M. Idhammad. 2017. A two-stage classifier approach using RepTree algorithm for network intrusion detection. International Journal of Advanced Computer Science and Applications (IJACSA) 8(6): 389-394.
[3] M. Belouch, S. El Hadaj, and M. Idhammad. 2018. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Computer Science 127: 1-6.
[4] L. Breiman. 2001. Random forests. Machine Learning 45(1): 5-32.
[5] V. Chandola, A. Banerjee, and V. Kumar. 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41(3): 15.
[6] F. Coelho, A. Braga, and M. Verleysen. 2012. Cluster homogeneity as a semi-supervised principle for feature selection using mutual information. In ESANN 2012, Bruges, Belgium.
[7] P. Dahiya and D. Srivastava. 2018. Network intrusion detection in big dataset using Spark. Procedia Computer Science 132: 253-262.
[8] L. Dhanabal and S. P. Shantharajah. 2015. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. International Journal of Advanced Research in Computer and Communication Engineering 4(6): 446-452.
[9] R. Di Pietro and L. V. Mancini, eds. 2008. Intrusion Detection Systems. Vol. 38. Springer Science & Business Media.
[10] O. M. Faker. 2018. Intrusion Detection Using Big Data and Deep Learning Techniques. MS Thesis, Cankaya University.
[11] J. H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4): 367-378.
[12] H. Gharaee and H. Hosseinvand. 2016. A new feature selection IDS based on genetic algorithm and SVM. In Telecommunications (IST), 2016 8th International Symposium on. IEEE.
[13] G. P. Gupta and M. Kulariya. 2016. A framework for fast and efficient cyber security network intrusion detection using Apache Spark. Procedia Computer Science 93: 824-831.
[14] J. Han, E. Haihong, G. Le, and J. Du. 2011. Survey on NoSQL database. In Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, Port Elizabeth, South Africa, pp. 363-366. IEEE.
[15] A. Lashkari, G. Draper-Gil, M. Mamun, and A. A. Ghorbani. 2017. Characterization of Tor traffic using time based features. In ICISSP, pp. 253-262.
[16] Y. Liu. 2014. Random forest algorithm in big data environment. Computer Modelling & New Technologies 18(12A): 147-151.
[17] N. Moustafa and J. Slay. 2016. The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Information Security Journal: A Global Perspective 25(1-3): 18-31.
[18] N. Moustafa and J. Slay. 2015. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In Military Communications and Information Systems Conference (MilCIS), Canberra, Australia. IEEE.
[19] N. Moustafa and J. Slay. 2016. The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Information Security Journal: A Global Perspective 25(1-3): 18-31.
[20] R. Primartha and B. Tama. 2017. Anomaly detection using random forest: A performance revisited. In Data and Software Engineering (ICoDSE), 2017 International Conference on, Palembang, Indonesia. IEEE.
[21] P. Resende and A. Drummond. 2018. Adaptive anomaly-based intrusion detection system using genetic algorithm and profiling. Security and Privacy: e36.
[22] A. Rosenberg and J. Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
[23] J. Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks 61: 85-117.
[24] I. Sharafaldin, A. Lashkari, and A. A. Ghorbani. 2018. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In ICISSP, Funchal, Madeira, Portugal.
[25] I. Sharafaldin, A. Gharib, A. H. Lashkari, and A. A. Ghorbani. 2018. Towards a reliable intrusion detection benchmark dataset. Software Networking 2018(1): 177-200.
[26] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. 2010. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), IEEE 26th Symposium on, pp. 1-10.
[27] O. B. Sezer, M. Ozbayoglu, and E. Dogdu. 2017. A deep neural-network based stock trading system based on evolutionary optimized technical analysis parameters. Procedia Computer Science 114: 473-480.
[28] S. Suthaharan. 2014. Big data classification: Problems and challenges in network intrusion prediction with machine learning. ACM SIGMETRICS Performance Evaluation Review 41(4): 70-73.
[29] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. 2009. A detailed analysis of the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications (CISDA 2009), IEEE Symposium on, pp. 1-6. IEEE.
[30] K. Teknomo. 2006. K-means clustering tutorial. Medicine 100(4): 3.
[31] A. Thusoo et al. 2009. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2): 1626-1629.
[32] R. Vijayanand, D. Devaraj, and B. Kannapiran. 2018. Intrusion detection system for wireless mesh network using multiple support vector machine classifiers with genetic-algorithm-based feature selection. Computers & Security 77: 304-314.
[33] M. Zaharia et al. 2016. Apache Spark: a unified engine for big data processing. Communications of the ACM 59(11): 56-65.
[34] C. Zhang and Y. Ma, eds. 2012. Ensemble Machine Learning: Methods and Applications. Springer Science & Business Media.
[35] P. Zikopoulos and C. Eaton. 2011. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media.
[36] R. Zuech, T. M. Khoshgoftaar, and R. Wald. 2015. Intrusion detection and big heterogeneous data: a survey. Journal of Big Data 2(1): 3.
