
This article has been accepted for publication in IEEE Internet of Things Journal. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2022.3211346. This work is licensed under a Creative Commons Attribution 4.0 License; for more information, see https://creativecommons.org/licenses/by/4.0/

An enhanced AI-based Network Intrusion Detection System using Generative Adversarial Networks
Cheolhee Park, Jonghoon Lee, Youngsoo Kim, Jong-Geun Park, Hyunjin Kim, and Dowon Hong

Abstract—As communication technology advances, various and heterogeneous data are communicated in distributed environments through network systems. Meanwhile, along with the development of communication technology, the attack surface has expanded, and concerns regarding network security have increased. Accordingly, to deal with potential threats, research on Network Intrusion Detection Systems (NIDS) has been actively conducted. Among the various NIDS technologies, interest has recently focused on artificial intelligence (AI)-based anomaly detection systems, and various models have been proposed to improve the performance of NIDS. However, there still exists the problem of data imbalance, in which AI models cannot sufficiently learn malicious behavior and thus fail to detect network threats accurately. In this study, we propose a novel AI-based network intrusion detection system that can efficiently resolve the data imbalance problem and improve the performance of previous systems. To address the aforementioned problem, we leveraged a state-of-the-art generative model that could generate plausible synthetic data for minor attack traffic. In particular, we focused on the reconstruction error and Wasserstein distance-based generative adversarial networks, and autoencoder-driven deep learning models. To demonstrate the effectiveness of our system, we performed comprehensive evaluations over various datasets and demonstrated that the proposed systems significantly outperformed the previous AI-based NIDS.

Index Terms—Network intrusion detection system, anomaly detection, network security, generative adversarial network.

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00952, Development of 5G Edge Security Technology for Ensuring 5G+ Service Stability and Availability). (Corresponding author: Cheolhee Park.) C. Park, J. Lee, Y. Kim, J.-G. Park, and H. Kim are with the Electronics and Telecommunications Research Institute, Daejeon 34129, South Korea (e-mail: chpark0528@etri.re.kr; mine@etri.re.kr; blitzkrieg@etri.re.kr; queue@etri.re.kr; be.successor@etri.re.kr). D. Hong is with the Department of Applied Mathematics, Kongju National University, Gongju 32588, South Korea (e-mail: dwhong@kongju.ac.kr). Copyright (c) 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

I. INTRODUCTION

With the development of the fifth-generation (5G) mobile communication technology that diversifies the access environments and constructs distributed networks, various and heterogeneous data are communicated through network systems. In general, these data originate from diverse domains such as sensors, computers, and the Internet of Things (IoT), and the capacity of network systems has been expanded to process these data reliably. However, as the access points are diversified, the attack surface expands, thereby leaving the network systems vulnerable to potential threats. Moreover, cyber-attack techniques have become more complex and sophisticated, and the frequency of attacks has also increased. Accordingly, the importance of cybersecurity is emphasized, and various studies have been actively conducted to prevent potential network threats.

One of the fundamental challenges in cybersecurity is the detection of network threats, and various results have been reported in the field of network intrusion detection systems (NIDS). In particular, the most recent studies have focused on applying artificial intelligence (AI) technology to NIDS, and AI-based intrusion detection systems have achieved remarkable performance. Initially, the research primarily focused on applying traditional machine learning models such as decision trees (DT) [1] and support vector machines (SVMs) [2] to existing intrusion detection systems, and it has now been extended to deep learning approaches [3] such as convolutional neural networks (CNNs), long short-term memory (LSTM), and autoencoders. Although these results have achieved remarkable performance in detecting anomalies, there still exist limitations in deploying them in real systems. In general, most of the network flow data is normal traffic, and malicious behavior that can cause service failure occurs rarely. Moreover, within the category of malicious behavior, most of the data are well-known attacks, and specific types of attacks are extremely rare. Due to this data imbalance problem, AI models deployed in NIDS cannot sufficiently learn the characteristics of specific network threats, and this may leave the network systems vulnerable to attacks owing to the poor detection performance.

In this study, to address this inherent problem, we propose a novel AI-based network intrusion detection system that can resolve the data imbalance problem and improve the performance of the previous systems. To address the aforementioned problem, we leveraged a state-of-the-art deep learning architecture, generative adversarial networks (GAN) [4], to generate synthetic network traffic data. In particular, we focused on the reconstruction error and Wasserstein distance-based GAN architecture [5], which can generate plausible synthetic data for minor attack traffic. By combining the generative model with anomaly detection models, we demonstrated that the proposed systems outperformed previous results in terms of the classification performance.


The entire architecture of our system consists of four main stages (see Fig. 1): pre-processing, generative model training, autoencoder training, and predictive model training. In the pre-processing stage, the system refines the raw dataset into a format that deep learning models can learn. After pre-processing, the system sequentially trains generative models and an autoencoder model, where the trained generative models are utilized to train the autoencoder model. Finally, the system trains predictive models by applying the trained generative models and the encoder of the trained autoencoder, where the generative models are used to generate scarce data and the encoder is used as a feature extractor. In the case of the classifier models, we consider three deep learning models that have been widely utilized in AI-based NIDS: deep neural networks (DNN), convolutional neural networks (CNN), and the long short-term memory (LSTM) model. To evaluate our system, we experimented with four network flow datasets considering different scenarios: NSL-KDD [6][7], UNSW-NB15 [8], an IoT dataset [9], and a real-world dataset. Through experiments on these various datasets, we show that the proposed system outperformed previous results. Moreover, we demonstrate that our methodology can improve the performance of existing AI-based NIDS by resolving the data imbalance problem.

Fig. 1. The entire systemic architecture of our AI-based NIDS.

The main contributions of the proposed approach can be summarized as follows:

• By combining the state-of-the-art GAN model that can generate plausible synthetic data and measure the convergence of training, we show that the proposed system outperforms existing AI-based NIDS in terms of detection rate.
• Through comparative experiments with various deep learning models, we present that the detection performance for rare attacks can be improved by applying our methodology as a base module.
• By experimenting with datasets collected from various scenarios, we show that the proposed system can be effectively applied to real-world environments.

The rest of this paper is organized as follows. Section 2 briefly reviews related research from the perspective of NIDS based on machine learning and deep learning approaches, and Section 3 provides background with a focus on autoencoders and generative adversarial networks. In Section 4, we describe our methodology and the proposed framework as well as the four main stages in detail. In Section 5, we evaluate the proposed system in various environments and present experimental results with detailed analysis. Finally, we present concluding remarks and future work directions of this study in Section 6.

II. RELATED WORK

In the field of AI-based network intrusion detection systems, many studies have been conducted to apply machine learning and deep learning technologies to anomaly detection. Ingre and Yadav [10] proposed a multi-layer perceptron-based intrusion detection system and showed that the proposed approach achieved 81% and 79.9% accuracy in experiments on the NSL-KDD dataset for binary and multi-classification, respectively. Gao et al. [11] proposed a semi-supervised learning approach for network intrusion detection systems based on fuzzy and ensemble learning and reported that the proposed system achieved 84.54% accuracy on the NSL-KDD dataset. By applying the deep belief network (DBN) model, Alrawashdeh et al. [12] developed an anomaly intrusion detection system and showed that the proposed DBN-based IDS exhibited a superior classification performance on sub-sampled testing sets (sampled subsets from the original dataset). By considering the Software Defined Networking environment, Tang et al. [13] proposed a deep neural network-based anomaly detection system and reported that the DNN-based approach outperformed traditional machine learning approaches (e.g., Naïve Bayes, SVM, and Decision Tree). In [14], the authors proposed a restricted Boltzmann machine (RBM)-based intrusion detection system and showed that the Gaussian–Bernoulli RBM model outperformed other RBM-based models (such as Bernoulli–Bernoulli RBM and DBN). From the perspective of utilizing both behavioral (network traffic characteristics) and content features (payload information), Zhong et al. [15] introduced a big data and tree architecture-driven deep learning system into intrusion detection, where the authors combined shallow learning and deep learning strategies and showed that the system is particularly effective at detecting subtle patterns of intrusion attacks. With an ensemble model-like approach, Haghighat et al. [16] proposed an intrusion detection system based on deep learning and voting mechanisms. In [16], the authors aggregated the best model results and showed that the system can provide more accurate detections. Moreover, they showed that false alarms can be reduced by up to 75% compared to conventional deep learning approaches. Considering data streams in industrial IoT environments, Yang et al. [17] proposed a tree structure-based anomaly detection system, where the authors incorporated window sliding, detection strategy changing, and model updating mechanisms into the locality-sensitive hashing-based iForest model [18][19] to handle the infiniteness of data streams in real-time scenarios. Similarly, Qi et al. [20] proposed an intrusion detection system for multiaspect data streams by combining locality-sensitive hashing, isolation forest, and principal component analysis techniques. In [20], the authors showed that the proposed system can effectively detect group anomalies while dealing with multiaspect data and process each data row faster than previous approaches.

From the perspective of dealing with time-series data, several results have been reported focusing on recurrent models. Kim et al. [21] proposed an LSTM-based IDS model and proved the efficiency of the proposed IDS. Yin et al. [22] proposed a recurrent neural network-based intrusion detection system and achieved 83.3% accuracy and 81.3% accuracy in binary and multi-classification, respectively. Xu et al. [23] developed a recurrent neural network-based intrusion detection model and reported that the gated recurrent unit was more suitable as a memory unit for intrusion detection than the LSTM unit. By considering supervisory control and data acquisition (SCADA) networks, Gao et al. [24] proposed an omni-intrusion detection system. In [24], the authors combined LSTM and a feedforward neural network through an ensemble approach and showed that the proposed system can effectively detect intrusion attacks regardless of temporal correlation. Moreover, they demonstrated that the proposed omni-IDS outperformed previous deep learning approaches through experiments on a SCADA testbed.

In addition to the previous approach of applying supervised learning as an anomaly detection model, several studies have focused on the application of unsupervised learning, especially autoencoder models. Javaid et al. [25] proposed a sparse autoencoder-based NIDS and reported that the proposed model achieved 79.1% accuracy for multi-classification on the NSL-KDD dataset. Similarly, Yan and Han [26] leveraged the sparse autoencoder model to extract high-level feature representations of intrusive behavior information and demonstrated that the stacked sparse autoencoder model could be applied as an efficient feature extraction method. Shone et al. [27] proposed a stacked non-symmetric deep autoencoder-based intrusion detection system. In [27], the authors showed that the proposed model could achieve 85.42% accuracy in multi-classification. As one of the significant results, Ieracitano et al. [28] proposed an autoencoder-driven intrusion detection model. In [28], the authors proposed autoencoder-based and LSTM-based IDS models and compared their performance with conventional machine learning models. Through experiments on the NSL-KDD dataset, they reported that the proposed autoencoder-based systems outperformed other models and achieved 84.21% and 87% accuracy for binary and multi-classification, respectively.

As another approach to applying unsupervised learning, several studies have investigated using generative models to improve the performance of existing NIDS. In particular, they have focused on applying the basic generative adversarial networks (GAN) [4], which are based on the Jensen-Shannon divergence (or Kullback-Leibler divergence) [29][30][31]. Thereafter, along with the development of various GAN models, studies have been conducted to apply appropriate GAN models for specific purposes. Li et al. [32] and Lee et al. [33] utilized the Wasserstein divergence-based GAN model to generate synthetic data, and Dlamini et al. [34] proposed a conditional GAN-based anomaly detection model to improve the classification performance in the minority classes. By focusing on specific industrial environments, Li et al. [35] and Alabugin et al. [36] proposed LSTM-GAN and bidirectional GAN-based anomaly detection models, respectively. Through experiments on the Secure Water Treatment (SWaT) dataset, they demonstrated that GAN models could be effectively applied to IDS. Siniosoglou et al. [37] proposed an anomaly detection model that could simultaneously detect anomalies and categorize the attack types. In [37], the authors encapsulated the autoencoder architecture into the structure of the basic GAN model (i.e., deploying the encoder as a discriminator and the decoder as a generator) and proved the efficiency of the proposed model in various smart grid environments.

Unlike previous GAN approaches that are based on the distance between data distributions, we considered the reconstruction error-based GAN model to generate more plausible synthetic data. In particular, we leveraged the Boundary Equilibrium GAN (BEGAN) model [5], which is based on the concept of autoencoders and the Wasserstein distance between the reconstruction error distributions of real and synthetic samples. Moreover, we incorporated the autoencoder model into the detection models to extract meaningful features from the data and extend the adaptability, and demonstrated that the proposed framework outperforms previous AI-based network intrusion detection models.


III. BACKGROUND

In this section, we briefly illustrate the concepts of autoencoders and GAN, which are key components of our anomaly detection system.

A. Autoencoder

Fig. 2. The basic architecture of Autoencoder.

The autoencoder [38][39] is one of the fundamental deep learning models and is trained with an unsupervised learning process. The objective of autoencoders is to return the output as close to the original input as possible. Therefore, the parameters are updated progressively during the training process to minimize the reconstruction error. In general, the architecture of an autoencoder consists of two components: an encoder and a decoder (Fig. 2). The encoder is responsible for mapping the given raw input data x into the latent space of representation:

  z = f(xW + b),   (1)

where f denotes the activation function of the encoder, and W and b represent the weight matrix and the bias vector, respectively. Conversely, the decoder plays the role of reconstructing the representation z into data as close to the corresponding input as possible (i.e., x̃):

  x̃ = g(zW′ + b′),   (2)

where g denotes the activation function of the decoder, and W′ and b′ are the weight matrix and the bias vector, respectively. Therefore, the autoencoder is trained to minimize the reconstruction error L_RE:

  L_RE(x, x̃; W, W′) = ‖x − x̃‖²₂ = ‖x − g(W′ · f(xW + b) + b′)‖²₂.   (3)

One of the fundamental characteristics of the autoencoder is to represent high-dimensional input data as lower-dimensional information (summarized but meaningful information). Herein, we utilized autoencoders with the aim of feature extraction (dimension reduction) on the input data. Although Principal Component Analysis (PCA) has traditionally been utilized to project high-dimensional data into a lower-dimensional space, we leveraged autoencoders for non-linear transformations on complex datasets. Although we only present the basic architecture of autoencoders, models can be built with multiple layers and in an asymmetric manner.

B. Generative adversarial networks

Fig. 3. The basic architecture of generative adversarial networks.

Generative models are designed to approximate the probability distribution of a training dataset and aim to generate synthetic data that is close to the real data (training data). Recently, among these generative models, research on GAN [4] has been of significant interest. Accordingly, various GAN models have been proposed to improve the performance and advance the functionality (e.g., [40]-[42]). A GAN model consists of two neural network-based models: a generator G and a discriminator D. The generator G aims to generate synthetic data (fake data) that is close to the real data, while the discriminator D aims to discriminate between the real and fake data. In other words, these two components have opposing objectives during the training process.

More formally, let p_z and p_data be the probability distributions of the latent code and the real data, respectively. Then, the objective function V(D, G) of a GAN that consists of a generator G and a discriminator D is a minimax game and can be formulated as follows:

  min_G max_D V(D, G) = E_{x∼p_data}[log(D_{θ_D}(x))] + E_{z∼p_z}[log(1 − D_{θ_D}(G_{θ_G}(z)))],   (4)

where θ_D and θ_G denote the model parameters of D and G, respectively. Therefore, the discriminator is trained to output a higher confidence value on real data, and the generator is trained to generate synthetic data that can maximize the confidence score of the discriminator. After a sufficient number of iterations of this training process, both the discriminator and generator will settle to a point where there is no scope for further improvement (i.e., a Nash equilibrium is achieved).
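To make the autoencoder of Section III-A concrete, the following is a minimal PyTorch sketch (not the authors' code) of an encoder–decoder pair trained on the squared reconstruction error of Eqs. (1)-(3); the layer sizes, activations, and the toy batch are illustrative assumptions only.

# Minimal autoencoder sketch for Eqs. (1)-(3): z = f(xW + b), x~ = g(zW' + b'),
# trained to minimize the squared reconstruction error ||x - x~||^2.
# Layer sizes (41 -> 20 -> 41) are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features=41, latent_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, n_features), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # Eq. (1): latent representation
        return self.decoder(z)       # Eq. (2): reconstruction x~

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()             # squared reconstruction error, Eq. (3)

x = torch.rand(128, 41)              # toy batch of min-max scaled flow records
for _ in range(10):                  # a few illustrative training steps
    optimizer.zero_grad()
    loss = criterion(model(x), x)
    loss.backward()
    optimizer.step()

The trained encoder part of such a model is what the proposed framework later reuses as a fixed feature extractor in front of the detection models.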


Since the basic concept of the GAN model was introduced, numerous variants have been proposed to develop the original model by adjusting the objective function or by modifying the model architecture. Among these various models, we focus on the BEGAN model [5], which is based on the concept of autoencoders and reconstruction errors. Unlike other GAN models, wherein the objective function is defined based on the distance between the distributions of confidence vectors (on real and synthetic samples), the objective of BEGAN is defined based on the Wasserstein distance between reconstruction error distributions as follows:

  L_D = L(x; θ_D) − k_t · L(G(z; θ_G); θ_D)
  L_G = L(G(z; θ_G); θ_D)                                          (5)
  k_{t+1} = k_t + λ_k · (γ · L(x; θ_D) − L(G(z; θ_G); θ_D)),

where the hyper-parameter γ ∈ [0, 1] is the diversity ratio¹, and λ_k serves as the learning rate for k. Note that L(·) denotes the reconstruction error of the autoencoder, and t indicates the iteration step.

¹ Originally, the diversity ratio γ is defined as γ = E[L(G(z))] / E[L(x)].

IV. PROPOSED METHODOLOGY

As shown in Fig. 1, the entire architecture of the proposed AI-based NIDS consists of four main streams: pre-processing, generative model training, autoencoder training, and predictive model training. In this section, we describe the proposed methodology and each module (process) in detail.

A. Pre-processing

Before building and training AI models, the system refines a given raw dataset via the pre-processing module, which consists of three sub-processes: outlier analysis, one-hot encoding, and feature scaling.

In the outlier analysis phase, the system eliminates outliers, which can negatively affect the model training. Typically, outliers are detected by quantifying the statistical distribution of the datasets via robust measures of scale. There are several standard robust measures of scale for detecting outliers, such as the interquartile range (IQR) and the median absolute deviation (MAD). Among these measures, we leveraged the MAD. For a numeric attribute A = {x₁, x₂, ..., xₙ}, the MAD of the attribute is defined as follows:

  MAD = median(|xᵢ − median(A)|).   (6)

We assume that numeric attributes appearing in the dataset follow a normal distribution. Then, a consistent estimator σ̂ of the standard deviation is 1.4826 × MAD. In terms of this estimator, we determine that, for a given numeric attribute, values exceeding 10 × σ̂ are outliers. Obviously, outlier analysis is performed only on the numerical attributes and is conducted independently for each class. Note that outlier removal should be performed before scaling features, as scaling can potentially obscure information about outliers.

After filtering out the outliers, the system transforms nominal attributes into one-hot vectors. Each nominal (categorical) attribute is represented as a binary vector with the size of the number of attribute values, where 1 is assigned only to the point corresponding to the expressed value and 0 to all others. For example, in the case of the 'protocol' attribute (commonly included in network traffic data) with the values tcp, udp, and icmp, the attribute is transformed into a binary vector of length 3, and the attribute values are converted into [1,0,0], [0,1,0], and [0,0,1], respectively. Together with the one-hot encoding process, the system scales the numeric attributes. In general, normalization (e.g., [28]) and standardization (e.g., [24]) can be considered as scaling for numeric features. Between these two approaches, we adopted the min-max normalization method². The normalization function f_A(·) for a numeric attribute A that maps ∀x ∈ A into the range [0, 1] can be defined as follows:

  f_A(xᵢ) = x̃ᵢ = (xᵢ − minⱼ xⱼ) / (maxⱼ xⱼ − minⱼ xⱼ),   (7)

where xᵢ denotes the i-th attribute value in the attribute A.

² In our experiments, there was no significant difference between the two feature scaling methods in terms of the performance of the detection models.

In general, existing deep learning-based approaches consider feature extraction (e.g., principal component analysis, the Pearson correlation coefficient, etc.) at this step to feed the model as many informative features as possible, and, consequently, feature extraction can significantly impact the performance of models in anomaly detection. However, we do not consider a computational feature extraction process, as our framework embeds an autoencoder model that can replace the functionality of feature extraction. Note that, in our framework, the model with a computational feature extraction process did not show significant improvement compared with the model without the feature extraction. A detailed description of deploying the autoencoder as a feature extractor is presented later.

B. Synthetic data generation with generative model

The synthetic data generation module builds and trains generative models using the dataset refined in the data pre-processing module. In the case of the generative model, we utilize a state-of-the-art GAN model, BEGAN, which is based on the concept of autoencoders and a reconstruction error-based objective function. For the model architecture, we built the discriminator as a symmetric autoencoder model with five layers and the generator with the same architecture as the decoder of the discriminator (autoencoder). Figure 4 illustrates the entire architecture of the BEGAN model. Before training the BEGAN model, the system first splits the given dataset according to the classes and then builds generative models for each split sub-dataset. That is, generative models are built in a number equal to the number of classes, and (after training) each generative model produces only synthetic data corresponding to a particular class.
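For illustration only (not the authors' implementation), the pre-processing module of Section IV-A could be sketched as follows with pandas/NumPy. The column names and the toy data frame are assumptions; the MAD rule (σ̂ = 1.4826 × MAD, 10 × σ̂ cutoff, applied per class), the one-hot encoding, and the min-max scaling follow the description above.

# Sketch of the pre-processing module: MAD-based outlier filtering (per class),
# one-hot encoding of nominal attributes, and min-max scaling of numeric attributes.
# Column names ("protocol", "duration", "label") are hypothetical examples.
import numpy as np
import pandas as pd

def remove_outliers(df, numeric_cols, label_col="label", cutoff=10.0):
    """Drop rows whose numeric values exceed 10 * sigma_hat, per class (Eq. 6)."""
    keep = pd.Series(True, index=df.index)
    for _, group in df.groupby(label_col):
        for col in numeric_cols:
            med = group[col].median()
            mad = (group[col] - med).abs().median()
            sigma_hat = 1.4826 * mad                     # consistent estimator of std
            if sigma_hat > 0:
                mask = (group[col] - med).abs() <= cutoff * sigma_hat
                keep.loc[group.index] = keep.loc[group.index] & mask
    return df[keep]

def one_hot(df, nominal_cols):
    """Expand each nominal attribute into a binary indicator vector."""
    return pd.get_dummies(df, columns=nominal_cols)

def min_max_scale(df, numeric_cols):
    """Map each numeric attribute into [0, 1] (Eq. 7)."""
    for col in numeric_cols:
        lo, hi = df[col].min(), df[col].max()
        df[col] = 0.0 if hi == lo else (df[col] - lo) / (hi - lo)
    return df

# Toy example
raw = pd.DataFrame({"protocol": ["tcp", "udp", "icmp", "tcp"],
                    "duration": [0.1, 2.3, 0.0, 500.0],
                    "label": ["Normal", "Normal", "DoS", "DoS"]})
clean = remove_outliers(raw, ["duration"])
clean = min_max_scale(one_hot(clean, ["protocol"]), ["duration"])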


One of the important factors that must be considered when applying GAN models to NIDS is the determination of the termination criteria of training, which has a significant impact on the performance of anomaly detection, as it is directly related to the quality of the synthetic data on which the detection model is trained. The determination of the termination criteria stems from tracking the training convergence, and this is a difficult problem, as the objective function of GAN models is defined to have the properties of a zero-sum game. In general, monitoring the training progress has been conducted indirectly through visual inspection of synthetic (generated) data. However, even this approach is not feasible in NIDS environments, because the data being handled is not in the form of an image. Fortunately, unlike other GAN models, BEGAN can approximate the convergence of training through the concept of equilibrium, and this characteristic facilitates the determination of the criteria for training termination. The convergence measure M of BEGAN is formulated as follows:

  M = L(x) + |γ · L(x) − L(G(z))|,   (8)

where L(·) is the reconstruction error function, and γ is the diversity ratio.

Fig. 4. The architecture of the generative model in our system.

By utilizing the convergence measure, the system terminates the generative model's training process. That is, when training the generative model, the system considers a threshold as an input parameter and terminates the training process if the convergence measure M outputs a value less than the given threshold. In the experiment, we set the threshold of the convergence measure M to 0.058³.

³ In the learning process, if the convergence measure M does not fall below a given threshold, the process may fall into an infinite loop. To prevent this, we additionally set a maximum number of iterations.

After training the generative model, the system generates synthetic data according to the classes using the trained generators and integrates the generated dataset into the original training dataset. This expanded dataset is used to train the autoencoder and detection model in the next stage. Note that, although we designed the synthetic data generation module to build multiple generative models according to the number of classes, it can be built as a single model by integrating the concept of the conditional GAN architecture [41], where class attributes are embedded in the input space.

C. Learning the autoencoder and detection model

To build the intrusion detection model, the system first trains an autoencoder model that can provide feature extraction and dimensionality reduction functionalities. In our framework, we designed the autoencoder to possess the same architecture as the discriminator of the generative model. Because the deployed generative model is BEGAN, the discriminator has the form of an autoencoder, as depicted above, and is compatible in terms of the model architecture, as it handles the same data format as the detection model. After building an autoencoder model, the system trains it using the expanded dataset composed in the previous module and then utilizes the trained encoder as the feature extraction module. Algorithm 1 presents a detailed process for autoencoder training, where mᵢ (1 ≤ i ≤ k) indicates the magnitude of synthetic data to be generated for the class i. Note that the trained encoder is placed at the forefront (input layer) of the detection models as a feature extractor and is set not to learn any more when training the detection models (i.e., we fix the model parameters of the trained encoder when training the detection models).

Algorithm 1 Autoencoder training with generators
Input: training dataset D_train, a set of generators G
1: Initialize autoencoder parameters θ⁰_AE
2: for Gᵢ ∈ G, where 1 ≤ i ≤ k do
3:   sample z = {zⱼ}_{j=1,...,mᵢ} from the latent space
4:   D̂ᵢ = Gᵢ(z)
5: end for
6: D̃ = D_train ∪ D̂₁ ∪ · · · ∪ D̂ₖ
7: θ_AE = Train_Autoencoder(θ⁰_AE, D̃)
8: θ_enc = Extract_Encoder(θ_AE)
Output: trained encoder θ_enc
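A per-class BEGAN training loop with the equilibrium-based stopping rule of Eq. (8) could look roughly like the sketch below. This is not the authors' code: the network definitions, optimizer settings, L1 reconstruction error, and the max_iters safeguard mentioned in the footnote are illustrative assumptions, while the loss of Eq. (5), the k_t update, and the 0.058 threshold follow the text.

# Sketch of BEGAN training for one class, with the losses of Eq. (5) and the
# convergence measure M of Eq. (8) used as the termination criterion.
# The generator and the autoencoder-shaped discriminator are assumed to be
# defined elsewhere; gamma, lambda_k, and threshold follow the paper's notation.
import torch

def recon_error(disc, x):
    return (disc(x) - x).abs().mean()            # L(.): element-wise reconstruction error

def train_began(generator, discriminator, loader, latent_dim=50,
                gamma=0.5, lambda_k=0.001, threshold=0.058, max_iters=250_000):
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    k, step = 0.0, 0
    while step < max_iters:
        for x_real in loader:
            z = torch.randn(x_real.size(0), latent_dim)
            x_fake = generator(z)

            # Discriminator update: L_D = L(x) - k_t * L(G(z))
            opt_d.zero_grad()
            loss_real = recon_error(discriminator, x_real)
            loss_fake = recon_error(discriminator, x_fake.detach())
            (loss_real - k * loss_fake).backward()
            opt_d.step()

            # Generator update: L_G = L(G(z))
            opt_g.zero_grad()
            loss_g = recon_error(discriminator, generator(z))
            loss_g.backward()
            opt_g.step()

            # Equilibrium control: k_{t+1} = k_t + lambda_k * (gamma * L(x) - L(G(z)))
            balance = (gamma * loss_real - loss_g).item()
            k = min(max(k + lambda_k * balance, 0.0), 1.0)

            # Convergence measure of Eq. (8) used as the stopping rule
            m = loss_real.item() + abs(balance)
            step += 1
            if m < threshold or step >= max_iters:
                return generator
    return generator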


For detection models, we utilized the basic DNN, CNN, and LSTM as classifiers. We designed the DNN model to possess two hidden layers, and it could naturally process the refined network traffic data in terms of the model training and classification task. In the case of the CNN model, because the model was originally designed to be more suitable for analyzing image data, it required additional transformation processes in the input data space or in the layers of the model, depending on the approach followed. In our system, we built the CNN model with one-dimensional (1D) convolutional layers to process the network traffic data, rather than converting the input data (i.e., network traffic data) into a two-dimensional space. As shown in Figure 5, we configured the CNN classifier to have two 1D-convolutional layers and one fully connected layer. For LSTM, we designed the model to possess two recurrent layers with LSTM units and a fully connected layer, as shown in Figure 6. LSTM is known to be particularly effective in analyzing temporally correlated features [24]. Taking these characteristics into account, we omitted the process of combining with the autoencoder model for the LSTM model, since the encoder may obscure the temporal features. For all models, we designed the output layer with a binary field when the task was to detect anomalies, and with multi-valued fields when the purpose was to distinguish not only the anomalies but also the detailed threat types. Algorithm 2 presents a detailed workflow for training a detection model with the trained generators and the trained encoder. As with the autoencoder training process, the magnitude mᵢ (1 ≤ i ≤ k) of synthetic data generation can be set differently depending on the weight of each class. Note that the process of combining with the trained encoder (lines 7 and 8 in Algorithm 2) can be omitted according to the predictive model.

Fig. 5. The architecture of the CNN classifier in our system.

Fig. 6. Structure of LSTM cell.

Algorithm 2 Classifier training with generators
Input: training dataset D_train, a set of generators G, trained encoder θ_enc
1: Initialize classifier parameters W⁰
2: for Gᵢ ∈ G, where 1 ≤ i ≤ k do
3:   sample z = {zⱼ}_{j=1,...,mᵢ} from the latent space
4:   D̂ᵢ = Gᵢ(z)
5: end for
6: D̃ = D_train ∪ D̂₁ ∪ · · · ∪ D̂ₖ
7: Set_Trainable_State on θ_enc = False
8: Build W⁰_{θ_enc} = Concatenate_Models(θ_enc, W⁰)
9: W_{θ_enc} = Train_Classifier(W⁰_{θ_enc}, D̃)
Output: trained classifier W_{θ_enc}

From the perspective of the entire framework, the system sequentially processes the data pre-processing, synthetic data generation, and detection model training modules, and we refer to the whole system as G-DNN_AE, G-CNN_AE, and G-LSTM, according to the type of the detection model. Additionally, we subdivide the whole system into subsystems for a comprehensive comparison. In particular, we consider the DNN, CNN, and LSTM models as naïve deep learning models, and DNN_AE and CNN_AE, which are models combined with the autoencoder, as advanced deep learning models. In the experiment, we conducted a comparative analysis of G-LSTM, G-DNN_AE, and G-CNN_AE with the subsystems.

V. EXPERIMENTS AND EVALUATIONS

In this section, we first review the target datasets and describe the detailed implementation of each component. Then, we present the experimental results with comparative analysis and evaluate the proposed systems.

A. Dataset description

In this work, we focused on three network traffic datasets that are widely used as benchmark datasets in the field of intrusion detection systems. Furthermore, we collected real data from a large enterprise system and analyzed the performance of the proposed model on the real dataset.
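A rough PyTorch sketch of Algorithm 2 above (not the authors' implementation) is given below: per-class generators augment the training set, the trained encoder is frozen, and a classifier head is stacked on top of it. The helper names, dimensions, and optimizer settings are assumptions made for illustration.

# Sketch of Algorithm 2: augment the training set with per-class generators,
# freeze the trained encoder, and train a classifier head on top of it.
# `encoder` and `generators` (dict: class index -> trained generator) are assumed
# to exist; sizes (latent_dim=50, encoder output 50, 2 classes) are illustrative.
import torch
import torch.nn as nn

def augment(dataset_x, dataset_y, generators, sizes, latent_dim=50):
    """D~ = D_train U D^_1 U ... U D^_k (lines 2-6 of Algorithm 2)."""
    xs, ys = [dataset_x], [dataset_y]
    for cls, gen in generators.items():
        m_i = sizes.get(cls, 0)                      # magnitude m_i per class
        if m_i > 0:
            with torch.no_grad():
                z = torch.randn(m_i, latent_dim)
                xs.append(gen(z))
                ys.append(torch.full((m_i,), cls, dtype=torch.long))
    return torch.cat(xs), torch.cat(ys)

def build_classifier(encoder, n_latent=50, n_classes=2):
    """Lines 7-8: freeze the encoder and concatenate a trainable head."""
    for p in encoder.parameters():
        p.requires_grad = False                      # encoder is not updated further
    head = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                         nn.Linear(32, n_classes))
    return nn.Sequential(encoder, head)

def train_classifier(model, x, y, epochs=10):
    """Line 9: standard supervised training on the expanded dataset."""
    opt = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()))
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model

As noted above, the encoder-concatenation step would simply be skipped for the LSTM variant, which is trained directly on the (augmented) input features.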


1) NSL-KDD dataset: The NSL-KDD dataset is a refined version of the KDDcup99 dataset [6][7] and consists of training and testing datasets, KDDTrain and KDDTest, with 125,973 and 22,544 rows, respectively⁴. In each data point, there exist 41 attributes (3 nominal, 6 binary, and 32 numeric attributes) presenting different features of the network flow and a label indicating an attack type or normal behavior. For the attack type, there exist four distinct attack profiles: Denial of Service (DoS), Probing, Remote to Local (R2L), and User to Root (U2R). DoS is an attack that depletes resources by sending excessive traffic to the target system, thereby rendering it incapable of handling legitimate network traffic or service access. In the case of a probing attack, the attacker's objective is to gain information about the target system (e.g., scanning ports in use and sweeping IP addresses). R2L is an attack that attempts to obtain local access from a remote machine by sending remote fraudulent traffic to the target, and behaviors such as password guessing and HTTP tunneling are considered R2L attacks. In the case of U2R, an attacker first gains access to the target system as an honest user and then attempts to gain root privileges by causing system faults (e.g., buffer overflow and rootkit). Table 1 presents the entire distribution of the NSL-KDD dataset with respect to the classes (attack classes and normal).

⁴ The original configuration of the dataset includes several sub-datasets. However, we only present the main training and testing datasets.

TABLE I
DATA DISTRIBUTION IN NSL-KDD

Class     Training   Weight (%)   Testing   Weight (%)
Normal    67,342     53.46%       9,710     43.07%
DoS       45,927     36.46%       7,460     33.09%
Probing   11,656     9.25%        2,421     10.74%
R2L       995        0.79%        2,885     12.79%
U2R       52         0.041%       67        0.29%
Total     125,973    100%         22,543    100%

2) UNSW-NB15 dataset: Together with the NSL-KDD dataset presented above, the UNSW-NB15 dataset [8], which was created by the IXIA PerfectStorm tool, has been widely used as an experimental dataset in the field of anomaly detection systems. Similarly, UNSW-NB15 consists of training and testing datasets, UNSW-NB15 training and UNSW-NB15 testing, with 175,341 and 82,332 records, respectively. Each record possesses 43 attributes that present network flow features and two class attributes⁵. The class attributes consist of an attribute that indicates whether or not the record is normal traffic (a binary-valued attribute) and the type of attack (when the record is abnormal). For the attack type, there are nine distinct attack profiles that are intuitively labeled as follows: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. Table 2 presents the entire distribution of the UNSW-NB15 dataset. Note that we excluded any unnecessary attribute that did not affect the training of the models (the "id" field) and combined the two class attributes into a single field. Therefore, the dataset is considered to have 42 attributes (4 nominal, 2 binary, and 36 numeric attributes) and a class attribute.

⁵ The raw dataset contains 47 attributes (excluding class attributes), including source/destination IPs and ports. However, we used the provided training/test dataset, in which features that do not affect AI training are excluded.

TABLE II
DATA DISTRIBUTION IN UNSW-NB15

Class            Training   Weight (%)   Testing   Weight (%)
Normal           56,000     31.94%       37,000    44.94%
Generic          40,000     22.81%       18,871    22.92%
Exploits         33,393     19.04%       11,132    13.52%
Fuzzers          18,184     10.37%       6,062     7.36%
DoS              12,264     6.99%        4,089     4.97%
Reconnaissance   10,491     5.98%        3,496     4.25%
Analysis         2,000      1.14%        677       0.82%
Backdoors        1,746      0.99%        583       0.71%
Shellcode        1,133      0.65%        378       0.46%
Worms            130        0.07%        44        0.05%
Total            175,341    100%         82,332    100%

3) IoT dataset: In addition to the NSL-KDD and UNSW-NB15 datasets, we evaluated the performance of our system on a network traffic dataset, called IoT-23 [9], collected from Internet of Things (IoT) devices. The IoT-23 dataset consists of 20 sub-datasets collected from malicious IoT scenarios and three sub-datasets collected from benign scenarios. For these datasets, we utilized the dataset collected on the Mirai botnet scenario (named CTU-IoT-Malware-Capture-34-1). The dataset contains 23,145 IoT network flows, where each data point belongs to one of the following four classes: Benign, C&C, DDoS, and PortScan. Benign matches the normal class, and the others are treated as threats. C&C indicates communication connected to the command & control server, and PortScan refers to the activity of scanning ports to gather information in order to conduct further attacks. For each data point, there are 21 attributes (11 nominal, 2 binary, and 8 numeric attributes) presenting different features of the network flow, and we removed four features that did not affect the learning, such as the id and IP address. To adjust the magnitude of the normal class data considering the data imbalance scenario, we randomly sampled 98,077 data points from the datasets in the benign scenarios. Consequently, we configured the IoT dataset to have 100,000 Benign data, 6,706 C&C data, 14,394 DDoS data, and 122 PortScan data.
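The class-rebalancing step described above for the IoT-23 data (sampling 98,077 benign flows and merging them with the Mirai capture) could be sketched as follows. The file names, file format, and label column are assumptions made for illustration; only the counts come from the text.

# Sketch of assembling the IoT-23 experimental set described above: take the
# malicious Mirai capture and top up the Benign class by randomly sampling flows
# from the benign-scenario captures. Paths and the "label" column are hypothetical.
import pandas as pd

malicious = pd.read_csv("ctu-iot-malware-capture-34-1.csv")    # 23,145 flows
benign_pool = pd.concat([pd.read_csv(f"benign-scenario-{i}.csv") for i in range(1, 4)])

extra_benign = benign_pool.sample(n=98_077, random_state=0)
extra_benign["label"] = "Benign"

iot_dataset = pd.concat([malicious, extra_benign], ignore_index=True)
print(iot_dataset["label"].value_counts())
# Expected distribution per the paper: 100,000 Benign, 14,394 DDoS,
# 6,706 C&C, 122 PortScan.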


4) Real dataset: To evaluate the performance of our system in real-world environments, we collected raw security events from a large enterprise system. The data were collected over 5 months, where threats were logged separately by security operations center (SOC) analysts whenever an intrusion occurred. In the dataset, we investigated 798 cyber threats, which occurred evenly over the collection period (not focused on a specific period), and observed 547 system attacks, 240 scanning attacks, and 11 worm attacks (the categorization was conducted by the SOC analysts). In terms of the categories, the system attack includes cross-site scripting, DDoS, brute force attack, and injection attack, whereas the scanning attack includes Trojan and backdoor attacks. In total, we collected 4,782,342 security event data, of which 230,026 were identified as cyber threats (i.e., 4,552,316 data were labeled as "Normal," and 230,026 data were labeled as "Threat"). Each raw data point has 16 basic features for network flow information, such as the protocol type, service, and source bytes (8 nominal and 8 numeric attributes). Moreover, because the collected data are raw security events, each data point includes information regarding the suspicious security event⁶. Table 3 presents the distribution of the collected dataset with respect to the suspicious security events, and it can be seen that the false positives are relatively high (see [43] for a detailed description of the collected real dataset). Note that, although there were several detailed classes of detected attacks, each data point was categorized as "Normal" or "Threat" only (related to the privacy issues of the enterprise).

⁶ Note that the suspicious security event can be different from the labels classified by the SOC analysts.

TABLE III
DISTRIBUTION OF RAW SECURITY EVENTS IN THE REAL DATASET.

Event ID   Prefix                       Count       Weight (%)
E2         UDP Packet Flooding          1,048,926   21.9%
E4         UDP Source-IP Flooding       718,788     15.2%
E40        SIP Vulnerability Scanner    644,683     13.5%
E7         TCP Connect DoS              553,362     11.6%
...        ...                          ...         ...
E23        HTTPD Overflow               115,477     2.4%
E29        NTP Amplification DDoS       107,617     2.3%
...        ...                          ...         ...

B. Implementation and hyperparameter tuning

As described in the previous section, we set the discriminator of the generative model to be a symmetric autoencoder model with three layers. For this model, we constructed the first hidden layer with 80 neurons and a latent space dimension with a size of 50. Therefore, the generator is set to have a latent space of size 50 and a hidden layer of size 80. Additionally, we applied batch normalization to each hidden layer for stability of learning and used the Rectified Linear Unit (ReLU) as the activation function. Note that, because we configured the autoencoder as a feature extractor with the same architecture as the discriminator, the above configuration corresponds to that of the autoencoder as well. In the case of the generative model, we set the convergence threshold to 0.058 and terminated training when the convergence measure fell below the given threshold or the number of epochs reached 250. For autoencoder learning, we set the default number of epochs to 300 and stopped training when the reconstruction accuracy was above 0.97.

For the classifier models, we deployed three distinct deep learning models: DNN, CNN, and LSTM. Considering the number of features, we explored the depth of the models up to three layers. In the experiment, the one-layer structure showed high volatility, and the three-layer structure showed a tendency to overfit. As a result, the models were most stable in the two-layer structure and showed the highest performance.

For the DNN model, we set the first hidden layer to have 32 neurons and the second layer to have 16 neurons. For CNN, we used a 1D-CNN model with two convolutional layers. The convolutional layers are configured to have 32 convolution filters with windows of size 5, and a fully connected layer of 16 neurons follows. Additionally, we applied a max pooling layer with a window of size 3 to the first convolutional layer, and a batch normalization layer after each convolutional layer. For the activation function, we used ReLU as in the generative model. In the case of LSTM, we connected 64 LSTM cells in each layer and concatenated a fully connected layer with 32 neurons. For these detection models, we set the default number of epochs to 300 and applied the early stopping technique (we stopped learning when the relative differences of loss were less than 10⁻⁶ consecutively for 35 epochs [24]).

We utilized two additional basic machine learning models as comparative models.

• Support Vector Machine (SVM) is a supervised learning model based on statistical learning theory and aims to locate the best hyperplane that can optimally separate input domains according to the classes. In the experiment, we implemented the linear kernel SVM model [2].
• Decision Tree (DT) is a non-parametric supervised learning model, and it recursively splits input domains based on the correlation between each feature and class. In this study, we implemented the C4.5 algorithm [1].

For a more extensive comparison, we subdivided the components of our system, DNN, CNN, LSTM, DNN_AE, and CNN_AE, and utilized them as comparative models with the whole system. Note that we regard these sub-models as corresponding to the existing AI-based NIDS. In particular, DNN, CNN, and LSTM are considered as naïve deep learning approaches. In the case of DNN_AE and CNN_AE, they are considered as advanced deep learning approaches combined with autoencoders⁷.

⁷ Although the detailed architecture and configurations may differ from those of the previous approaches, we stress that the implemented models are comparable to or outperform the existing systems in terms of performance.
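For illustration, a PyTorch sketch of the 1D-CNN classifier configuration described above (two 1D convolutional layers with 32 filters of window size 5, max pooling of size 3 after the first, batch normalization after each convolution, ReLU, and a 16-neuron fully connected layer) might look as follows. This is not the authors' code: the padding, flattening step, and the assumed 50-dimensional encoder output are guesses.

# Sketch of the 1D-CNN detection model described in Section V-B, which can be
# preceded by the frozen encoder (CNN_AE / G-CNN_AE variants). Input length 50
# (the latent size of the encoder) and padding choices are assumptions.
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self, in_len=50, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2),   # 1st conv: 32 filters, window 5
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=3),                  # max pooling, window 3
            nn.Conv1d(32, 32, kernel_size=5, padding=2),  # 2nd conv layer
            nn.BatchNorm1d(32),
            nn.ReLU(),
        )
        pooled_len = in_len // 3
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * pooled_len, 16),               # fully connected layer, 16 neurons
            nn.ReLU(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):                                 # x: (batch, features)
        x = x.unsqueeze(1)                                # -> (batch, 1, features)
        return self.classifier(self.features(x))

model = CNNClassifier()
logits = model(torch.rand(8, 50))                         # toy batch of latent vectors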



In the experiment, we utilized four metrics to evaluate the performance of the AI models: Accuracy, Precision, Recall, and F1-score. Accuracy refers to the fraction of correctly inferred results and is commonly used to quantify the performance of AI models. For a given class in a dataset, Precision presents the fraction of positive values inferred by the model that are correct, while Recall refers to the fraction of data with positive values that are correctly inferred by the model. The F1-score is the harmonic mean of Precision and Recall. The formulas of these metrics are defined as follows:

• Accuracy = (TP + TN) / (TP + FP + TN + FN)
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F1-score = 2 × (Precision × Recall) / (Precision + Recall)

where TP, TN, FN, and FP denote true positive, true negative, false negative, and false positive, respectively.

Using these metrics, we evaluated each model on the experimental datasets. Note that, although we built the models with a stable structure, there was still the issue of volatility. Accordingly, with respect to comparison and evaluation, we independently trained each model 100 times and displayed the results for the model with the best detection rate on the test dataset.

C. Experiments on the NSL-KDD dataset

For the NSL-KDD dataset, we explored both binary and multi-classification tasks. Note that NSL-KDD is provided separately as a training dataset and a test dataset as mentioned above, and we used these datasets in our experiments as provided. In other words, we used KDDTrain (125,973 rows) as a training dataset and KDDTest (22,544 rows) as a test dataset, and there was no data shuffling between the two datasets. In the experiments on our system (i.e., G-DNN_AE, G-CNN_AE, and G-LSTM), we generated synthetic data for each class via the generative model and integrated them into the training dataset. Obviously, the evaluation of all models was conducted on the original test dataset (KDDTest) for unbiased comparisons.

1) Binary classification: Table 4 presents the experimental results for the binary classification task on the NSL-KDD dataset. Note that the data belonging to the attack classes are naturally considered anomalies in the binary classification task (labeled as abnormal). In the experiments on our system, we generated a total of 35,000 additional data points (synthetic data) for each class via the trained generative module. Figure 7 shows a comparison of experimental results for the NSL-KDD dataset in the binary classification scenario.

TABLE IV
BINARY CLASSIFICATION RESULTS FOR THE TEST DATASET IN NSL-KDD.

                           Normal                            Abnormal
Classifier   Accuracy   Recall   Precision   F1-score   Recall   Precision   F1-score
SVM          72.1%      97.8%    61.2%       75.2%      53.1%    96.9%       68.6%
DT           81.5%      97.3%    70.8%       81.9%      69.6%    97.1%       81.0%
DNN          79.5%      96.2%    67.7%       79.6%      67.8%    96.2%       79.6%
CNN          80.5%      96.5%    68.7%       80.3%      69.5%    96.6%       80.8%
LSTM         82.0%      97.5%    71.0%       82.1%      70.0%    97.2%       81.3%
DNN_AE       85.5%      98.8%    78.0%       87.2%      72.5%    98.5%       83.5%
CNN_AE       86.4%      98.8%    79.0%       87.8%      74.1%    98.4%       84.6%
G-LSTM       85.5%      98.5%    78.3%       87.2%      72.5%    98.5%       83.5%
G-DNN_AE     89.8%      98.2%    84.3%       90.6%      81.5%    97.9%       89.0%
G-CNN_AE     90.3%      97.2%    85.3%       90.9%      83.5%    96.8%       89.7%

Fig. 7. Comparison of binary classification results on the NSL-KDD dataset.
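As a small, self-contained illustration of how the per-class metrics reported in Table IV can be computed from TP/TN/FP/FN counts (not tied to the paper's code or results), consider the following sketch with made-up labels:

# Per-class Accuracy, Precision, Recall, and F1-score from TP/TN/FP/FN counts,
# matching the formulas listed above. The label arrays are toy examples.
import numpy as np

def per_class_metrics(y_true, y_pred, positive):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["Normal", "Abnormal", "Abnormal", "Normal", "Abnormal"]
y_pred = ["Normal", "Abnormal", "Normal",   "Normal", "Abnormal"]
for cls in ("Normal", "Abnormal"):
    print(cls, per_class_metrics(y_true, y_pred, positive=cls))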



Overall, the models output relatively high recall values for the data belonging to the normal class and, conversely, showed relatively high precision values for the abnormal class. For the basic machine learning models, the DT outperformed the SVM model, with an accuracy of 81.5%. Moreover, the DT model performed better than the naïve DNN and CNN models, where DNN achieved an accuracy of 79.5% and CNN achieved an accuracy of 80.5%. Among the basic models and the naïve models, the LSTM model outperformed the others with an accuracy of 82.0%. For the advanced deep learning approaches, both DNN_AE and CNN_AE exhibited better results than the basic machine learning and the naïve deep learning models. The advanced models, DNN_AE and CNN_AE, achieved an 85.5% accuracy and an 86.4% accuracy, respectively. The proposed models, to which the generative model and autoencoder had been applied, were found to significantly outperform all the aforementioned models. In particular, both G-DNN_AE and G-CNN_AE achieved an accuracy close to 90%, and it was observed that G-CNN_AE produced the highest performance with an accuracy of 90.3%. In the case of LSTM, the generator-combined LSTM model performed slightly better than the naïve LSTM, but was measured to be inferior to CNN_AE.

2) Multi-classification: Table 5 presents the experimental results for the multi-classification task on the NSL-KDD dataset⁸. Unlike the binary classification scenario, the system could further recognize the type of threat that the data belonged to and, hence, generate synthetic data with different magnitudes based on weights in the population. In the experiments on our system, we generated synthetic data for minor classes with less than 10% weight in the distribution. That is, we generated synthetic data for the Probe, R2L, and U2R classes (10,000 synthetic data points for each class) via the trained generative model. Figure 8 shows a comparison of experimental results for the NSL-KDD dataset in the multi-classification scenario.

⁸ In the multi-classification scenario, we only present experimental results for the attack classes. Experimental results for the normal class follow the previous experiment (i.e., the experiments in the binary classification scenario).

TABLE V
MULTI-CLASSIFICATION RESULTS FOR THE TEST DATASET IN NSL-KDD.

                           DoS                               Probe
Algorithm    Accuracy   Recall   Precision   F1-score   Recall   Precision   F1-score
SVM          75.4%      76.3%    96.7%       80.0%      33.3%    80.0%       47.1%
DT           80.5%      84.2%    94.1%       88.9%      50.0%    85.7%       63.2%
DNN          79.6%      83.8%    94.8%       89.0%      31.5%    65.4%       42.5%
CNN          80.1%      89.6%    94.4%       91.9%      30.8%    69.5%       42.7%
LSTM         82.6%      92.1%    98.4%       95.1%      28.7%    69.5%       40.6%
DNN_AE       88.3%      94.9%    99.0%       96.9%      93.0%    78.6%       85.2%
CNN_AE       88.5%      95.7%    99.0%       97.3%      93.7%    78.4%       85.4%
G-LSTM       87.3%      92.2%    98.4%       95.1%      94.7%    80.2%       84.8%
G-DNN_AE     92.7%      96.0%    98.2%       97.1%      95.2%    81.1%       87.6%
G-CNN_AE     93.2%      96.0%    97.3%       96.7%      98.2%    80.5%       88.5%

                           R2L                               U2R
Algorithm    Recall   Precision   F1-score   Recall   Precision   F1-score
SVM          -        -           -          -        -           -
DT           11.1%    50.0%       18.2%      -        -           -
DNN          26.0%    48.6%       33.9%      4.6%     74.9%       8.8%
CNN          24.1%    65.1%       35.2%      6.2%     79.9%       11.5%
LSTM         21.8%    63.4%       32.4%      5.9%     76.8%       10.9%
DNN_AE       42.7%    93.7%       58.7%      7.8%     83.3%       14.2%
CNN_AE       41.1%    93.2%       57.0%      9.3%     85.7%       16.8%
G-LSTM       54.9%    80.4%       65.2%      10.2%    81.7%       18.1%
G-DNN_AE     79.8%    80.5%       80.1%      12.4%    79.9%       21.5%
G-CNN_AE     92.8%    70.2%       80.0%      11.4%    81.9%       20.1%

Fig. 8. Comparison of multi-classification results on the NSL-KDD dataset.
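The weight-based augmentation rule used here (generate synthetic samples only for classes whose share of the training set falls below a threshold, e.g., 10% for NSL-KDD) can be sketched as below. The generator interface and the fixed 10,000-sample magnitude follow the description above, while the function name and class keys are illustrative assumptions.

# Sketch of selecting minor classes by their weight in the training distribution
# and generating a fixed number of synthetic samples for each of them.
# `generators` maps a class name to its trained per-class generator (cf. Algorithm 2).
import torch

def augment_minor_classes(labels, generators, weight_threshold=0.10,
                          n_synthetic=10_000, latent_dim=50):
    counts = {}
    for y in labels:                                   # class frequencies
        counts[y] = counts.get(y, 0) + 1
    total = sum(counts.values())

    synthetic = {}
    for cls, cnt in counts.items():
        if cls != "Normal" and cnt / total < weight_threshold:
            with torch.no_grad():                      # e.g., Probe, R2L, U2R on NSL-KDD
                z = torch.randn(n_synthetic, latent_dim)
                synthetic[cls] = generators[cls](z)
    return synthetic                                   # merged into the training set afterwards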



TABLE VI
CLASSIFICATION ACCURACY FOR EACH THREAT CLASS ON THE UNSW-NB15 DATASET.

Algorithm  Generic  Exploit  Fuzzers  DoS    Reconnaissance  Analysis  Backdoors  Shellcode  Worms
DNN        76.8%    48.5%    72.9%    28.0%  49.8%           76.3%     87.3%      57.9%      52.2%
CNN        76.7%    48.6%    75.2%    27.9%  49.9%           76.8%     88.3%      58.2%      52.2%
LSTM       76.8%    48.6%    73.2%    29.4%  49.9%           76.6%     87.3%      58.0%      52.2%
DNNAE      76.8%    48.4%    74.1%    28.4%  50.1%           77.1%     88.5%      58.7%      52.2%
CNNAE      76.8%    49.1%    74.3%    28.4%  49.9%           77.5%     88.5%      58.2%      54.5%

G-LSTM     80.1%    49.0%    79.6%    29.4%  50.1%           77.1%     90.8%      58.2%      56.8%
G-DNNAE    80.6%    50.3%    81.6%    29.4%  51.3%           77.2%     91.4%      58.7%      56.8%
G-CNNAE    82.0%    50.2%    81.9%    29.1%  51.3%           77.5%     91.5%      58.9%      56.8%

However, the basic machine learning models showed poor results for the minor classes. In particular, they showed extremely low F1-scores for the R2L and U2R classes, and even failed to classify them. On the contrary, although the detection performance was insufficient, the neural network-based models performed better than the SVM and DT models in the minor classes R2L and U2R. In comparison with the basic models and the naïve models, the LSTM model outperformed the others, as in the binary classification scenario, and showed better performance on the temporally correlated attack (i.e., the DoS attack).

The advanced deep learning models, however, achieved better overall classification performance than the basic machine learning models and the naïve deep learning models, where DNNAE achieved an accuracy of 88.3% and CNNAE achieved an accuracy of 88.5%. In particular, the models combined with an autoencoder demonstrated significant improvement in the DoS, Probe, and R2L classes. However, compared with the naïve deep learning models, they did not improve the classification performance for U2R, which is extremely minor. Note that, although the results seem to have improved numerically, there is not much difference in terms of the number of data. In the case of our models, G-DNNAE and G-CNNAE achieved the best performance compared with the other models, with accuracies of 92.7% and 93.2%, respectively. From the perspective of the minor classes, the proposed models comprehensively improved the classification performance and showed a notable improvement in the classification of the R2L class. Note that we did not generate additional synthetic data for the DoS class (a major class), as mentioned above, and it can be observed that the corresponding results are similar to those of the advanced deep learning models.

In summary, we found that neural network-based models combined with autoencoders could significantly improve the classification performance in both the binary and multi-classification tasks, and that they can be further improved by applying the generative model. From the perspective of the base model architecture, the DNN-based model and the CNN-based model showed similar classification performance under the same conditions, and no significant differences were found between the two models.

Fig. 9. Comparison of multi-classification results on the UNSW-NB15 dataset.

D. Experiments on the UNSW-NB15 dataset

To compare the performance of the models on a dataset with more diverse classes, we conducted experiments on the UNSW-NB15 dataset as another multi-classification scenario. As described above, UNSW-NB15 has 10 classes, including a normal class, three major classes, and six minor classes. For the minor classes, we determined that classes with a weight of less than 1% are extremely minor. As with the experiments on the NSL-KDD dataset, we used the original UNSW-NB15 training and testing datasets (175,341 and 82,332 records, respectively). Similarly, in the experiments on our system, we generated synthetic data for each class via the generative model and integrated them into the training dataset. Note that the evaluation of all models was conducted on the original UNSW-NB15 testing dataset.

Table 6 presents the experimental results for the multi-classification scenario on the UNSW-NB15 dataset. In these experiments, we generated synthetic data for all classes. In particular, we generated synthetic data to reach a total size of 50,000 for each major class and a total size of 30,000 for each minor class.


TABLE VII
EXPERIMENTAL RESULTS ON THE IOT-23 DATASET FOR MULTI-CLASSIFICATION TASKS.

                                        DDoS                            C&C                              PortScan
Classifier                 Accuracy  Recall  Precision  F1-score  Recall  Precision  F1-score  Recall  Precision  F1-score
DNN                        93.1%     100%    100%       100%      47.0%   100%       63.9%     100%    83.7%      91.1%
CNN                        93.7%     100%    100%       100%      48.4%   100%       65.2%     100%    86.4%      92.7%
LSTM                       93.5%     100%    100%       100%      47.0%   100%       63.9%     100%    85.4%      92.1%
DNNAE                      93.7%     100%    100%       100%      48.4%   100%       65.2%     100%    86.4%      92.7%
CNNAE                      93.7%     100%    100%       100%      48.4%   100%       65.2%     100%    86.4%      92.7%

G-LSTM, G-DNNAE, G-CNNAE   95.9%     100%    100%       100%      80.0%   100%       88.9%     100%    90.4%      95.0%

Additionally, we assumed that, for a given threat record, the classification was correct if the model classified the data into one of the classes corresponding to the attack category (even if the model did not predict the exact class). Accordingly, in the experiment on the UNSW-NB15 dataset, we report only accuracy as the performance measure, considering whether an attack was correctly classified as an attack. Figure 9 shows a comparison of the experimental results for the UNSW-NB15 dataset in the multi-classification scenario.
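To make this evaluation rule explicit, the sketch below counts a prediction for a threat record as correct whenever it lands on any attack class, even if the exact class is missed, and reports the resulting accuracy per true threat class. The class names and the normal-class label are illustrative assumptions; only the counting rule itself is taken from the text.

import numpy as np

NORMAL_LABEL = "Normal"   # illustrative label for the benign class

def per_class_attack_accuracy(y_true, y_pred, attack_classes):
    """For each true threat class, the fraction of its records that the model
    assigned to *some* attack class (exact class not required)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    results = {}
    for cls in attack_classes:
        mask = (y_true == cls)
        if mask.sum() == 0:
            continue
        predicted_as_attack = y_pred[mask] != NORMAL_LABEL
        results[cls] = float(np.mean(predicted_as_attack))
    return results

# Example (UNSW-NB15 threat classes):
# acc_per_class = per_class_attack_accuracy(
#     y_test, y_hat,
#     attack_classes=["Generic", "Exploit", "Fuzzers", "DoS", "Reconnaissance",
#                     "Analysis", "Backdoors", "Shellcode", "Worms"])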
As shown in Table 6, G-DNNAE and G-CNNAE outperformed the other models in terms of classification performance. For the major classes, Generic, Exploit, and Fuzzers, the naïve and advanced deep learning models showed similar performance, and it was observed that the proposed models could improve the classification performance for the major classes even in the LSTM-based model. In particular, the generator-combined models showed significant performance improvement in the Generic and Fuzzers classes (up to about 5%). In the case of the minor classes, the proposed models showed a moderate performance improvement overall. Notably, G-LSTM, G-DNNAE, and G-CNNAE achieved about a 3% performance improvement in the Backdoors class (which possessed a weight of approximately 1% within the distribution) compared with the other models. Through experiments on the UNSW-NB15 dataset, which contains more diverse classes, we found that the proposed model could improve the classification performance for the major classes. Moreover, we found that the implemented generative model could further improve the classification performance in the minor and extremely minor classes.

Although the proposed framework can improve the classification performance, there is still the problem of relatively low detection rates for some classes. In particular, all the experimented models were observed to have relatively low detection rates for the DoS class, even in the LSTM-based model, which is suitable for detecting temporally correlated attacks. Regarding these results, we infer that the domain space between classes is heavily overlapping [34], resulting in low detection rates for some classes.

E. Experiments on the IoT dataset

To evaluate the performance of the proposed systems in IoT environments, we conducted experiments on the IoT-23 dataset. As described above, we utilized the dataset collected in the Mirai botnet scenario (CTU-IoT-Malware-Capture-34-1) and intentionally simulated an extreme data imbalance scenario. For evaluation, we randomly split the dataset into training and test datasets at a ratio of 7:3 within each class (i.e., 84,855 training records and 36,367 test records). In the experiments on the proposed system, we generated synthetic data to attain a total size of 30,000 for each malicious class in the training dataset, and evaluated the performance of all models using the previously separated test data (36,367 rows).

Fig. 10. Comparison of multi-classification results on the IoT-23 dataset.

Table 7 presents the experimental results for the multi-classification task on the IoT-23 dataset, and Figure 10 shows a comparison of the experimental results. Overall, all the models achieved an accuracy greater than 93%, and the models were observed to have perfect classification performance for the DDoS class, even with the naïve deep learning approaches. Moreover, we observed that there was no significant difference in performance between the advanced models and the naïve models. In the case of the C&C class, all models achieved a precision of 100%. For the proposed models, all the generator-combined models showed the same performance and achieved a significant improvement in recall, reaching 80%. These results are presumably due to the fact that the IoT-23 dataset is relatively simple and its features carry powerful information related to the nature of the attack (e.g., 'history').


TABLE VIII
EXPERIMENTAL RESULTS ON THE REAL DATASET FOR BINARY-CLASSIFICATION TASKS.

                          Normal                             Abnormal
Classifier  Accuracy  Recall  Precision  F1-score   Recall  Precision  F1-score
DNN         94.7%     97.0%   93.0%      94.9%      89.5%   76.9%      82.7%
CNN         95.0%     97.0%   94.5%      95.7%      90.0%   77.5%      83.2%
LSTM        95.2%     97.4%   94.6%      95.9%      89.8%   77.4%      83.1%
DNNAE       95.2%     97.2%   94.7%      95.9%      90.0%   77.5%      83.2%
CNNAE       95.2%     97.3%   94.6%      95.9%      90.2%   77.3%      83.2%

G-LSTM      95.2%     97.3%   94.6%      95.9%      89.8%   77.4%      83.1%
G-DNNAE     95.5%     97.2%   94.5%      95.8%      95.2%   92.5%      93.8%
G-CNNAE     95.6%     97.2%   94.5%      95.8%      95.2%   92.5%      93.8%

In addition, regarding these results, we conjecture that the trained generative model generated plausible data points that fall within a certain region of the C&C distribution (a region appearing in the test dataset, but not in the training dataset), and thus partially covered the missing region in the (extended) training dataset. Moreover, since a portion of the test data lies in this region, we estimate that this is why G-LSTM, G-DNNAE, and G-CNNAE performed significantly better than the other models. For the PortScan class, which is extremely minor, all models achieved a recall of 100%, and the proposed systems achieved the highest precision value of 90.4%.
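For reference, the IoT-23 preprocessing described above in this subsection (a 7:3 split applied within each class, followed by topping up every malicious class to a fixed target size with synthetic records) could be sketched as follows. The sample_from_generator helper is the same hypothetical stand-in used earlier, the benign-class label is an assumption, and scikit-learn's train_test_split is assumed only for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split

def split_and_augment(df, label_col, sample_from_generator,
                      benign_label="Benign", target_size=30_000,
                      test_ratio=0.3, seed=42):
    # 7:3 split performed within each class via stratification
    train_df, test_df = train_test_split(
        df, test_size=test_ratio, stratify=df[label_col], random_state=seed)

    parts = [train_df]
    for cls, count in train_df[label_col].value_counts().items():
        if cls == benign_label or count >= target_size:
            continue
        # Top up the malicious class to `target_size` with synthetic records
        fake = sample_from_generator(cls, target_size - count)
        fake[label_col] = cls
        parts.append(fake)

    # The test split is left untouched and used only for evaluation.
    return pd.concat(parts, ignore_index=True), test_df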

F. Experiments on the collected real dataset


To analyze the feasibility of the proposed system in a real environment, we collected real network flow data with raw security events from a large enterprise system and conducted experiments on this real dataset. As in the above experiments, we randomly split the collected dataset into training and test datasets at a ratio of 7:3 in both the normal and abnormal classes (i.e., 3,347,639 training records and 1,434,703 test records). Note that we only considered the binary classification scenario in the experiments on the real environment. As shown in Table 8, the dataset possesses a severe imbalance between the normal and abnormal classes. In the experiments on the proposed system, we generated synthetic data for the abnormal class so that it reached the same size as the normal class, and we evaluated the performance of all models using the previously partitioned test dataset, as in the previous experiments.

Table 8 presents the experimental results on the real dataset, and Figure 11 shows a comparison of the experimental results. First of all, it can be seen that all models achieve superior performance in terms of accuracy, as the dataset consists of 95.1% normal data and 4.9% anomalous data. Moreover, there was no significant difference between the naïve and advanced models in terms of classification performance, as in the experiment on the IoT dataset. From the perspective of each class, the models achieved high F1-scores for normal data, as expected, but relatively low recall values were measured for abnormal data. In the case of the proposed models, G-DNNAE and G-CNNAE achieved F1-scores of 93.8% for the abnormal class, and we observed that the deployed generative model could significantly improve the classification performance of minor classes even in the real system.

Fig. 11. Comparison of binary classification results on the real dataset.

G. Evaluation

Through comprehensive experiments on various datasets, we demonstrated that the proposed system significantly outperforms previous deep learning approaches and showed that the classification performance for minor classes can be greatly improved through the generative model. In particular, the proposed models showed a noticeable performance improvement for the R2L and Probe classes on the NSL-KDD dataset. In addition, we confirmed that the proposed model can significantly improve the detection rate for most classes on the UNSW-NB15 dataset. Moreover, through experiments on the IoT dataset, we observed that our system can efficiently detect network threats in a distributed environment. To demonstrate the feasibility in real-world environments, we collected real data and tested our system in the binary classification scenario. Through experiments on the real dataset, we demonstrated that the proposed model could improve the detection performance of network anomalies by resolving the data imbalance problem, and that the proposed system can be effectively applied in real-world environments.


VI. CONCLUSION

In this study, we presented a novel AI-based NIDS that can efficiently resolve the data imbalance problem and improve the classification performance of previous systems. To address the data imbalance problem, we leveraged a state-of-the-art generative model that could generate plausible synthetic data and measure the convergence of training. Moreover, we implemented autoencoder-driven detection models based on DNN and CNN, and demonstrated that the proposed models outperform previous machine learning and deep learning approaches. The proposed system was analyzed on various datasets, including two benchmark datasets, an IoT dataset, and a real dataset. In particular, the proposed models achieved accuracies of up to 93.2% and 87% on the NSL-KDD dataset and the UNSW-NB15 dataset, respectively, and showed remarkable performance improvement in the minor classes. In addition, through experiments on an IoT dataset, we demonstrated that the proposed system can efficiently detect network threats in a distributed environment. Moreover, in order to investigate the feasibility in real-world environments, we collected real data from a large enterprise system and evaluated the proposed model on the collected dataset. Through this experiment, we demonstrated that the proposed model can significantly improve the detection rate of network threats by resolving the data imbalance problem in the real environment.

In the future, by considering practical distributed environments, we will focus on applying our framework to federated learning systems and ensemble AI systems to enhance network threat detection. In addition, we will study adversarial attacks that can bypass AI-based NIDS through vulnerabilities in AI models and conduct research on enhanced NIDS that can resist these attacks in real-world environments.

REFERENCES

[1] J. R. Quinlan, "C4.5: Programs for machine learning," Morgan Kaufmann Ser. Mach. Learn., San Mateo, CA: Morgan Kaufmann, 1993.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[4] I. J. Goodfellow et al., "Generative adversarial nets," in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 2672-2680.
[5] D. Berthelot, T. Schumm, and L. Metz, "BEGAN: Boundary equilibrium generative adversarial networks," 2017, arXiv:1703.10717. [Online]. Available: http://arxiv.org/abs/1703.10717
[6] S. Hettich and S. D. Bay. (1999). KDD Cup 1999 Data. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A detailed analysis of the KDD CUP 99 data set," in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl., Jul. 2009, pp. 1-6.
[8] N. Moustafa and J. Slay, "UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set)," in Proc. Military Commun. Inf. Syst. Conf. (MilCIS), 2015, pp. 1-6.
[9] A. Parmisano, S. Garcia, and M. J. Erquiaga. (2020). A Labeled Dataset With Malicious and Benign IoT Network Traffic. [Online]. Available: https://www.stratosphereips.org/datasets-iot23
[10] B. Ingre and A. Yadav, "Performance analysis of NSL-KDD dataset using ANN," in Proc. Int. Conf. Signal Process. Commun. Eng. Syst., Andhra Pradesh, India, Jan. 2015, pp. 92-96.
[11] Y. Gao, Y. Liu, Y. Jin, J. Chen, and H. Wu, "A novel semi-supervised learning approach for network intrusion detection on cloud-based robotic system," IEEE Access, vol. 6, pp. 50927-50938, 2018.
[12] K. Alrawashdeh and C. Purdy, "Toward an online anomaly intrusion detection system based on deep learning," in Proc. IEEE 15th Int. Conf. Mach. Learn. Appl. (ICMLA), Anaheim, CA, USA, 2016, pp. 195-200.
[13] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, "Deep learning approach for network intrusion detection in software defined networking," in Proc. Int. Conf. Wireless Netw. Mobile Commun. (WINCOM), 2016, pp. 258-263.
[14] Y. Imamverdiyev and F. Abdullayeva, "Deep learning method for denial of service attack detection based on restricted Boltzmann machine," Big Data, vol. 6, no. 2, pp. 159-169, Jun. 2018.
[15] W. Zhong, N. Yu, and C. Ai, "Applying big data based deep learning system to intrusion detection," Big Data Min. Anal., vol. 3, no. 3, pp. 181-195, Sep. 2020.
[16] M. H. Haghighat and J. Li, "Intrusion detection system using voting-based neural network," Tsinghua Sci. Technol., vol. 26, no. 4, pp. 484-495, Aug. 2021.
[17] Y. Yang, X. Yang, M. Heidari, M. A. Khan, G. Srivastava, M. Khosravi, and L. Qi, "ASTREAM: Data-stream-driven scalable anomaly detection with accuracy guarantee in IIoT environment," IEEE Trans. Netw. Sci. Eng., early access, Mar. 2022, doi: 10.1109/TNSE.2022.3157730.
[18] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation-based anomaly detection," ACM Trans. Knowl. Discov. Data, vol. 6, no. 1, pp. 1-39, Mar. 2012.
[19] X. Zhang, W. Dou, Q. He, R. Zhou, C. Leckie, R. Kotagiri, and Z. Salcic, "LSHiForest: A generic framework for fast tree isolation based ensemble anomaly analysis," in Proc. IEEE 33rd Int. Conf. Data Eng. (ICDE), Apr. 2017, pp. 983-994.
[20] L. Qi, Y. Yang, X. Zhou, W. Rafique, and J. Ma, "Fast anomaly identification based on multi-aspect data streams for intelligent intrusion detection toward secure Industry 4.0," IEEE Trans. Ind. Inf., vol. 18, no. 9, pp. 6503-6511, Sep. 2022.
[21] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proc. Int. Conf. Platform Technol. Service (PlatCon), 2016, pp. 1-5.
[22] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954-21961, 2017.
[23] C. Xu, J. Shen, X. Du, and F. Zhang, "An intrusion detection system using a deep neural network with gated recurrent units," IEEE Access, vol. 6, pp. 48697-48707, 2018.
[24] J. Gao, L. Gan, F. Buschendorf, L. Zhang, H. Liu, P. Li, X. Dong, and T. Lu, "Omni SCADA intrusion detection using deep learning algorithms," IEEE Internet Things J., vol. 8, no. 2, pp. 951-961, Jan. 2021.
[25] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, "A deep learning approach for network intrusion detection system," EAI Endorsed Trans. Secur. Saf., vol. 3, no. 9, p. e2, May 2016.
[26] B. Yan and G. Han, "Effective feature extraction via stacked sparse autoencoder to improve intrusion detection system," IEEE Access, vol. 6, pp. 41238-41248, 2018.
[27] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, "A deep learning approach to network intrusion detection," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 1, pp. 41-50, Feb. 2018.
[28] C. Ieracitano, A. Adeel, F. C. Morabito, and A. Hussain, "A novel statistical analysis and autoencoder driven intelligent intrusion detection approach," Neurocomputing, vol. 387, pp. 51-62, Apr. 2020.
[29] J. Y. Kim, S. J. Bu, and S. B. Cho, "Malware detection using deep transferred generative adversarial networks," in Proc. Int. Conf. Neural Inf. Process., Guangzhou, China: Springer, 2017, pp. 556-564.
[30] M. H. Shahriar, N. I. Haque, M. A. Rahman, and M. Alonso, "G-IDS: Generative adversarial networks assisted intrusion detection system," in Proc. IEEE 44th Annu. Comput., Softw., Appl. Conf. (COMPSAC), Jul. 2020, pp. 376-385.
[31] I. Yilmaz, R. Masum, and A. Siraj, "Addressing imbalanced data problem with generative adversarial network for intrusion detection," in Proc. IEEE 21st Int. Conf. Inf. Reuse Integr. Data Sci. (IRI), Las Vegas, NV, USA, 2020, pp. 25-30.
[32] D. Li, D. Kotani, and Y. Okabe, "Improving attack detection performance in NIDS using GAN," in Proc. IEEE 44th Annu. Comput., Softw., Appl. Conf. (COMPSAC), Jul. 2020, pp. 817-825.
[33] W. Lee, B. Noh, Y. Kim, and K. Jeong, "Generation of network traffic using WGAN-GP and a DFT filter for resolving data imbalance," in Proc. Int. Conf. Internet Distrib. Comput. Syst. (IDCS), Springer, Oct. 2019, pp. 306-317.
[34] G. Dlamini and M. Fahim, "DGM: A data generative model to improve minority class presence in anomaly detection domain," Neural Comput. Appl., vol. 2021, pp. 13635-13646, Apr. 2021.


[35] D. Li, D. Chen, J. Goh, and S.-k. Ng, “Anomaly detection with
generative adversarial networks for multivariate time series,” 2018,
arXiv:1809.04758. [Online]. Available: http://arxiv.org/abs/1809.04758
[36] S. K. Alabugin and A. N. Sokolov, “Applying of generative adversarial
networks for anomaly detection in industrial control systems,” in Proc.
Global Smart Ind. Conf. (GloSIC), Nov. 2020, pp. 199–203.
[37] I. Siniosoglou, P. Radoglou-Grammatikis, G. Efstathopoulos, P. Fouliras,
and P. Sarigiannidis, “A unified deep learning anomaly detection and
classification approach for smart grid environments,” IEEE Trans. Netw.
Service Manage., vol. 18, no. 2, pp. 1137–1151, Jun. 2021.
[38] D. E. Rumelhart and J. L. McClelland, “Learning internal representations
by error propagation,” in Proc. Parallel Distrib. Process., Explorations
Microstruct. Cogn., Found., vol. 1. Cambridge, MA, USA: MIT Press,
1987, pp. 318–362.
[39] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and helmholtz free energy,” in Proc. 6th Int. Conf. Neural Inf.
Process. Syst., 1993, pp. 3–10.
[40] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
learning with deep convolutional generative adversarial networks,” 2016.
[Online]. Available: https://arxiv.org/abs/1511.06434.
[41] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”
2014, arXiv:1411.1784.
[42] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative ad-
versarial networks,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), 2017,
pp. 214–223.
[43] J. Lee, J. Kim, I. Kim, and K. Han, “Cyber threat detection based on
artificial neural networks using event profiles,” IEEE Access, vol. 7, pp.
165607–165626, 2019.
