arXiv:1906.05799v2 [cs.CR] 20 Jun 2019

Abstract—The scale of Internet-connected systems has increased considerably, and these systems are being exposed to cyber attacks more than ever. The complexity and dynamics of cyber attacks require protecting mechanisms to be responsive, adaptive, and large-scale. Machine learning, or more specifically deep reinforcement learning (DRL), methods have been proposed widely to address these issues. By incorporating deep learning into traditional RL, DRL is highly capable of solving complex, dynamic, and especially high-dimensional cyber defense problems. This paper presents a survey of DRL approaches developed for cyber security. We touch on different vital aspects, including DRL-based security methods for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory simulations for defense strategies against cyber attacks. Extensive discussions and future research directions on DRL-based cyber security are also given. We expect that this comprehensive review provides the foundations for and facilitates future studies on exploring the potential of emerging DRL to cope with increasingly complex cyber security problems.

Index Terms—review, survey, cyber security, cyber defense, cyber attacks, deep reinforcement learning, deep learning, Internet of Things, IoT.

I. INTRODUCTION

in the cyberspace. On the attacker side, ML is utilized to make attacks more sophisticated so that they can pass through defense strategies. On the cyber security side, ML is employed to make defense strategies smarter, more robust, and higher-performing, so that they can adaptively prevent attacks and reduce the impacts or damage incurred. Among these ML applications, unsupervised and supervised learning methods have been used widely for intrusion detection [10]–[15], malware detection [16]–[19], cyber-physical attacks [20], [21], and data privacy protection [22]. In principle, unsupervised methods explore the structure and patterns of data without using their labels, while supervised methods learn by examples based on data labels. These methods, however, cannot provide dynamic and sequential responses against cyber attacks, especially new or deformed threats. Also, the detection and defending responses usually take place after the attacks, when the attack details and data become available for collection and analysis, and thus proactive defense solutions are hindered. A statistical study shows that 62% of attacks were recognized only after they had caused significant damage to the cyber systems [23].

Reinforcement learning (RL), another branch of ML, is the closest form of human learning because it can learn by its own experience through exploring and exploiting the unknown
Alternatively, Zhu et al. [47] explored MEC policies by using the context awareness concept that represents the user's context information and traffic pattern statistics. The use of AI technologies at the mobile network edges is advocated to intelligently exploit the operating environment and make the right decisions regarding what, where, and how to cache appropriate contents. To increase the caching performance, a DRL approach, i.e., the asynchronous advantage actor-critic algorithm [48], is used to find an optimal policy aiming to maximize the offloading traffic.

Findings from our current survey show that applications of DRL in cyber environments are generally categorized under two perspectives: optimizing and enhancing the communications and networking capabilities of IoT applications, e.g., [49]–[56], and defending against cyber attacks. This paper focuses on the latter, where DRL methods are used to solve cyber security problems in the presence of cyber attacks or threats. The next section provides a background of DRL methods, followed by a detailed survey of DRL applications in cyber security in Section 3. We group these applications into three major categories, including DRL-based security solutions for cyber-physical systems, autonomous intrusion detection techniques, and multi-agent DRL-based game theory for cyber security. Section 4 concludes the paper with extensive discussions and future research directions on DRL for cyber security.

II. DEEP REINFORCEMENT LEARNING PRELIMINARY

Different from the other popular branch of ML, i.e., supervised methods learning by examples, RL characterizes an agent by creating its own learning experiences through interacting directly with the environment. RL is described by concepts of state, action, and reward (Fig. 1). It is a trial-and-error approach in which the agent takes an action at each time step that causes two changes: the current state of the environment is changed to a new state, and the agent receives a reward or penalty from the environment. Given a state, the reward is a function that can tell the agent how good or bad an action is. Based on received rewards, the agent learns to take more good actions and gradually filters out bad actions.

The discount factor γ ∈ [0, 1] manages the importance levels of future rewards. It is applied as a mathematical trick to analyze the learning convergence. In practice, discounting is necessary because of partial observability or uncertainty of the stochastic environment.

Q-learning needs to use a lookup table or Q-table to store expected rewards (Q-values) of actions given a set of states. This requires a large memory when the state and action spaces increase. Real-world problems often involve continuous state or action spaces, and therefore Q-learning is inefficient for solving these problems. Fortunately, deep learning has emerged as a powerful tool that is a great complement to traditional RL methods. With the power of function approximation and representation learning, deep learning can learn a compact low-dimensional representation of raw high-dimensional data [58]. The combination of deep learning and RL was the research direction that Google DeepMind initiated and pioneered. They proposed the deep Q-network (DQN), with the use of a deep neural network (DNN) to enable Q-learning to deal with high-dimensional sensory inputs [28], [59].

Fig. 2. DQN architecture with the loss function described by L(β) = E[(r + γ max_a′ Q(s′, a′|β′) − Q(s, a|β))²], where β and β′ are parameters of the estimation and target deep neural networks respectively. Each action taken by the agent generates an experience, which consists of the current state s, action a, reward r, and next state s′. These learning experiences (samples) are stored in the experience replay memory and are then retrieved randomly for a stable learning process.
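The Q-table and discount factor described above can be made concrete with a minimal tabular Q-learning loop. The chain environment below is an illustrative toy of our own, not an example from the survey; it shows how the table grows with the number of (state, action) pairs and how γ weights future rewards:

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=300,
               alpha=0.1, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0
    moves left; entering the last state ends the episode with reward 1."""
    random.seed(seed)
    # The lookup table (Q-table): one entry per (state, action) pair,
    # which is why memory grows with the size of the state/action spaces.
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Bellman update: gamma discounts the value of future rewards.
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

With γ = 0.9, the learned values decay by roughly a factor of 0.9 per step away from the reward, which is the "importance of future rewards" the text refers to.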
As a value-based method, DQN takes a long training time and has limitations in solving problems with continuous action spaces. Value-based methods, in general, evaluate the goodness of an action given a state using the Q-value function. When the number of states or actions is large or infinite, they show inefficiency or even impracticality. Another type of RL, i.e., policy gradient methods, has solved this problem effectively. These methods aim to derive actions directly by learning a policy π(s, a) that is a probability distribution over all possible actions. Trust region policy optimization (TRPO) [60] and proximal policy optimization (PPO) [61] are notable policy gradient methods. The gradient estimation, however, often suffers from large fluctuation [62]. The combination of value-based and policy-based methods has thus been developed to aggregate the advantages and eradicate the disadvantages of these two methods. That kind of combination has been structured in another type of RL, i.e., actor-critic methods. This structure comprises two components: an actor and a critic, which can both be characterized by DNNs. The actor attempts to learn a policy by receiving feedback from the critic. This iterative process helps the actor improve its strategy and converge to an optimal policy. Asynchronous advantage actor-critic (A3C) is a popular actor-critic method whose structure consists of a hierarchy of a master learning agent (global) and individual learners (workers) [48]. Both the master agent and individual learners are modeled by DNNs, each having two outputs: one for the critic and another for the actor (Fig. 3). The first output is a scalar value representing the expected reward of a given state V(s), while the second output is a vector of values representing a probability distribution over all possible actions π(s, a).

The value loss function of the critic is specified by:

L1 = Σ (R − V(s))²   (2)

where R = r + γV(s′) is the discounted future reward. Also, the actor pursues minimization of the following policy loss function:

L2 = −log(π(a|s)) · A(s) − ϑH(π)   (3)

where A(s) = R − V(s) is the estimated advantage function, and H(π) is the entropy term, which handles the exploration capability of the agent, with the hyperparameter ϑ controlling the strength of the entropy regularization. The advantage function A(s) shows how advantageous the agent is when it is in a particular state. The learning process of A3C is asynchronous because each learner interacts with its separate environment and updates the master network independently. This process is iterated, and the master network is the one to use when the learning is finished.

Table I summarizes comparable features of value-based, policy-based, and actor-critic methods, and their typical example algorithms. The following section examines the venture of these DRL algorithms into the field of cyber security under three broad categories: cyber-physical systems, autonomous intrusion detection, and cyber multi-agent game theory.

III. DRL IN CYBER SECURITY: A SURVEY

A large number of applications of RL to various aspects of cyber security have been proposed in the literature, ranging from data privacy to critical infrastructure protection. However, drawbacks of traditional RL have restricted its capability in solving complex and large-scale cyber security problems. The increasing number of connected IoT devices in recent years has led to a significant increase in the number of cyber attack instances as well as their complexity. The emergence of deep learning and its integration with RL have created a generation of DRL methods that are able to detect and fight against the most recent and sophisticated types of cyber attacks. This section provides a comprehensive survey of state-of-the-art DRL-powered solutions for cyber security, ranging from defense methods for cyber-physical systems to autonomous intrusion detection approaches and game theory-based multi-agent solutions.
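The two A3C losses in Eqs. (2) and (3) can be written out directly for a single experience tuple. The snippet below is a sketch of the per-sample terms (the function names are ours, and the sum in Eq. (2) is reduced to one sample):

```python
import math

def critic_value_loss(r, v_s, v_s_next, gamma=0.99):
    """Eq. (2) for one sample: squared error between the bootstrapped
    return R = r + gamma * V(s') and the critic's estimate V(s)."""
    R = r + gamma * v_s_next
    return (R - v_s) ** 2

def actor_policy_loss(pi_a, r, v_s, v_s_next, policy, gamma=0.99, theta=0.01):
    """Eq. (3): -log pi(a|s) * A(s) - theta * H(pi), where the advantage
    is A(s) = R - V(s) and H is the entropy of the action distribution."""
    R = r + gamma * v_s_next
    advantage = R - v_s              # A(s) = R - V(s)
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return -math.log(pi_a) * advantage - theta * entropy
```

Minimizing the policy loss increases the log-probability of an action with positive advantage, while the entropy term keeps the policy from collapsing too early, matching the exploration role of ϑH(π) described above.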
TABLE I
SUMMARY OF FEATURES OF DRL TYPES AND THEIR NOTABLE METHODS

Typical methods:
- Value-based: DQN [28], Double DQN [45], Dueling Q-network [46], Prioritized Replay DQN [63]
- Policy-based: TRPO [60], PPO [61], Deep Deterministic Policy Gradient (DDPG) [64], Distributed Distributional DDPG (D4PG) [65]
- Actor-critic: A3C [48], Unsupervised Reinforcement and Auxiliary Learning (UNREAL) [66]
obtaining login credentials via the use of phishing emails. This attack caused a partial plant shutdown and resulted in damage of millions of dollars. Likewise, there was a costly cyber attack on a power grid in Ukraine in late December 2015 that disrupted the electricity supply to a few hundred thousand end consumers [75].

In an effort to study cyber attacks on CPS, Feng et al. [75] characterized the cyber state dynamics by a mathematical model:

ẋ(t) = f(t, x, u, w, θ(t, a, d)),  x(t₀) = x₀   (4)

where x, u, and w represent the physical state, control inputs, and disturbances correspondingly (see Fig. 4). In addition, θ(t, a, d) describes the cyber state at time t, with a and d referring to cyber attack and defense respectively.

Fig. 4. The dynamics of attack and defense in a cyber-physical system. The physical layer is often uncertain with disturbances w, while cyber attack a directly affects the cyber layer, where a defense strategy d needs to be implemented. The dynamics of attack-defense characterized by θ(t, a, d) are injected into the conventional physical system to develop a cyber-physical co-modelling as presented in Eq. (4).

The CPS defense problem is then modeled as a two-player zero-sum game in which the utilities of the players sum to zero at each time step. The defender is represented by an actor-critic DRL algorithm. Simulation results demonstrate that the proposed method in [75] can learn an optimal strategy to timely and accurately defend the CPS from unknown cyber attacks.

Applications of CPS in critical safety domains such as autonomous automotive, chemical processes, automatic pilot avionics, and smart grid require a certain correctness level. Akazaki et al. [76] proposed the use of DRL, i.e., double DQN and A3C algorithms, to find falsifying inputs (counterexamples) for CPS models. This allows for effective yet automatic detection of CPS defects. Due to the infinite state space of CPS models, conventional methods such as simulated annealing [77] and cross entropy [78] were found inefficient. Experimental results show the superiority of the DRL algorithms against those methods in terms of the smaller number of simulation runs. This leads to a more practical detection process for defects of CPS models despite the great complexity of CPS software and physical systems.

Autonomous vehicles (AVs) operating in future smart cities require a robust processing unit for intra-vehicle sensors, including camera, radar, roadside smart sensors, and inter-vehicle beaconing. Such reliance is vulnerable to cyber-physical attacks aiming to get control of AVs by manipulating the sensory data and affecting the reliability of the system, e.g., increasing accident risks or reducing the vehicular flow. Ferdowsi et al. [79] examined scenarios where the attackers manage to inject faulty data into the AV's sensor readings while the AV (the defender) needs to deal with that problem to control the AV robustly. Specifically, the car-following model [80] is considered, in which the focus is on autonomous control of a car that closely follows another car. The defender aims to learn the leading vehicle's speed based on sensor readings. The attacker's objective is to mislead the following vehicle into a deviation from the optimal safe spacing. The interactions between attacker and defender are characterized by a game-theoretic problem. The interactive game structure and its DRL solution are diagrammed in Fig. 5. Instead of directly deriving a solution based on mixed-strategy Nash equilibrium analytics, the authors proposed the use of DRL to solve this dynamic game. Long short-term memory (LSTM) [81] is used to approximate the Q-function for both defending and attacking agents as it can capture the temporal dynamics of the environment.

Autonomous systems can be vulnerable to inefficiency from various sources such as noise in communication channels, sensor failures, errors in sensor measurement readings, packet errors, and especially cyber attacks. Deception attacks on autonomous systems are widespread; such an attack is initiated by an adversary whose effort is to inject noise into the communication channels between the sensors and the command center. This kind of attack leads to corrupted information being sent to the command center and eventually degrades the system
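A continuous-time co-model like Eq. (4) is typically advanced in simulation with a numerical integrator. The sketch below Euler-steps a hypothetical scalar linear f of our own (a stand-in, not Feng et al.'s actual dynamics) to show how the attack-defense state θ modulates the physical trajectory:

```python
def euler_step(x, u, w, theta, f, dt=0.01):
    """One explicit Euler step of x_dot = f(x, u, w, theta)."""
    return x + dt * f(x, u, w, theta)

def simulate(x0=1.0, steps=100, u=0.5, w=0.0, theta=0.8):
    """Roll the toy dynamics forward; here theta scales how much of the
    control input u actually reaches the plant (an attack lowers it)."""
    f = lambda x, u, w, theta: -x + theta * u + w  # hypothetical linear f
    x = x0
    for _ in range(steps):
        x = euler_step(x, u, w, theta, f)
    return x
```

With θ = 0.8 the state settles near θ · u = 0.4; a successful attack that drives θ toward zero starves the plant of control authority, which is the coupling between the cyber and physical layers that Eq. (4) captures.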
provide secure offloading to the edge nodes against jamming attacks. MEC is a technique that allows cloud computing functions to take place at the edge nodes of a cellular network, or generally of any network. This technology helps to decrease network traffic and reduce overhead and latency when users request access to contents that have been cached in the edges closer to the cellular customer. MEC systems, however, are vulnerable to cyber attacks because they are physically located closer to users and attackers, with less secure protocols compared to cloud servers or database centers. In [130], the RL methodology is used to select the defense levels and important parameters such as offloading rate and time, transmit channel, and power. As the network state space is large, the authors proposed the use of DQN to handle high-dimensional data, as illustrated in Fig. 7. DQN uses a convolutional neural network (CNN) to approximate the Q-function, which requires high computational complexity and memory. To mitigate this disadvantage, a transfer learning method named the hotbooting technique is used. The hotbooting method helps to initialize the weights of the CNN more efficiently by using experiences that have been learned in similar circumstances. This reduces the learning time by avoiding random explorations at the start of each episode. Simulation results demonstrate that the proposed method is effective in terms of enhancing the security and user privacy of MEC systems and that it can protect the systems when confronting different types of smart attacks with low overhead.

Fig. 7. Secure offloading method in MEC based on DQN with the hotbooting technique. The DQN agent's actions are to find optimal parameters such as offloading rate, power, and channel for the mobile device to offload the traces to the edge node accordingly. The attackers may deploy jamming, spoofing, DoS, or smart attacks to disrupt this process. By interacting with the edge caching systems, the agent can evaluate the reward of the previous action and obtain the new state, enabling it to select the next optimal action.

On the other hand, Aref et al. [131] introduced a multi-agent RL method to deal with anti-jamming communications in wideband autonomous cognitive radios (WACRs). WACRs are advanced radios that can sense the states of the radio frequency spectrum and network, and autonomously optimize their operating mode corresponding to the perceived states. Cognitive communication protocols, however, may struggle when there are unintentional interferences or malicious users who attempt to interrupt reliable communications by deliberate jamming. Each radio's effort is to occupy the available common wideband spectrum as much as possible and avoid the sweeping signal of a jammer that affects the entire spectrum band. The multi-agent RL approach proposed in [131] learns an optimal policy for each radio to select an appropriate sub-band, aiming to avoid jamming signals and interruptions from other radios. Comparative studies show the significant dominance of the proposed method over a random policy. A drawback of the proposed method is the assumption that the jammer uses a fixed strategy in responding to the WACRs' strategies, although the jammer may be able to perform adaptive jamming with cognitive radio technology. In [132], when the current spectrum sub-band is interfered with by a jammer, Q-learning is used to optimally select a new sub-band that allows uninterrupted transmission for as long as possible. The reward structure of the Q-learning agent is defined as the amount of time that the jammer or interferer takes to interfere with the WACR transmission. Experimental results using a hardware-in-the-loop prototype simulation show that the agent can detect the jamming patterns and successfully learns an optimal sub-band selection policy for jamming avoidance. The obvious drawback of this method is the use of a Q-table with a limited number of environment states.

The access right to spectrum (or, more generally, resources) is the main difference between CRNs and traditional wireless technologies. RL in general, or Q-learning in particular, has been investigated to produce optimal policies for cognitive radio nodes to interact with their radio-frequency environment [133]. Attar et al. [134] examined RL solutions against attacks on both CRN architectures, i.e., infrastructure-based, e.g., the IEEE 802.22 standard, and infrastructure-less, e.g., ad hoc CRN. Adversaries may attempt to manipulate the spectrum sensing process, which causes the main sources of security threats in infrastructure-less CRNs. The external adversary node is not part of the CRN, but such attackers can affect the operation of an ad hoc CRN via jamming attacks. In an infrastructure-based CRN, an exogenous attacker can mount incumbent emulation or perform sensor-jamming attacks. The attacker can increase the local false-alarm probability to affect the decision of the IEEE 802.22 base station about the availability of a given band. A jamming attack can have both short-term and long-term effects. Wang et al. [135] developed a game-theoretic framework to battle against jamming in CRNs, where each radio observes the status and quality of available channels and the strategy of jammers to make decisions accordingly. The CRN can learn an optimal channel utilization strategy using the minimax-Q learning policy [136], solving the problems of how many channels to use for data and for control packets, along with the channel switching strategy. The performance of minimax-Q learning, represented via spectrum-efficient throughput, is superior to the myopic learning method, which gives high priority to the immediate payoff and ignores the environment dynamics as well as the attackers' cognitive capability.

In CRNs, secondary users (SUs) are obliged to avoid disruptions to the communications of primary users (PUs) and can only gain access to the licensed spectrum when it is not occupied by PUs. Jamming attacks are emergent in CRNs due to the opportunistic access of SUs as well as the appearance of smart jammers, which can detect the transmission strategy of SUs. Xiao et al. [137] studied scenarios where a smart jammer
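The reward structure used in [132] (airtime until the jammer interferes) lends itself to a compact stateless Q-learning sketch. The single fixed-band jammer and the reward numbers below are illustrative assumptions of ours, not the paper's actual setup:

```python
import random

def antijam_qlearning(n_subbands=4, jammed=0, rounds=2000,
                      alpha=0.1, epsilon=0.1, seed=1):
    """Stateless Q-learning over sub-bands: the reward is transmission
    time before interference (a toy jammer camps on one sub-band)."""
    random.seed(seed)
    Q = [0.0] * n_subbands          # one Q-value per sub-band
    for _ in range(rounds):
        if random.random() < epsilon:
            b = random.randrange(n_subbands)   # explore
        else:
            b = max(range(n_subbands), key=lambda i: Q[i])  # exploit
        # Reward: short airtime on the jammed band, long otherwise.
        reward = 1.0 if b == jammed else 10.0
        Q[b] += alpha * (reward - Q[b])
    return Q
```

Because the jammed band keeps paying a low reward, its Q-value stays small and the greedy policy settles on an interference-free sub-band, mirroring the sub-band selection behavior described above.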
aims to disrupt the SUs rather than the PUs. The SUs and jammer, therefore, must sense the channel to check for the presence of PUs before making their decisions. The constructed scenarios consist of a secondary source node, supported by relay nodes, that transmits data packets to secondary receiving nodes. The smart jammer can quickly learn the frequency and transmission power of SUs, while SUs do not have full knowledge of the underlying dynamic environment. The interactions between SUs and the jammer are modeled as a cooperative transmission power control game, and the optimal strategy for SUs is derived based on the Stackelberg equilibrium [138]. The aim of the SU players is to select appropriate transmission powers to efficiently send data messages in the presence of jamming attacks. The jammer's utility gain is the SUs' loss and vice versa. RL methods, i.e., Q-learning [57] and WoLF-PHC [139], are used to model SUs as intelligent agents for coping with the smart jammer. WoLF-PHC stands for the combination of the Win or Learn Fast algorithm and the policy hill-climbing method. It uses a varying learning rate to foster convergence to the game equilibrium by adjusting the learning speed [139]. Simulation results show an improvement in the anti-jamming performance of the proposed method in terms of the signal-to-interference-plus-noise ratio (SINR). The optimal strategy achieved from the Stackelberg game can minimize the damage created by the jammer in the worst-case scenario.

Recently, Han et al. [140] introduced an anti-jamming system for CRNs using the DQN algorithm, based on a frequency-spatial anti-jamming communication game. The game simulates an environment of numerous jammers that inject jamming signals to disturb the ongoing transmissions of SUs. The SU should not interfere with the communications of PUs and must defeat smart jammers. This communication system is two-dimensional, utilizing both frequency hopping and user mobility. The RL state is the radio environment, consisting of PUs, SUs, jammers, and the serving base station/access point. The DQN is used to derive an optimal frequency hopping policy that determines whether the SU should leave an area of heavy jamming or choose a channel to send signals. Experimental results show the superiority of the DQN-based method over the Q-learning-based strategy in terms of faster convergence rate, increased SINR, lower cost of defense, and improved utility of the SU. DQN, with its core component CNN, helps to speed up the learning process of the system, which has a large number of frequency channels, compared with the benchmark Q-learning method.

To improve on the work of Han et al. [140], Liu et al. [141] also proposed an anti-jamming communication system using a DRL method but with different and more extensive contributions. Specifically, Liu et al. [141] used the raw spectrum information with temporal features, known as the spectrum waterfall [142], to characterize the environment state rather than using the SINR and PU occupancy as in [140]. Because of this, Liu et al.'s model does not necessitate prior knowledge of the jamming patterns and parameters of the jammer but rather uses the local observation data. This prevents the model from losing information and facilitates its adaptability to a dynamic environment. Furthermore, Liu et al.'s work does not assume that the jammer takes the same channel-slot transmission structure as the users, as in [140]. A recursive CNN is utilized to deal with a complex infinite environment state represented by the spectrum waterfall, which has a recursion characteristic. The model is tested using several jamming scenarios, which include sweeping jamming, comb jamming, dynamic jamming, and intelligent comb jamming. A disadvantage of both Han et al.'s and Liu et al.'s methods is that they can only derive an optimal policy for one user, which inspires a future research direction focusing on multiple users' scenarios.

2) Spoofing attacks: Spoofing attacks are popular in wireless networks, where the attacker claims to be another node using a faked identity, such as a media access control address, to gain access to the network illegitimately. This illegal penetration may lead to man-in-the-middle or DoS attacks [143]. Xiao et al. [144], [145] modeled the interactions between the legitimate receiver and spoofers as a zero-sum authentication game and utilized Q-learning and Dyna-Q [146] algorithms to address the spoofing detection problem. The utility of the receiver or spoofer is computed based on the Bayesian risk, which is the expected payoff in spoofing detection. The receiver aims to select the optimal test threshold in PHY-layer spoofing detection, while the spoofer needs to select an optimal attacking frequency. To prevent collisions, the spoofers cooperate to attack the receiver. Simulation and experimental results show the improved performance of the proposed methods against the benchmark method with a fixed test threshold. A disadvantage of the proposed approaches is that both action and state spaces are quantized into discrete levels, bounded within a specified interval, which may lead to locally optimal solutions.

3) Malware attacks: Among the most challenging malware threats to mobile devices are zero-day attacks, which exploit publicly unknown security vulnerabilities; until they are contained or mitigated, hackers might have already caused adverse effects on computer programs, data, or networks [147], [148]. To avoid such attacks, the traces or log data produced by the applications need to be processed in real time. With limited computational power, battery life, and radio bandwidth, mobile devices often offload specific malware detection tasks to security servers in the cloud for processing. The security server, with powerful computational resources and a more up-to-date malware database, can process the tasks more quickly and accurately, and then send a detection report back to mobile devices with less delay. The offloading process is, therefore, the key factor affecting cloud-based malware detection performance. For example, if too many tasks are offloaded to the cloud server, there will be radio network congestion that can lead to a long detection delay. Wan et al. [149] enhanced the mobile offloading performance by improving the previously proposed game model in [150]. The Q-learning approach used in [150] to select the optimal offloading rate suffers the curse of high dimensionality when the network size is increased or a large number of feasible offloading rates is available for selection. Wan et al. [149] thus advocated the use of hotbooting Q-learning and DQN, and showed performance improvements in terms of malware detection accuracy and speed compared to standard Q-learning. The cloud-based malware detection approach using DQN for selecting the
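The WoLF-PHC ("win or learn fast") rule mentioned above can be sketched as a single policy-update step. The function below is our own simplification of the hill-climbing update (the clipping and renormalization details are illustrative assumptions):

```python
def wolf_phc_update(Q, pi, pi_avg, a_best, d_win=0.01, d_lose=0.04):
    """One WoLF-PHC policy step: move pi toward the greedy action a_best,
    using the small rate d_win when winning (expected value under pi is
    at least that under the average policy) and the larger d_lose when
    losing."""
    winning = (sum(p * q for p, q in zip(pi, Q))
               >= sum(p * q for p, q in zip(pi_avg, Q)))
    delta = d_win if winning else d_lose
    n = len(pi)
    new_pi = [min(p + delta, 1.0) if a == a_best
              else max(p - delta / (n - 1), 0.0)
              for a, p in enumerate(pi)]
    total = sum(new_pi)  # renormalize after clipping to [0, 1]
    return [p / total for p in new_pi]
```

The variable learning rate is the point: an agent that is losing adapts quickly, while a winning agent changes cautiously, which is what fosters convergence to the game equilibrium as described in [139].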
offloading rate is illustrated in Fig. 8.

Fig. 8. Cloud-based malware detection using DQN, where the stochastic gradient descent (SGD) method is used to update the weights of the CNN. Malware detection is performed in the cloud server, which has more powerful computational resources than mobile devices. The DQN agent helps to select optimal task offloading rates for mobile devices to avoid network congestion and detection delay. By observing the network status and evaluating utility based on the malware detection report from the server, the agent can formulate states and rewards, which are used to generate a sequence of optimal actions, i.e., dynamic offloading rates.

4) Attacks in adversarial environments: Traditional networks facilitate direct communications between the client application and server, where each network has its switch control, which makes the network reconfiguration task time-consuming and inefficient. This method is also disadvantageous because the requested data may need to be retrieved from more than one database involving multiple servers. SDN is a next-generation networking technology, as it can reconfigure the network adaptively. With the control being programmable with a global view of the network architecture, SDN can manage and optimize network resources effectively. RL has been demonstrated broadly in the literature as a robust method for SDN control, e.g., [151]–[155].

Although RL's success in SDN control is abundant, the attacker may be able to falsify the defender's training process if it is aware of the network control algorithm in an adversarial environment. To deal with this problem, Han et al. [156] proposed the use of adversarial RL to build an autonomous defense system for SDN. The attacker selects important nodes in the network to compromise, for example, nodes in the backbone network or the target subnet. By propagating through the network, the attacker attempts to eventually compromise the critical server, while the defender prevents the server from being compromised and preserves as many unaffected nodes as possible. To achieve those goals, the RL defender takes four possible actions, consisting of "isolating", "patching", "reconnecting", and "migrating". Two types of DRL agents, i.e., double DQN and A3C, are trained to model defenders and to select appropriate actions given different network states. The reward is characterized based on the status of the critical server, the number of preserved nodes, the migration cost, and the validity of the actions taken. That study considered scenarios where attackers can penetrate the learning process of the RL defender by flipping reward signs or manipulating states. These causative attacks poison the defender's training process and cause it to perform sub-optimal actions. The adversarial training approach is applied to reduce the impact of poisoning attacks, and its eminent performance is demonstrated via several experiments using the popular network emulator Mininet [157].

In an adversarial environment, the defender may not know the private details of the attacker, such as the type of attack, the attacking target, frequency, and location. Therefore, the defender may, for example, allocate substantial resources to protect an asset that is not a target of the attacker. The defender needs to dynamically reconfigure defense strategies to increase the complexity and cost for the intruder. Zhu et al. [158] introduced a model where the defender and attacker can repeatedly change their defense and attack strategies. The defender has no prior knowledge about the attacker, such as launched attacks and attacking policies. However, it is aware of the attacker classes and can access the system utilities, which are jointly contributed by the defense and attack activities. Two interactive RL methods are proposed for cyber defense in [158], namely adaptive RL and robust RL. The adaptive RL handles attacks that have a diminishing exploration rate (a non-persistent attacker), while the robust RL deals with intruders who have a constant exploration rate (a persistent attacker). The interactions between defender and attacker are illustrated via the attack and defense cycles in Fig. 9. The attackers and defenders do not take actions simultaneously but asynchronously. On the attack cycle, the attacker evaluates previous attacks before launching a new attack if necessary. On the defense cycle, after receiving an alert, the defender carries out a meta-analysis of the latest attacks and calculates the corresponding utility before deploying a new defense if needed. An advantage of this system model is that it does not assume any underlying model for the attacker but instead treats attack strategies as black boxes.

Fig. 9. The defender and attacker interact via the intrusion detection system (IDS) in an adversarial environment, involving defense and attack cycles. Using these two cycles, a defender and an attacker can repeatedly change their defense and attack strategies. This model can be used to study defense strategies for different classes of attacks such as buffer over-read attacks [159] and code reuse attacks [160].

Alternatively, Elderman et al. [161] simulated cyber security problems in networking as a stochastic Markov game with two agents, one attacker and one defender, with incomplete information and partial observability. The attacker does not know the network topology but attempts to reach and gain access to the location that contains a valuable asset. The defender knows the internal network but does not see the attack types or the position of intruders. This is a challenging cyber security game because a player needs to adapt its strategy to defeat the unobservable opponents [162]. Different algorithms,
10
Different algorithms, e.g., Monte Carlo learning, Q-learning, and neural networks, are used to learn both the defender and the attacker. Simulation results show that Monte Carlo learning with softmax exploration is the best method for learning both attacking and defending strategies. Neural network algorithms have a limited adversarial learning capability, and thus they are outperformed by Q-learning and Monte Carlo learning techniques. A disadvantage of this simulation is that it simplifies the real-world cyber security problem into a game of only two players with a single asset. In practice, there can be multiple hackers simultaneously penetrating a server that holds valuable data. Also, a network may contain useful data in different locations instead of a single location, as simulated.

IV. DISCUSSIONS AND FUTURE RESEARCH DIRECTIONS

DRL has emerged over recent years as one of the most successful methods for designing and creating human-level or even superhuman AI agents. Much of this success has relied on the incorporation of DNNs into the framework of traditional RL to address complex and high-dimensional sequential decision-making problems. Applications of DRL algorithms have therefore been found in various fields, including IoT and cyber security. Computers and the Internet today play crucial roles in many areas of our lives, e.g., entertainment, communication, transportation, medicine, and even shopping. Much of our personal information and many important data are stored online. Even financial institutions, e.g., banks, mortgage companies, and brokerage firms, run their businesses online. It is therefore essential to have a security plan in place to prevent hackers from accessing our computer systems. This paper has presented a comprehensive survey of DRL methods and their applications to cyber security problems, with notable examples summarized in Table II.

The adversarial environment of cyber systems has instigated various proposals of game theory models involving multiple DRL agents. We found that this kind of application occupies a major proportion of the literature relating to DRL for cyber security. Another emerging area is the use of DRL for security solutions of cyber-physical systems. The large-scale and complex nature of CPSs, e.g., environmental monitoring networks, electrical smart grid systems, transportation management networks, and cyber manufacturing management systems, requires security solutions to be responsive and accurate. This has been addressed efficiently by various DRL approaches, e.g., the TRPO algorithm [82], LSTM-Q-learning [79], and double DQN and A3C [76]. In contrast, although there have been a large number of applications of traditional RL methods to intrusion detection systems, there has been little work on DRL algorithms for this kind of application. This is probably because the integration of deep learning and RL methods has only been established very recently, i.e., in the last few years, leaving researchers in the cyber intrusion detection area a time lag to catch up. The complexity and dynamics of intrusion detection can be addressed efficiently by the representation learning and function approximation capabilities of deep learning and the optimal sequential decision-making capability of RL collectively. There is thus a gap for future research where DRL's capabilities can be exploited fully to solve complex and sophisticated cyber intrusion detection problems.

Most DRL algorithms used for cyber defense so far are model-free methods, which are sample-inefficient as they require a large quantity of training data. These data are difficult to obtain in real cyber security practice. Researchers generally utilize a simulator to validate their proposed approaches, but these simulators often do not fully characterize the complexity and dynamics of the real cyber space of IoT systems. Model-based DRL methods are more appropriate than model-free methods when training data are limited because, with model-based DRL, data can be collected in a scalable way. Exploration of model-based DRL methods, or the integration of model-based and model-free methods, for cyber defense is thus an interesting future study. For example, function approximators can be used to learn a proxy model of the actual high-dimensional and possibly partially observable environment [163]–[165], which can then be employed to deploy planning algorithms, e.g., Monte-Carlo tree search techniques [166], to derive optimal actions. Alternatively, combined model-based and model-free approaches, such as a model-free policy with planning capabilities [167], [168] or model-based lookahead search [30], can be used, as they aggregate the advantages of both methods. On the other hand, the current literature on applications of DRL to cyber security is often limited to discretized action spaces, which restricts the full capability of DRL solutions to real-world problems. An example is the application of DRL for selecting optimal mobile offloading rates in [149], [150], where the action space has been discretized even though a small change of the rate can markedly affect the performance of the cloud-based malware detection system. Investigation of methods that can deal with continuous action spaces in cyber environments, e.g., policy gradient and actor-critic algorithms, is another encouraging research direction.

AI can help defend against cyber attacks but can also facilitate dangerous attacks, i.e., offensive AI. Hackers can take advantage of AI to make attacks smarter and more sophisticated, so that they bypass detection methods and penetrate computer systems or networks. For example, hackers may employ algorithms to observe the normal behaviors of users and exploit the users' patterns to develop untraceable attacking strategies. Machine learning-based systems can mimic humans to craft convincing fake messages that are utilized to conduct large-scale phishing attacks. Likewise, by creating highly realistic fake video or audio messages based on AI advances (i.e., deepfakes), hackers can spread false news in elections or manipulate financial markets [169]. Alternatively, attackers can poison the data pool used for training deep learning methods (i.e., machine learning poisoning), or they can manipulate states or policies and falsify part of the reward signals in RL to trick the agent into taking sub-optimal actions, resulting in the agent being compromised [170]. These kinds of attacks are difficult to prevent, detect, and fight against, as they are part of a battle between AI systems. Adversarial machine learning, especially with supervised methods, has been used extensively in cyber security [171], but very few studies have been found on using adversarial RL [172].
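The reward-falsification threat mentioned above can be illustrated with a minimal sketch: a tabular Q-learner on a small chain environment, where an attacker flips the sign of a fraction of the observed rewards during training. The environment, attack rate, and all parameters are illustrative assumptions, not taken from the surveyed studies:

```python
import random

# Illustration of reward-signal falsification: an attacker flips the sign of
# a fraction of rewards seen by a tabular Q-learner on a 5-state chain where
# stepping right from state 3 into state 4 pays +1. Illustrative assumptions.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left, move right

def train(flip_prob, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = rng.choice(ACTIONS)            # uniform random behavior policy
            s2 = min(max(s + a, 0), N_STATES - 1)
            r = 1.0 if s2 == GOAL else 0.0
            if rng.random() < flip_prob:       # causative attack: flip the sign
                r = -r
            # off-policy Q-learning update with the (possibly poisoned) reward
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, b)] for b in ACTIONS) - q[(s, a)])
            s = s2
            if s == GOAL:
                break
    return q

clean = train(flip_prob=0.0)
poisoned = train(flip_prob=0.9)
# poisoning depresses the learned value of the correct action at state 3
print(round(clean[(3, 1)], 2), round(poisoned[(3, 1)], 2))
```

The clean run converges toward the true value of moving right at the last state before the goal, while the poisoned run learns a much lower value for the same action, so a greedy policy extracted from it can be steered away from the correct behavior; this is the mechanism behind the reward-flipping causative attacks discussed for the SDN defender above.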
TABLE II
SUMMARY OF TYPICAL DRL APPLICATIONS IN CYBER SECURITY

Adversarial DRL, or DRL algorithms trained in various adversarial cyber environments, is worth comprehensive investigation, as it can be a solution for battling the increasingly complex offensive AI systems.

With the support of AI systems, cyber security experts no longer need to examine a huge volume of attack data manually to detect and defend against cyber attacks. This has many advantages because security teams alone cannot sustain the volume. AI-enabled defense strategies can be automated and deployed rapidly and efficiently, but these systems alone cannot issue creative responses when new threats are introduced. Moreover, human adversaries are always behind cybercrime and cyber warfare. Therefore, there is a critical need for human intellect teamed with machines for cyber defense. The traditional human-in-the-loop model for human-machine integration struggles to adapt quickly to cyber defense systems because the autonomous agent carries out part of the task and needs to halt and wait for the human's response before completing the task. The modern human-on-the-loop model would be a solution for a future human-machine teaming cyber security system. This model allows agents to autonomously perform their tasks while humans can monitor and intervene in the agents' operations only when necessary.
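A minimal sketch of the human-on-the-loop pattern can clarify the contrast with human-in-the-loop. The action names, risk set, and monitor rule below are hypothetical placeholders: routine actions execute autonomously, and only flagged high-risk actions pause for human review:

```python
# Sketch of human-on-the-loop control: the agent acts autonomously, and only
# actions flagged as high-risk pause for human review. The action names, the
# RISKY set, and the monitor rule are hypothetical placeholders.
RISKY = {"isolate_server", "migrate_critical_service"}

def human_monitor(action, context):
    """Stand-in for the human analyst: approve or veto a flagged action.
    The illustrative rule here: never isolate a server marked as critical."""
    return not (action == "isolate_server" and context.get("critical", False))

def run_agent(policy, events):
    executed = []
    for event in events:
        action = policy(event)
        if action in RISKY and not human_monitor(action, event):
            action = "alert_only"   # safe fallback when the human vetoes
        executed.append(action)     # routine actions never wait for a human
    return executed

# hypothetical policy: isolate on high-severity events, otherwise just log
policy = lambda e: "isolate_server" if e["severity"] > 7 else "log"
events = [{"severity": 9, "critical": True}, {"severity": 3, "critical": False}]
print(run_agent(policy, events))  # → ['alert_only', 'log']
```

Under human-in-the-loop, every action would block on human approval; here only the RISKY subset consults the monitor, which is what lets the agent keep operating at machine speed while preserving human oversight.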
How to integrate human knowledge into DRL algorithms [173] under the human-on-the-loop model for cyber defense is an interesting research question.

As hackers utilize increasingly sophisticated and large-scale approaches to attack computer systems and networks, defense strategies need to be more intelligent and large-scale as well. Multi-agent DRL is a research direction that can be explored to tackle this problem. The game theory models for cyber security reviewed in this paper have involved multiple agents, but they are restricted to a couple of attackers and defenders with limited communication, cooperation, and coordination among the agents. These aspects of multi-agent DRL need to be investigated thoroughly in cyber security problems to enable an effective large-scale defense plan. Challenges of multi-agent DRL itself then need to be addressed, such as non-stationarity, partial observability, and efficient multi-agent training schemes [174]. On the other hand, the RL methodology has been applied to deal with various cyber attacks, e.g., jamming, spoofing, false data injection, malware, DoS, DDoS, brute force, Heartbleed, botnet, web, and infiltration attacks [175]–[180]. However, recently emerged or new types of attacks have been largely unaddressed. One of these new types is the bit-and-piece DDoS attack. This attack injects small amounts of junk into the legitimate traffic of a large number of IP addresses, so that it can bypass many detection methods because there is so little of it per address. Another emerging attack, for instance, is attacking from the computing cloud to breach the systems of companies that manage IT systems for other firms or host other firms' data on their servers. Alternatively, hackers can use powerful quantum physics-based computers to crack the encryption algorithms that are currently used to protect various types of invaluable data [169]. Consequently, a future study on addressing these new types of attacks is encouraged.

REFERENCES

[1] Kreutz, D., Ramos, F. M., Verissimo, P., Rothenberg, C. E., Azodolmolky, S., and Uhlig, S. (2015). Software-defined networking: A comprehensive survey. Proceedings of the IEEE, 103(1), 14-76.
[2] Xia, W., Wen, Y., Foh, C. H., Niyato, D., and Xie, H. (2015). A survey on software-defined networking. IEEE Communications Surveys and Tutorials, 17(1), 27-51.
[3] Kakalou, I., Psannis, K. E., Krawiec, P., and Badea, R. (2017). Cognitive radio network and network service chaining toward 5G: Challenges and requirements. IEEE Communications Magazine, 55(11), 145-151.
[4] Naeem, A., Rehmani, M. H., Saleem, Y., Rashid, I., and Crespi, N. (2017). Network coding in cognitive radio networks: A comprehensive survey. IEEE Communications Surveys and Tutorials, 19(3), 1945-1973.
[5] Botta, A., De Donato, W., Persico, V., and Pescapé, A. (2016). Integration of cloud computing and Internet of Things: A survey. Future Generation Computer Systems, 56, 684-700.
[6] Shi, W., Cao, J., Zhang, Q., Li, Y., and Xu, L. (2016). Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5), 637-646.
[7] Abbas, N., Zhang, Y., Taherkordi, A., and Skeie, T. (2018). Mobile edge computing: A survey. IEEE Internet of Things Journal, 5(1), 450-465.
[8] Dastjerdi, A. V., and Buyya, R. (2016). Fog computing: Helping the Internet of Things realize its potential. Computer, 49(8), 112-116.
[9] Geluvaraj, B., Satwik, P. M., and Kumar, T. A. (2019). The future of cybersecurity: Major role of artificial intelligence, machine learning, and deep learning in cyberspace. In International Conference on Computer Networks and Communication Technologies (pp. 739-747). Springer, Singapore.
[10] Garcia-Teodoro, P., Diaz-Verdejo, J., Maciá-Fernández, G., and Vázquez, E. (2009). Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers and Security, 28(1-2), 18-28.
[11] Dua, S., and Du, X. (2011). Data Mining and Machine Learning in Cybersecurity. CRC Press.
[12] Buczak, A. L., and Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys and Tutorials, 18(2), 1153-1176.
[13] Apruzzese, G., Colajanni, M., Ferretti, L., Guido, A., and Marchetti, M. (2018, May). On the effectiveness of machine and deep learning for cyber security. In 2018 International Conference on Cyber Conflict (CyCon) (pp. 371-390). IEEE.
[14] Biswas, S. K. (2018). Intrusion detection using machine learning: A comparison study. International Journal of Pure and Applied Mathematics, 118(19), 101-114.
[15] Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M., Hou, H., and Wang, C. (2018). Machine learning and deep learning methods for cybersecurity. IEEE Access, 6, 35365-35381.
[16] Milosevic, N., Dehghantanha, A., and Choo, K. K. R. (2017). Machine learning aided Android malware classification. Computers and Electrical Engineering, 61, 266-274.
[17] Mohammed Harun Babu, R., Vinayakumar, R., and Soman, K. P. (2018). A short review on applications of deep learning for cyber security. arXiv preprint arXiv:1812.06292.
[18] Rege, M., and Mbah, R. B. K. (2018). Machine learning for cyber defense and attack. In The Seventh International Conference on Data Analytics (pp. 73-78). IARIA.
[19] Berman, D. S., Buczak, A. L., Chavis, J. S., and Corbett, C. L. (2019). A survey of deep learning methods for cyber security. Information, 10(4), 122.
[20] Ding, D., Han, Q. L., Xiang, Y., Ge, X., and Zhang, X. M. (2018). A survey on security control and attack detection for industrial cyber-physical systems. Neurocomputing, 275, 1674-1683.
[21] Wu, M., Song, Z., and Moon, Y. B. (2019). Detecting cyber-physical attacks in CyberManufacturing systems with machine learning methods. Journal of Intelligent Manufacturing, 30(3), 1111-1123.
[22] Xiao, L., Wan, X., Lu, X., Zhang, Y., and Wu, D. (2018). IoT security techniques based on machine learning. arXiv preprint arXiv:1801.06275.
[23] Sharma, A., Kalbarczyk, Z., Barlow, J., and Iyer, R. (2011, June). Analysis of security data from a large computing organization. In Dependable Systems and Networks (DSN), 2011 IEEE/IFIP 41st International Conference on (pp. 506-517). IEEE.
[24] Ling, M. H., Yau, K. L. A., Qadir, J., Poh, G. S., and Ni, Q. (2015). Application of reinforcement learning for security enhancement in cognitive radio networks. Applied Soft Computing, 37, 809-829.
[25] Wang, Y., Ye, Z., Wan, P., and Zhao, J. (2019). A survey of dynamic spectrum allocation based on reinforcement learning algorithms in cognitive radio networks. Artificial Intelligence Review, 51(3), 493-506.
[26] Nguyen, N. D., Nguyen, T., and Nahavandi, S. (2017). System design perspective for human-level agents using deep reinforcement learning: A survey. IEEE Access, 5, 27091-27102.
[27] Nguyen, T. T. (2018). A multi-objective deep reinforcement learning framework. arXiv preprint arXiv:1803.02965.
[28] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... and Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
[29] Nguyen, N. D., Nahavandi, S., and Nguyen, T. (2018, October). A human mixed strategy approach to deep reinforcement learning. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 4023-4028). IEEE.
[30] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... and Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.
[31] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., ... and Chen, Y. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354.
[32] Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., ... and Quan, J. (2017). StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
[33] Sun, P., Sun, X., Han, L., Xiong, J., Wang, Q., Li, B., ... and Zhang, T. (2018). TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193.
[34] Pang, Z. J., Liu, R. Z., Meng, Z. Y., Zhang, Y., Yu, Y., and Lu, T. (2018). On reinforcement learning for full-length game of StarCraft. arXiv preprint arXiv:1809.09095.
[35] Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., ... and Shanahan, M. (2018). Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830.
[36] Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., ... and Sonnerat, N. (2018). Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281.
[37] OpenAI (2019, March 1). OpenAI Five. Retrieved from https://openai.com/five/
[38] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017, May). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA) (pp. 3389-3396). IEEE.
[39] Isele, D., Rahimi, R., Cosgun, A., Subramanian, K., and Fujimura, K. (2018, May). Navigating occluded intersections with autonomous vehicles using deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 2034-2039). IEEE.
[40] Nguyen, T. T., Nguyen, N. D., Bello, F., and Nahavandi, S. (2019). A new tensioning method using deep reinforcement learning for surgical pattern cutting. arXiv preprint arXiv:1901.03327.
[41] Nguyen, N. D., Nguyen, T., Nahavandi, S., Bhatti, A., and Guest, G. (2019). Manipulating soft tissues by deep reinforcement learning for autonomous robotic surgery. arXiv preprint arXiv:1902.05183.
[42] Mahmud, M., Kaiser, M. S., Hussain, A., and Vassanelli, S. (2018). Applications of deep learning and reinforcement learning to biological data. IEEE Transactions on Neural Networks and Learning Systems, 29(6), 2063-2079.
[43] Popova, M., Isayev, O., and Tropsha, A. (2018). Deep reinforcement learning for de novo drug design. Science Advances, 4(7), eaap7885.
[44] He, Y., Yu, F. R., Zhao, N., Leung, V. C., and Yin, H. (2017). Software-defined networks with mobile edge computing and caching for smart cities: A big data deep reinforcement learning approach. IEEE Communications Magazine, 55(12), 31-37.
[45] Hasselt, H. V., Guez, A., and Silver, D. (2016, February). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (pp. 2094-2100). AAAI Press.
[46] Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. (2016, June). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995-2003).
[47] Zhu, H., Cao, Y., Wang, W., Jiang, T., and Jin, S. (2018). Deep reinforcement learning for mobile edge caching: Review, new features, and open issues. IEEE Network, 32(6), 50-57.
[48] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... and Kavukcuoglu, K. (2016, June). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (pp. 1928-1937).
[49] Zhang, Y., Yao, J., and Guan, H. (2017). Intelligent cloud resource management with deep reinforcement learning. IEEE Cloud Computing, 4(6), 60-69.
[50] Zhu, J., Song, Y., Jiang, D., and Song, H. (2017). A new deep-Q-learning-based transmission scheduling mechanism for the cognitive Internet of Things. IEEE Internet of Things Journal, 5(4), 2375-2385.
[51] Cheng, M., Li, J., and Nazarian, S. (2018, January). DRL-cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers. In Proceedings of the 23rd Asia and South Pacific Design Automation Conference (pp. 129-134). IEEE Press.
[52] Zhang, D., Han, X., and Deng, C. (2018). Review on the research and practice of deep learning and reinforcement learning in smart grids. CSEE Journal of Power and Energy Systems, 4(3), 362-370.
[53] He, X., Wang, K., Huang, H., Miyazaki, T., Wang, Y., and Guo, S. (2018). Green resource allocation based on deep reinforcement learning in content-centric IoT. IEEE Transactions on Emerging Topics in Computing. DOI: 10.1109/TETC.2018.2805718.
[54] He, Y., Liang, C., Yu, R., and Han, Z. (2018). Trust-based social networks with computing, caching and communications: A deep reinforcement learning approach. IEEE Transactions on Network Science and Engineering. DOI: 10.1109/TNSE.2018.2865183.
[55] Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y. C., and Kim, D. I. (2019). Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys and Tutorials. DOI: 10.1109/COMST.2019.2916583.
[56] Dai, Y., Xu, D., Maharjan, S., Chen, Z., He, Q., and Zhang, Y. (2019). Blockchain and deep reinforcement learning empowered intelligent 5G beyond. IEEE Network, 33(3), 10-17.
[57] Watkins, C. J., and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
[58] Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6), 26-38.
[59] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
[60] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015, June). Trust region policy optimization. In International Conference on Machine Learning (pp. 1889-1897).
[61] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[62] Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., ... and Abbeel, P. (2018). Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246.
[63] Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.
[64] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[65] Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., ... and Lillicrap, T. (2018). Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617.
[66] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
[67] Wang, L., Törngren, M., and Onori, M. (2015). Current status and advancement of cyber-physical systems in manufacturing. Journal of Manufacturing Systems, 37, 517-527.
[68] Zhang, Y., Qiu, M., Tsai, C. W., Hassan, M. M., and Alamri, A. (2017). Health-CPS: Healthcare cyber-physical system assisted by cloud and big data. IEEE Systems Journal, 11(1), 88-95.
[69] Shakeel, P. M., Baskar, S., Dhulipala, V. S., Mishra, S., and Jaber, M. M. (2018). Maintaining security and privacy in health care system using learning based deep-Q-networks. Journal of Medical Systems, 42(10), 186.
[70] Cintuglu, M. H., Mohammed, O. A., Akkaya, K., and Uluagac, A. S. (2017). A survey on smart grid cyber-physical system testbeds. IEEE Communications Surveys and Tutorials, 19(1), 446-464.
[71] Chen, Y., Huang, S., Liu, F., Wang, Z., and Sun, X. (2018). Evaluation of reinforcement learning-based false data injection attack to automatic voltage control. IEEE Transactions on Smart Grid, 10(2), 2158-2169.
[72] Ni, Z., and Paul, S. (2019). A multistage game in smart grid security: A reinforcement learning solution. IEEE Transactions on Neural Networks and Learning Systems. DOI: 10.1109/TNNLS.2018.2885530.
[73] Li, Y., Zhang, L., Zheng, H., He, X., Peeta, S., Zheng, T., and Li, Y. (2018). Nonlane-discipline-based car-following model for electric vehicles in transportation-cyber-physical systems. IEEE Transactions on Intelligent Transportation Systems, 19(1), 38-47.
[74] Li, C., and Qiu, M. (2019). Reinforcement Learning for Cyber-Physical Systems: with Cybersecurity Case Studies. CRC Press.
[75] Feng, M., and Xu, H. (2017, November). Deep reinforcement learning based optimal defense for cyber-physical system in presence of unknown cyber attack. In Computational Intelligence (SSCI), 2017 IEEE Symposium Series on (pp. 1-8). IEEE.
[76] Akazaki, T., Liu, S., Yamagata, Y., Duan, Y., and Hao, J. (2018, July). Falsification of cyber-physical systems using deep reinforcement learning. In International Symposium on Formal Methods (pp. 456-465). Springer, Cham.
[77] Abbas, H., and Fainekos, G. (2012, October). Convergence proofs for simulated annealing falsification of safety properties. In 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 1594-1601). IEEE.
[78] Sankaranarayanan, S., and Fainekos, G. (2012, April). Falsification of temporal properties of hybrid systems using the cross-entropy method. In Proceedings of the 15th ACM International Conference on Hybrid Systems: Computation and Control (pp. 125-134). ACM.
[79] Ferdowsi, A., Challita, U., Saad, W., and Mandayam, N. B. (2018). Robust deep reinforcement learning for security and safety in autonomous vehicle systems. arXiv preprint arXiv:1805.00983.
[80] Wang, X., Jiang, R., Li, L., Lin, Y. L., and Wang, F. Y. (2019). Long memory is important: A test study on deep-learning based car-following model. Physica A: Statistical Mechanics and its Applications, 514, 786-795.
[81] Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[82] Gupta, A., and Yang, Z. (2018). Adversarial reinforcement learning for observer design in autonomous systems under cyber attacks. arXiv preprint arXiv:1809.06784.
[83] Abubakar, A., and Pranggono, B. (2017, September). Machine learning based intrusion detection system for software defined networks. In 2017 Seventh International Conference on Emerging Security Technologies (EST) (pp. 138-143). IEEE.
[84] Jose, S., Malathi, D., Reddy, B., and Jayaseeli, D. (2018, April). A survey on anomaly based host intrusion detection system. In Journal of Physics: Conference Series (vol. 1000, no. 1, p. 012049). IOP Publishing.
[85] Roshan, S., Miche, Y., Akusok, A., and Lendasse, A. (2018). Adaptive and online network intrusion detection system using clustering and extreme learning machines. Journal of the Franklin Institute, 355(4), 1752-1779.
[86] Dey, S., Ye, Q., and Sampalli, S. (2019). A machine learning based intrusion detection scheme for data fusion in mobile clouds involving heterogeneous client networks. Information Fusion, 49, 205-215.
[87] Papamartzivanos, D., Mármol, F. G., and Kambourakis, G. (2019). Introducing deep learning self-adaptive misuse network intrusion detection systems. IEEE Access, 7, 13546-13560.
[88] Haider, W., Creech, G., Xie, Y., and Hu, J. (2016). Windows based data sets for evaluation of robustness of host based intrusion detection systems (IDS) to zero-day and stealth attacks. Future Internet, 8(3), 29.
[89] Deshpande, P., Sharma, S. C., Peddoju, S. K., and Junaid, S. (2018). HIDS: A host based intrusion detection system for cloud computing environment. International Journal of System Assurance Engineering and Management, 9(3), 567-576.
[90] Nobakht, M., Sivaraman, V., and Boreli, R. (2016, August). A host-based intrusion detection and mitigation framework for smart home IoT using OpenFlow. In 2016 11th International Conference on Availability, Reliability and Security (ARES) (pp. 147-156). IEEE.
[91] Resende, P. A. A., and Drummond, A. C. (2018). A survey of random forest based methods for intrusion detection systems. ACM Computing Surveys (CSUR), 51(3), 48.
[92] Kim, G., Yi, H., Lee, J., Paek, Y., and Yoon, S. (2016). LSTM-based system-call language modeling and robust ensemble method for designing host-based intrusion detection systems. arXiv preprint arXiv:1611.01726.
[93] Chawla, A., Lee, B., Fallon, S., and Jacob, P. (2018, September). Host based intrusion detection system with combined CNN/RNN model. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 149-158). Springer, Cham.
[94] Agah, A., Das, S. K., Basu, K., and Asadi, M. (2004, August). Intrusion detection in sensor networks: A non-cooperative game approach. In Third IEEE International Symposium on Network Computing and Applications (pp. 343-346). IEEE.
[95] Xu, X., Sun, Y., and Huang, Z. (2007, April). Defending DDoS attacks using hidden Markov models and cooperative reinforcement learning. In
[104] Malialis, K. (2014). Distributed Reinforcement Learning for Network Intrusion Response (Doctoral Dissertation, University of York, UK).
[105] Malialis, K., and Kudenko, D. (2015). Distributed response to network intrusions using multiagent reinforcement learning. Engineering Applications of Artificial Intelligence, 41, 270-284.
[106] Sutton, R. S., and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA.
[107] Yau, D. K., Lui, J. C., Liang, F., and Yam, Y. (2005). Defending against distributed denial-of-service attacks with max-min fair server-centric router throttles. IEEE/ACM Transactions on Networking, 13(1), 29-42.
[108] Bhosale, R., Mahajan, S., and Kulkarni, P. (2014). Cooperative machine learning for intrusion detection system. International Journal of Scientific and Engineering Research, 5(1), 1780-1785.
[109] Herrero, A., and Corchado, E. (2009). Multiagent systems for network intrusion detection: A review. In Computational Intelligence in Security for Information Systems (pp. 143-154). Springer, Berlin, Heidelberg.
[110] Detwarasiti, A., and Shachter, R. D. (2005). Influence diagrams for team decision analysis. Decision Analysis, 2(4), 207-228.
[111] Shamshirband, S., Patel, A., Anuar, N. B., Kiah, M. L. M., and Abraham, A. (2014). Cooperative game theoretic approach using fuzzy Q-learning for detecting and preventing intrusions in wireless sensor networks. Engineering Applications of Artificial Intelligence, 32, 228-241.
[112] Muñoz, P., Barco, R., and de la Bandera, I. (2013). Optimization of load balancing using fuzzy Q-learning for next generation wireless networks. Expert Systems with Applications, 40(4), 984-994.
[113] Shamshirband, S., Anuar, N. B., Kiah, M. L. M., and Patel, A. (2013). An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique. Engineering Applications of Artificial Intelligence, 26(9), 2105-2127.
[114] Varshney, S., and Kumar, R. (2018, January). Variants of LEACH routing protocol in WSN: A comparative analysis. In 2018 8th International Conference on Cloud Computing, Data Science and Engineering (Confluence) (pp. 199-204). IEEE.
[115] Roy, S., Ellis, C., Shiva, S., Dasgupta, D., Shandilya, V., and Wu, Q. (2010, January). A survey of game theory as applied to network security. In 2010 43rd Hawaii International Conference on System Sciences (pp. 1-10). IEEE.
[116] Shiva, S., Roy, S., and Dasgupta, D. (2010, April). Game theory for cyber security. In Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research (p. 34). ACM.
[117] Ramachandran, K., and Stefanova, Z. (2016). Dynamic game theories in cyber security. In Proceedings of International Conference of Dynamic Systems and Applications, 7, 303-310.
[118] Wang, Y., Wang, Y., Liu, J., Huang, Z., and Xie, P. (2016, June). A survey of game theoretic methods for cyber security. In 2016 IEEE First
Pacific-Asia Workshop on Intelligence and Security Informatics (pp. 196- International Conference on Data Science in Cyberspace (DSC) (pp. 631-
207). Springer, Berlin, Heidelberg. 636). IEEE.
[96] Servin, A., and Kudenko, D. (2008). Multi-agent reinforcement lrning [119] Zhu, Q., and Rass, S. (2018, October). Game theory meets network se-
for intrusion detection. In Adaptive Agents and Multi-Agent Systems III. curity: A tutorial. In Proceedings of the 2018 ACM SIGSAC Conference
Adaptation and Multi-Agent Learning (pp. 211-223). Springer, Berlin, on Computer and Communications Security (pp. 2163-2165). ACM.
Heidelberg. [120] Mpitziopoulos, A., Gavalas, D., Konstantopoulos, C., and Pantziou, G.
[97] Janagam, A., and Hossen, S. (2018). Analysis of Network Intrusion De- (2009). A survey on jamming attacks and countermeasures in WSNs.
tec tion System with Machine Learning Algorithms (Deep Reinforcement IEEE Communications Surveys and Tutorials, 11(4), 42-56.
Learning Algorithm) (Dissertation, Blekinge Institute of Technology, [121] Hu, S., Yue, D., Xie, X., Chen, X., and Yin, X. (2018). Resilient
Sweden). event-triggered controller synthesis of networked control systems under
[98] Xu, X., and Xie, T. (2005, August). A reinforcement learning approach periodic DoS jamming attacks. IEEE Transactions on Cybernetics, DOI:
for host-based intrusion detection using sequences of system calls. 10.1109/TCYB.2018.2861834.
In International Conference on Intelligent Computing (pp. 995-1003). [122] Boche, H., and Deppe, C. (2019). Secure identification under passive
Springer, Berlin, Heidelberg. eavesdroppers and active jamming attacks. IEEE Transactions on Infor-
[99] Sutton, R. S. (1988). Learning to predict by the methods of temporal mation Forensics and Security, 14(2), 472-485.
differences. Machine Learning, 3(1), 9-44. [123] Wu, Y., Wang, B., Liu, K. R., and Clancy, T. C. (2012). Anti-
[100] Xu, X. (2006, September). A sparse kernel-based least-squares tem- jamming games in multi-channel cognitive radio networks. IEEE Journal
poral difference algorithm for reinforcement learning. In International on Selected Areas in Communications, 30(1), 4-15.
Conference on Natural Computation (pp. 47-56). Springer, Berlin, Hei- [124] Singh, S., and Trivedi, A. (2012, September). Anti-jamming in cogni-
delberg. tive radio networks using reinforcement learning algorithms. In Wireless
[101] Xu, X., and Luo, Y. (2007, June). A kernel-based reinforcement and Optical Communications Networks (WOCN), 2012 Ninth Interna-
learning approach to dynamic behavior modeling of intrusion detection. tional Conference on (pp. 1-5). IEEE.
In International Symposium on Neural Networks (pp. 455-464). Springer, [125] Gwon, Y., Dastangoo, S., Fossa, C., and Kung, H. T. (2013, October).
Berlin, Heidelberg. Competing mobile network game: Embracing antijamming and jamming
[102] Xu, X. (2010). Sequential anomaly detection based on temporal- strategies with reinforcement learning. In 2013 IEEE Conference on
difference learning: Principles, models and case studies. Applied Soft Communications and Network Security (CNS) (pp. 28-36). IEEE.
Computing, 10(3), 859-867. [126] Conley, W. G., and Miller, A. J. (2013, November). Cognitive jamming
[103] Deokar, B., and Hazarnis, A. (2012). Intrusion detection system using game for dynamically countering ad hoc cognitive radio networks. In
log files and reinforcement learning. International Journal of Computer MILCOM 2013-2013 IEEE Military Communications Conference (pp.
Applications, 45(19), 28-35. 1176-1182). IEEE.
15
[127] Dabcevic, K., Betancourt, A., Marcenaro, L., and Regazzoni, C. S. detection. In GLOBECOM 2017-2017 IEEE Global Communications
(2014, May). A fictitious play-based game-theoretical approach to alle- Conference (pp. 1-6). IEEE.
viating jamming attacks for cognitive radios. In 2014 IEEE International [150] Li, Y., Liu, J., Li, Q., and Xiao, L. (2015, April). Mobile cloud of-
Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. floading for malware detections with learning. In 2015 IEEE Conference
8158-8162). IEEE. on Computer Communications Workshops (INFOCOM WKSHPS) (pp.
[128] Slimeni, F., Scheers, B., Chtourou, Z., and Le Nir, V. (2015, May). Jam- 197-201). IEEE.
ming mitigation in cognitive radio networks using a modified Q-learning [151] Salahuddin, M. A., Al-Fuqaha, A., and Guizani, M. (2015). Software-
algorithm. In 2015 International Conference on Military Communications defined networking for RSU clouds in support of the internet of vehicles.
and Information Systems (ICMCIS) (pp. 1-7). IEEE. IEEE Internet of Things Journal, 2(2), 133-144.
[129] Xiao, L., Lu, X., Xu, D., Tang, Y., Wang, L., and Zhuang, W. [152] Huang, R., Chu, X., Zhang, J., and Hu, Y. H. (2015). Energy-
(2018). UAV relay in VANETs against smart jamming with reinforcement efficient monitoring in software defined wireless sensor networks using
learning. IEEE Transactions on Vehicular Technology, 67(5), 4087-4097. reinforcement learning: A prototype. International Journal of Distributed
[130] Xiao, L., Wan, X., Dai, C., Du, X., Chen, X., and Guizani, M. Sensor Networks, 11(10), 360428.
(2018). Security in mobile edge caching with reinforcement learning. [153] Kim, S., Son, J., Talukder, A., and Hong, C. S. (2016, January). Con-
IEEE Wireless Communications, 25(3), 116-122. gestion prevention mechanism based on Q-leaning for efficient routing
[131] Aref, M. A., Jayaweera, S. K., and Machuzak, S. (2017, March). Multi- in SDN. In 2016 International Conference on Information Networking
agent reinforcement learning based cognitive anti-jamming. In Wireless (ICOIN) (pp. 124-128). IEEE.
Communications and Networking Conference (WCNC), 2017 IEEE (pp. [154] Lin, S. C., Akyildiz, I. F., Wang, P., and Luo, M. (2016, June).
1-6). IEEE. QoS-aware adaptive routing in multi-layer hierarchical software defined
[132] Machuzak, S., and Jayaweera, S. K. (2016, July). Reinforcement networks: A reinforcement learning approach. In 2016 IEEE International
learning based anti-jamming with wideband autonomous cognitive radios. Conference on Services Computing (SCC) (pp. 25-33). IEEE.
In 2016 IEEE/CIC International Conference on Communications in China [155] Mestres, A., Rodriguez-Natal, A., Carner, J., Barlet-Ros, P., Alarcn, E.,
(ICCC) (pp. 1-5). IEEE. Sol, M., ... and Estrada, G. (2017). Knowledge-defined networking. ACM
[133] Felice M.D., Bedogni L., Bononi L. (2019) Reinforcement learning- SIGCOMM Computer Communication Review, 47(3), 2-10.
based spectrum management for cognitive radio networks: A literature [156] Han, Y., Rubinstein, B. I., Abraham, T., Alpcan, T., De Vel, O.,
review and case study. In: Zhang W. (eds) Handbook of Cognitive Radio Erfani, S., ... and Montague, P. (2018, October). Reinforcement learning
(pp. 1849-1886). Springer, Singapore. for autonomous defence in software-defined networking. In International
[134] Attar, A., Tang, H., Vasilakos, A. V., Yu, F. R., and Leung, V. C. (2012). Conference on Decision and Game Theory for Security (pp. 145-165).
A survey of security challenges in cognitive radio networks: Solutions and Springer, Cham.
future research directions. Proceedings of the IEEE, 100(12), 3172-3186. [157] Lantz, B., and O’Connor, B. (2015, August). A mininet-based virtual
[135] Wang, B., Wu, Y., Liu, K. R., and Clancy, T. C. (2011). An anti- testbed for distributed SDN development. In ACM SIGCOMM Computer
jamming stochastic game for cognitive radio networks. IEEE Journal on Communication Review (Vol. 45, No. 4, pp. 365-366). ACM.
Selected Areas in Communications, 29(4), 877-889. [158] Zhu, M., Hu, Z., and Liu, P. (2014, November). Reinforcement learning
[136] Littman, M. L. (1994). Markov games as a framework for multi- algorithms for adaptive cyber defense against Heartbleed. In Proceedings
agent reinforcement learning. In Proceedings of the 11th International of the First ACM Workshop on Moving Target Defense (pp. 51-58). ACM.
Conference on Machine Learning (pp. 157-163). [159] Wang, J., Zhao, M., Zeng, Q., Wu, D., and Liu, P. (2015, June). Risk
[137] Xiao, L., Li, Y., Liu, J., and Zhao, Y. (2015). Power control with assessment of buffer ”Heartbleed” over-read vulnerabilities. In 2015 45th
reinforcement learning in cooperative cognitive radio networks against Annual IEEE/IFIP International Conference on Dependable Systems and
jamming. The Journal of Supercomputing, 71(9), 3237-3257. Networks (pp. 555-562). IEEE.
[138] Yang, D., Xue, G., Zhang, J., Richa, A., and Fang, X. (2013). Coping [160] Luo, B., Yang, Y., Zhang, C., Wang, Y., and Zhang, B. (2018, June).
with a smart jammer in wireless networks: A Stackelberg game approach. A survey of code reuse attack and defense. In International Conference
IEEE Transactions on Wireless Communications, 12(8), 4038-4047. on Intelligent and Interactive Systems and Applications (pp. 782-788).
[139] Bowling, M., and Veloso, M. (2002). Multiagent learning using a Springer, Cham.
variable learning rate. Artificial Intelligence, 136(2), 215-250. [161] Elderman, R., Pater, L. J., Thie, A. S., Drugan, M. M., and Wiering,
[140] Han, G., Xiao, L., and Poor, H. V. (2017, March). Two-dimensional M. (2017, January). Adversarial reinforcement learning in a cyber se-
anti-jamming communication based on deep reinforcement learning. In curity simulation. In International Conference on Agents and Artificial
Proceedings of the 42nd IEEE International Conference on Acoustics, Intelligence (ICAART), (2) (pp. 559-566).
Speech and Signal Processing, (pp. 2087-2091). IEEE. [162] Chung, K., Kamhoua, C. A., Kwiat, K. A., Kalbarczyk, Z. T., and
[141] Liu, X., Xu, Y., Jia, L., Wu, Q., and Anpalagan, A. (2018). Anti- Iyer, R. K. (2016, January). Game theory with learning for cyber
jamming communications using spectrum waterfall: A deep reinforcement security monitoring. In 2016 IEEE 17th International Symposium on High
learning approach. IEEE Communications Letters, 22(5), 998-1001. Assurance Systems Engineering (HASE) (pp. 1-8). IEEE.
[142] Chen, W., and Wen, X. (2016, January). Perceptual spectrum waterfall [163] Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-
of pattern shape recognition algorithm. In 2016 18th International Confer- conditional video prediction using deep networks in atari games. In
ence on Advanced Communication Technology (ICACT) (pp. 382-389). Advances in Neural Information Processing Systems (pp. 2863-2871).
IEEE. [164] Mathieu, M., Couprie, C., and LeCun, Y. (2015). Deep multi-
[143] Zeng, K., Govindan, K., and Mohapatra, P. (2010). Non-cryptographic scale video prediction beyond mean square error. arXiv preprint
authentication and identification in wireless networks. IEEE Wireless arXiv:1511.05440.
Communications, 17(5), 56-62. [165] Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018, May).
[144] Xiao, L., Li, Y., Liu, G., Li, Q., and Zhuang, W. (2015, December). Neural network dynamics for model-based deep reinforcement learning
Spoofing detection with reinforcement learning in wireless networks. In with model-free fine-tuning. In 2018 IEEE International Conference on
Global Communications Conference (GLOBECOM), 2015 IEEE (pp. 1- Robotics and Automation (ICRA) (pp. 7559-7566). IEEE.
5). IEEE. [166] Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P.
[145] Xiao, L., Li, Y., Han, G., Liu, G., and Zhuang, W. (2016). PHY- I., Rohlfshagen, P., ... and Colton, S. (2012). A survey of Monte Carlo
layer spoofing detection with reinforcement learning in wireless networks. tree search methods. IEEE Transactions on Computational Intelligence
IEEE Transactions on Vehicular Technology, 65(12), 10037-10047. and AI in Games, 4(1), 1-43.
[146] Sutton, R. S. (1990, June). Integrated architecture for learning, plan- [167] Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016).
ning, and reacting based on approximating dynamic programming. In Value iteration networks. In Advances in Neural Information Processing
Proceedings of the Seventh International Conference on Machine Learn- Systems (pp. 2154-2162).
ing (pp. 216-224). Morgan Kaufmann Publishers Inc.. [168] Pascanu, R., Li, Y., Vinyals, O., Heess, N., Buesing, L., Racanire, S.,
[147] Sun, X., Dai, J., Liu, P., Singhal, A., and Yen, J. (2018). Using Bayesian ... and Battaglia, P. (2017). Learning model-based planning from scratch.
networks for probabilistic identification of zero-day attack paths. IEEE arXiv preprint arXiv:1707.06170.
Transactions on Information Forensics and Security, 13(10), 2506-2521. [169] Giles, M. (2019, January 4). Five emerging cyber-threats to worry about
[148] Afek, Y., Bremler-Barr, A., and Feibish, S. L. (2019). Zero-day in 2019. MIT Technology Review.
signature extraction for high-volume attacks. IEEE/ACM Transactions on [170] Behzadan, V., and Munir, A. (2017, July). Vulnerability of deep
Networking. DOI: 10.1109/TNET.2019.2899124. reinforcement learning to policy induction attacks. In International Con-
[149] Wan, X., Sheng, G., Li, Y., Xiao, L., and Du, X. (2017, December). ference on Machine Learning and Data Mining in Pattern Recognition
Reinforcement learning based mobile offloading for cloud-based malware (pp. 262-275). Springer, Cham.
16