Dinesh BTP Thesis 1

CVAE DDQN and PPO Actor Critic based IDS for Smart
Grid Environment
Thesis Submitted in partial fulfillment of

the requirements for the degree of
Bachelor of Technology
in
Computer Science And Engineering
by
Dinesh Mohanty
Under the guidance of

Dr. Padmalochan Bera
SCHOOL OF ELECTRICAL SCIENCES

INDIAN INSTITUTE OF TECHNOLOGY
BHUBANESWAR
May 2021
©2021 Dinesh Mohanty. All rights reserved.
Certificate
Candidate’s Declaration
I hereby declare that the work presented in the thesis entitled "CVAE DDQN and PPO Actor
Critic based IDS for Smart Grid Environment" in fulfillment of the requirements for the award
of the Degree of Bachelor of Technology and submitted in the School of Electrical Sciences of the
Indian Institute of Technology Bhubaneswar is an authentic record of my own work carried out under
the supervision of Prof. Dr. Padmalochan Bera, School of Electrical Sciences, Indian Institute of
Technology Bhubaneswar.
The matter presented in this thesis has not been submitted by me for the award of any other degree
of this or any other Institute/University.
Dinesh Mohanty
(Roll no. 17CS01051)
This is to certify that the above statement made by the candidate is true to the best of our knowledge
and belief.
Dr. Padmalochan Bera

School of Electrical Sciences
Indian Institute of Technology Bhubaneswar
Place:
Date:
Acknowledgement
It is a great honor to express my most profound respect and sense of gratitude to my B. Tech Project
supervisor Dr. Padmalochan Bera for his knowledge, insights, expertise, guidance, enthusiastic
involvement, and persistent encouragement during the planning and development of this thesis work.
I also gratefully acknowledge his meticulous efforts in thoroughly going through and improving many
of my research manuscripts without which this work could not have been completed.
I am highly obliged to all the professors of the Computer Science Department, including Dr.
Manoranjan Satpathy, Dr. D. P. Dogra, Dr. Joy Chandra Mukherjee, Dr. Srinivas
Pinisetty, and Dr. Sudipta Saha for providing all the guidance, help, and encouragement during
my last four years at college.
I am extremely grateful to my parents and grandparents for their moral support, love,
encouragement, and blessings to complete this task. I am especially thankful to Kamalakanta Sethi for
all the support and mentoring, as well as for providing excellent exposure in the field of research.
I thank all of my co-authors: Sai Prasath, Kamalakanta Sethi, Rahul Kumar, and Dr.
Padmalochan Bera for correcting and inspiring most of the contents of this thesis.
I would like to express my deep and sincere thanks to my friends and all other persons whose names
do not appear here, for helping me either directly or indirectly in all even and odd times.
Finally, I am indebted and grateful to the Almighty for helping me in this endeavor.
Abstract
The smart grid is a cyber-physical system in which hardware, software, and physical components are
integrated and interact to sense the fluctuating state of the physical world. It provides on-demand
electricity to customers from centralized and distributed generation stations using information and
communication technologies, allowing power companies to deliver reliable power at reduced cost and to
control power demand. Security is one of the most significant concerns in a smart grid system. Smart
grids improve and enhance the capabilities of conventional power networks, but they also make those
networks more prone to cyber-attacks, which may compromise the integrity and confidentiality of the
network. An Intrusion Detection System (IDS) has proven to be one of the significant ways of providing
safe and robust services in a smart grid environment. In this work, I propose two advanced
Reinforcement Learning based intrusion detection system frameworks for the smart grid that utilise its
three-layer architecture. The proposed framework has an IDS in each HAN and NAN and many IDS sensors
in the WAN. All malicious activities are reported to the central management unit, which then
correlates and investigates the alerts produced by the distributed sensors using an anomaly-based
detection methodology. We have used robust IDSs with low false-alarm rates, based on Deep Reinforcement
Learning along with generative models and Proximal Policy Optimisation methods. Unfortunately,
there is a lack of smart grid-based intrusion datasets, which has been compensated for by the
well-known conventional network-based NSL-KDD dataset and the cloud-based ISOT dataset. Upon testing,
the proposed models showed excellent novel attack detection capabilities and strong performance
metrics. Finally, the real-world adaptiveness of our models was evaluated using day-wise changes in the
attack pattern and by checking our models against specific attack types.
Contents
Certificate
Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
2 Related work based on literature study
3 Background
3.1 Double Deep Q Learning (DDQN)
3.1.1 Q Learning
3.1.2 Deep Q Learning
3.1.3 Double Deep Q Learning
3.2 Curiosity Driven Variational Autoencoder
3.3 Proximal Policy Optimisation and Actor Critic methods
3.3.1 Vanilla Policy Gradient
3.3.2 Trust Region Policy Optimisation
3.3.3 Proximal Policy Optimisation
3.3.4 Actor Critic Model
3.4 Smart Grids
3.4.1 Description
3.4.2 Goals
3.5 Benefits
3.6 Types of attacks
3.7 IDS specific to Smart Grids
3.7.1 Intrusion Detection in Home Area Network
3.7.2 Intrusion Detection in Neighborhood Area Network
3.7.3 Intrusion Detection in Wide Area Network
3.7.4 Deploying IDS in the Smart Grid
4 Dataset Description
4.1 NSLKDD Dataset
4.2 ISOT-CID dataset
4.2.1 Overview
4.2.2 Cloud environment and Data collection outline
4.2.3 Attack categories
4.2.4 Network Traffic Data
4.2.5 File Structure
4.3 Preprocessing
4.3.1 Tranalyzer
4.3.2 Processing of Tranalyzer output
5 Proposed Intrusion Detection System
5.1 Our Proposed Model
5.2 Architecture Modifications for PPO Powered Actor Critic Model
6 Experimental results and discussion
6.1 Performance of model on continuously changing attack types
6.2 Attack specific classification on NSL-KDD Dataset
7 Conclusion and Future Work
7.1 Future Work
7.2 Conclusion
8 Publications
9 References
List of Figures
3.1 Benefits of smart grid
3.2 Layer architecture of smart grid
4.1 Distribution of labelled data in the training and testing dataset in NSLKDD
4.2 Cloud Environment Architecture used for ISOT Data collection
4.3 File Structure of ISOT CID
5.1 Cloud IDS Deployment Architecture
5.2 Pictorial representation for agent result calculation using DDQN and prioritised learning.
CP1, CP2, ..., CPk represent the classifier predictions in the form of 1 (for attack) and
0 (for normal) for classifiers C1, C2, ..., Ck respectively
5.3 Pictorial representation of PPO based Actor Critic Model deployment
5.4 Pictorial representation of PPO based Actor Critic Model
List of Tables
3.1 Security threats for smart grids
4.1 ISOT dataset attack type distribution
4.2 ISOT dataset distribution
6.1 Model Performance on ISOT CID and NSLKDD Dataset
6.2 Performance of DDQN CVAE model on daily changing attack type
6.3 Performance of PPO Actor Critic model on daily changing attack type
6.4 Performance of CVAE DDQN model on daily changing attack type
6.5 Performance of PPO Actor Critic model on daily changing attack type
Chapter 1
Introduction
Today, electric power distribution is made possible by the power distribution grid, a system of
transmission media that allows electricity to be transferred from the point of generation to consumers
such as homes, offices, or industries. The electrical grid is expected to evolve into a new grid
paradigm: the smart grid, which uses two-way flows of electricity and information to create an
automated and distributed advanced energy delivery network. A smart grid is an electricity network
that can intelligently integrate the actions of all users connected to it (generators, consumers, and
those that do both) in order to optimize the production, supply, and consumption of electricity and
provide several features to its customers [1]. Security is considered one of the most significant
concerns in a smart grid system. The smart grid domain can be divided into two main components: system
and network. The system component includes service providers, the electricity utility operation
center, smart meters, electrical household appliances, and renewable energy resources. The network
component consists of three types of communication that are incorporated in smart grid operation:
Home Area Network (HAN) or Business Area Network (BAN), Neighborhood Area Network (NAN), and Wide Area
Network (WAN). HANs connect smart devices in the home with a smart meter and can communicate
using Bluetooth, ZigBee, wireless, or wired Ethernet. The NAN is a wider network that collects
service and metering information from many HANs that are geographically adjacent to each other.
The WAN provides wireless and wired communication between the distributed grid, the NANs, and the
utility. WAN communication can be achieved using fiber optics, WiMAX, or 4G/3G/GSM/LTE [3].
The smart grid improves and enhances the capabilities of the conventional power network but, at the
same time, makes it more vulnerable to different types of attack. These weaknesses permit an attacker
to break the integrity and confidentiality of the network and to gain access to it. The more severe
vulnerabilities are as follows: physical security; customer security; a greater number of intelligent
devices; power system lifetime; trust between traditional power equipment; different teams'
backgrounds; and inherited IP-based vulnerabilities [2]. Cyber security becomes more exposed as
society becomes progressively dependent on computerized systems for industry, medicine, finance, and
so on. Cyber security can be improved by utilizing intrusion detection to look for unseen attack
patterns and actions. Intrusion detection also plays a significant role in computer forensics to
identify successful breaches. Accidents and undesired conditions can also be detected using an
intrusion detection strategy. Earlier cyber-physical systems were considered safe, but the vast
diversity of cyber-physical system domains, together with everything being connected to the internet,
makes them unsafe. Many attacks have been successfully launched against cyber-physical systems,
including malicious attacks on power networks. Several security weaknesses exist in all kinds of
cyber-physical systems. A lot of research is ongoing on smart grid system implementation, but most of
it does not focus on the security requirements of the smart grid system. Reliability is one of the
important design issues in the development of the smart grid system, and IDS has gained importance
there. Many works consider the detection ratio as the performance measure of intrusion detection, but
none of them cover the latency and delay concerns of intrusion detection, even though some smart grid
systems cannot afford a delay beyond a certain deadline.
Researchers also classify IDS as (1) signature-based systems and (2) anomaly-based systems [17].
Signature-based systems use a repository of signatures of already identified malicious patterns to
detect an attack. They are efficient in detecting known attacks but fail in case of unforeseen attacks.
In contrast, anomaly-based IDS detects intrusions when the set of activities of any user deviates from
its normal functionalities. Although these systems can detect a zero-day attack, they tend to generate
a lot of false alerts leading to a high false-positive rate (FPR).
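The trade-off mentioned above is usually quantified with accuracy and the false-positive rate. The following is a minimal sketch, assuming binary labels with 1 for attack and 0 for normal; the function name `accuracy_and_fpr` and the sample labels are illustrative and are not taken from the thesis experiments.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false-positive rate) for binary labels (1 = attack, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal traffic correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal traffic wrongly flagged
    accuracy = (tp + tn) / len(pairs)
    # FPR = FP / (FP + TN): the share of normal traffic that raises a false alert.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Illustrative labels only (not real data).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
acc, fpr = accuracy_and_fpr(y_true, y_pred)
```

An anomaly-based IDS that flags too aggressively raises `fpr` even when `acc` stays high, which is why both numbers are reported together.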
In the last decade, many researchers have proposed traditional machine learning and deep
learning-based IDSs that show excellent performance [5,12,25]. However, they also have several
limitations: 1) a lack of proper adaptivity towards novel attacks and changes in attack patterns, and
of the ability to identify them with high accuracy and a low False Positive Rate (FPR); 2) a need for
frequent human intervention for training, which introduces more vulnerability and thereby affects the
model's performance; and 3) a low speed of adaptation due to the lack of curiosity-based learning.
My research aims at dealing with all such issues through the use of Deep Reinforcement Learning
algorithms combined with intelligent flow analysis and smart experience replay powered by the recently
introduced generative autoencoders. I also propose the use of a Proximal Policy Optimised Actor
Critic based Intrusion Detection System for the same purpose. Our proposed IDSs can detect novel
attacks and adapt to changes in attack patterns in a smart grid environment with very little human
interaction. We also evaluated our proposed system using ISOT-CID, a real-time cloud intrusion
dataset, and NSL-KDD, a very popular conventional network dataset. The results show that our
proposed model effectively achieves the right balance between accuracy and false-positive rate.
Chapter 2
Related work based on literature
study
In this chapter, I present the related works on network intrusion detection for cloud environments
using machine learning techniques.
Chiba et al. [5] proposed a NIDS framework for cloud environment. Their proposed NIDS framework
used cooperative and hybrid approach to detect network intrusions in cloud. It used SNORT (a
signature based technique) at the front end and back propagation neural network (an anomaly based
technique) at the back end in order to detect external and internal attacks respectively. NIDS deployed
on all processing servers work in a cooperative way to detect coordinated attacks by sharing alerts
with each other. However, the authors did not validate the efficacy of their model using any network
intrusion dataset.
Li et al. [12] proposed a scalable and distributed NIDS architecture for cloud platforms. The proposed
NIDS consisted of various nodes, and each node ran a back propagation based ANN. The proposed NIDS
was evaluated with the KDD dataset on a physical cloud testbed. The experimental results show an
average detection rate of 99% and an average detection time of 37.1 seconds. The major limitation of
this work lies with the ANN, which takes a long time to train on a large dataset. Also, the simulated
dataset does not represent a true cloud dataset.
K. Sethi et al. [21] presented a cloud NIDS using reinforcement learning. Their IDS can detect new
attacks in the cloud and is also adaptive to attack pattern changes in the cloud. Their model
maintained a balance between accuracy and FPR. The main problem in their work is that they validated
their model with a conventional network dataset (UNSW) instead of a cloud network dataset, so the
evaluation does not reflect a real cloud environment.
Kholidy et al. [10] created a new dataset called the cloud intrusion detection dataset (CIDD) for
cloud IDS evaluation. The dataset includes both knowledge-based and behavior-based audit data. To
build the dataset, the authors implemented a log analyzer and correlator system (LACS) that extracts
and correlates user audits from a set of log files from the DARPA dataset. The main issue with this
dataset is that its main focus is on detecting masquerade attacks. Also, it does not consider network
flows involving the hypervisor. Moreover, the dataset is not publicly available.
In summary, state-of-the-art works do not apply Deep Reinforcement Learning to cloud intrusion
detection systems, though a few recent attempts exist for conventional network applications. Also,
the existing works do not use cloud-specific datasets and are therefore not capable of representing
a real cloud environment. Aldribi et al. [4] introduced the first publicly available cloud dataset,
called ISOT-CID, which was collected from a real cloud computing environment. The dataset consists of
a wide variety of traditional network attacks as well as cloud-specific attacks. The authors discuss
a hypervisor-based cloud IDS involving novel feature extraction, which obtains an accuracy (best
performance) of 95.95% with an FPR of 5.77% for the Phase-2 and hypervisor-B portion of the dataset.
We have used a CVAE-DDQN based model for detecting intrusions with very high accuracy and low
FPR. Our model also shows highly robust traits.
Boumkheld et al. [16] proposed an anomaly-based IDS for Advanced Metering Infrastructure (AMI)
using the AODV protocol to detect blackhole attacks. It achieved 100% TPR, 99% accuracy, and 66%
precision. They worked on simulated data. Faisal et al. [13] also proposed an anomaly-based IDS
for AMI using the MOA software to detect DoS, R2L, U2R, and probing attacks. It achieved 94.67%
accuracy and 3.31% FPR. They worked on the KDD CUP 1999 and NSL-KDD datasets.
Goldberg et al. [7] designed an anomaly-based IDS for the SCADA component of the smart grid using the
Modbus protocol and software tools such as Wireshark, pcapy, and Impacket to detect several attacks.
They were able to achieve 100% precision, 0% FNR, 100% accuracy, and 0% FPR. They worked on a
self-generated real-world dataset. Feng et al. [28] also designed a hybrid IDS for the SCADA component
of the smart grid using the Profinet protocol and the Snort software tool to detect reconnaissance,
protocol anomalies, and DoS attacks.
Kwon et al. [26] targeted the smart grid substation using the MMS and IEC 61850 protocols
and software tools such as Wireshark to detect DoS, port scanning, portable executable, GOOSE, MMS,
and SNMP attacks. They were able to achieve 100% precision, 1.1% FNR, 98.9% TPR, and 0% FPR.
They worked on real data from a substation in South Korea. The proposed IDS was specification
based. Yoo et al. [27] also designed an anomaly-based IDS for the smart grid substation using the MMS
and GOOSE protocols and software tools such as the WEKA framework to detect several attacks. They
were able to achieve an average of 3.5% FPR. They also worked on real data from a substation.
Pan et al. [18] proposed a hybrid IDS for the synchrophasor component of the smart grid using the
Snort and OpenPDC software tools to detect single line-to-ground faults and replay, command
injection, and disable relay attacks. They were able to achieve 90.4% accuracy. They worked on a
simulated dataset. Yang et al. [6] also targeted the synchrophasor component of the smart grid using
the IEEE C37.118 protocol and software tools such as ITACA, Nmap, Metasploit, and hping to detect
reconnaissance and DoS attacks. They were able to achieve 0% FPR. Their IDS was also specification
based.
Rupam Kr. Sharma et al. [4] analyzed various machine learning techniques for intrusion detection
systems using the KDD'99 dataset. High detection accuracy can be accomplished by using machine
learning techniques; however, because the learning process of these algorithms can be poisoned, they
may exhibit weaknesses that cause misclassification of network data. Intrusion detection approaches
in advanced metering infrastructure (AMI) were explained in [5].
The proposed state-based approach calculates security metrics using attack steps to achieve a high
degree of confidence in intrusion detection in AMI.
A two-tier intrusion detection framework was suggested in [6] for advanced metering infrastructure.
This structure achieved a high detection rate with a low rate of false alarms.
Intrusion tolerance techniques used in [7] improve the availability of the smart grid. The proposed
intrusion tolerance system was evaluated in the event of DoS attacks, and the results were compared
with two existing intrusion tolerance systems.
Yong Wang et al. [8] proposed intrusion detection in SCADA to identify false data injection
attacks. Their intrusion detection improves accuracy using a new graph model. Several security
objectives are required to assure the safety of the smart grid system. To ensure confidentiality, the
smart grid should be able to avoid exposure to unauthorized systems or individuals. The smart grid
enforces confidentiality by encrypting data while in transit and restricting access to the places
where these data are stored. A system breach occurs if user data are revealed in any way. The smart
grid should protect the channel between sensors, actuators, and controllers [9]. Sensors, actuators,
and controllers send and receive information in a smart grid system. The smart grid system ensures
integrity by preserving the information sent and received by these devices and detecting changes to
it. To ensure integrity, the smart grid system is required to have the capability to identify any
changes introduced into a message being transferred [10]. Availability is more critical in smart grid
automation but less significant in smart metering applications. The smart grid system aims to provide
high-availability services by preventing denial of service attacks, power outages, and hardware or
system failures. Leon Wu et al. [11] described a reliability framework for the smart grid system. The
framework works in parallel to evaluate the reliability of several stages and also provides
concurrent feedback for the enhancement of safety. Robert Mitchell et al. [12] developed a
probability-based model to analyze the effect of intrusion detection and response on the reliability
of a cyber-physical system. Robustness defines the extent to which a structure is capable of working
appropriately in the event of a disruption. Matthias Rungger et al. [13] introduced dynamic stability
systems based on bounded disturbance and sporadic disturbance.
Trustworthiness describes the degree to which the system can be relied upon to achieve system tasks
appropriately under well-defined environmental circumstances in a specified period. Bjorn Stelte
et al. [14] proposed a malicious node detection and protection mechanism to assure the trustworthiness
of sensor data.
Chapter 3
Background
3.1 Double Deep Q Learning (DDQN)
In order to improve upon the architecture of the existing model, I had to learn about the detailed
working of Deep Q Learning and its variants. In this section, I aim to present a high-level overview
of the various algorithms, starting from Q Learning, which led to the development of the Double Deep Q
Learning algorithm.
3.1.1 Q Learning
Q Learning is one of the most famous Reinforcement Learning algorithms which uses Q (stands
for Quality) function to estimate reward values, which are used to provide the reinforcement. For
any Finite Markov decision process (FMDP), Q-learning identifies an optimal action selection policy
with the objective of maximizing the expected value of the total reward that can be obtained in the
successive steps (provided that it is given infinite exploration time and a partly-random policy) [23].
It uses Temporal Difference learning for this purpose. Temporal Difference value can be understood
as an estimate of the amount of reward that can be expected in the future. If the T.D. value is
very small, it means that the classifier has understood the environment well and there is little scope
for further improvement. Hence, the major goal is to minimize the T.D. values. The equation for
updating the Q value is as follows:
$$Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$
where $s_t$ is the state at time t, $a_t$ is the action taken at time t, $r_t$ is the reward obtained
at time t, $\alpha$ is the learning rate, $\gamma$ is the discount factor, $Q(s_t, a_t)$ is the old Q
value, $\max_a Q(s_{t+1}, a)$ is the estimate of the optimal future value,
$r_t + \gamma \max_a Q(s_{t+1}, a)$ is the temporal difference target, and
$r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)$ is the temporal difference error.
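The update rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming a dictionary-backed Q table; the function name `q_learning_update`, the state names, and the two-action set are hypothetical and are not part of the thesis implementation.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Estimate of the optimal future value: max over actions in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - q[(state, action)]
    # Move the old Q value a fraction alpha towards the TD target.
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)      # unseen (state, action) pairs start at 0.0
actions = [0, 1]            # hypothetical: 0 = label as normal, 1 = label as attack
q[("s1", 1)] = 2.0
td = q_learning_update(q, "s0", 1, reward=1.0, next_state="s1", actions=actions)
```

Here the TD target is 1.0 + 0.9 * 2.0 = 2.8, so the entry for `("s0", 1)` moves from 0 to roughly 0.28.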
3.1.2 Deep Q Learning
A major limitation of Q-learning is that it works only in the environments that have discrete and
finite state-action spaces. In order to extend Q-learning to richer environments (where storing the
full state-action table is often infeasible), we use Deep Neural Networks as function approximators
that can learn the value function by taking just the states as inputs. The Deep-Q Learning is one
such solution for applying the concept of Q Learning in more complex environments. It uses Deep
Neural Networks to estimate the Q values of all possible actions at a given state. The loss function is
generally modeled to represent the Temporal Difference equation of Q Learning (as the objective of
reducing T.D. value holds true here as well).
$$loss = \left( r + \gamma \max_{a'} \hat{Q}(s', a') - Q(s, a) \right)^2$$
where $s$ is the state, $a$ is the action taken, $r$ is the reward obtained, $s'$ is the next state,
$\gamma$ is the discount factor, and $\hat{Q}(s', a')$ is the delayed Q function. As we can see, the
loss function is very similar to the temporal difference function in the case of Q Learning.
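As a minimal sketch of this loss for a single transition, the function below takes the Q-values already produced by the two networks as plain lists, where the delayed network is evaluated on the state that follows s. The name `dqn_loss` and the `done` flag (which drops the bootstrap term at the end of an episode) are assumptions added for illustration.

```python
def dqn_loss(q_online_s, action, reward, q_delayed_next, gamma=0.99, done=False):
    """Squared TD error for one transition.

    q_online_s     : Q-values of the current network for state s (one per action)
    q_delayed_next : Q-values of the delayed network for the next state
    """
    # No future value is bootstrapped after a terminal transition (assumption).
    bootstrap = 0.0 if done else gamma * max(q_delayed_next)
    td_target = reward + bootstrap
    return (td_target - q_online_s[action]) ** 2


loss = dqn_loss([0.5, 1.0], action=1, reward=1.0, q_delayed_next=[2.0, 3.0], gamma=0.5)
```

In a real DQN this scalar would be averaged over a replay batch and minimised with gradient descent on the current network only, with the delayed network held fixed.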
Deep Q Learning was also tested against classic Atari 2600 games [15] , where it outperformed
other Machine Learning methods in most of the games, and performed at a level comparable with or
superior to a professional human games tester.
3.1.3 Double Deep Q Learning
However, in their paper [24], van Hasselt et al. explain the frequent overestimation problem found in
Deep Q Learning due to the inherent estimation errors of learning. Such overestimation-related errors
are also seen in the Q Learning algorithm and were first investigated by Thrun and Schwartz [23]. They
showed that if the action values contain random errors uniformly distributed in an interval
$[-\epsilon, \epsilon]$, then each target is overestimated by up to $\gamma \epsilon (m - 1)/(m + 1)$,
where m is the number of actions. They also gave an example in which such errors led to sub-optimal
policies. Later, van Hasselt (2010) showed how noise from the environment could lead to
overestimations even when using a tabular representation. He proposed Double Q-learning as a solution,
in which the action selection and action evaluation procedures are decoupled, which lessens the
overestimation problem significantly. Later, van Hasselt et al. [24] proposed a Double Deep Q Learning
architecture that uses the existing architecture and deep neural network of the DQN algorithm but
finds better policies, thereby improving the performance. They use the target network and the current
network to replicate the decoupling that was proposed in the Double Q-learning procedure.
$$Y_t^{DoubleDQN} = r_t + \gamma \, Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t); \theta_t^{-}\right)$$
where $Y_t^{DoubleDQN}$ is the temporal difference target at time t, $s_t$ is the state at time t,
$a_t$ is the action taken at time t, $r_t$ is the reward obtained at time t, $\gamma$ is the discount
factor, $\theta_t$ is the weight matrix of the current Q network at time t, and $\theta_t^{-}$ is the
weight matrix of the target Q network at time t.
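The decoupling can be made concrete with a small sketch that compares the plain DQN target with the Double DQN target for one transition. The Q-values are passed in as plain lists standing in for the outputs of the current and target networks; the function names and the numbers are illustrative assumptions.

```python
def dqn_target(reward, q_target_next, gamma):
    """Plain DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_current_next, q_target_next, gamma):
    """Double DQN: the current network selects, the target network evaluates."""
    best_action = max(range(len(q_current_next)), key=lambda a: q_current_next[a])
    return reward + gamma * q_target_next[best_action]


q_current_next = [1.0, 2.0, 0.5]   # Q(s_{t+1}, a; theta_t)
q_target_next = [4.0, 1.5, 0.0]    # Q(s_{t+1}, a; theta_t^-)
y_dqn = dqn_target(1.0, q_target_next, gamma=0.5)
y_ddqn = double_dqn_target(1.0, q_current_next, q_target_next, gamma=0.5)
```

In this example the target network's largest value (4.0) belongs to an action that the current network does not prefer, so the Double DQN target (1.75) is lower than the plain DQN target (3.0). This is the mechanism that damps overestimation.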
They tested their model on six Atari games by running DQN and Double DQN with 6 different
random seeds. The results showed that the overoptimistic estimation in the Deep Q Learning was
much more common and severe than what was previously acknowledged. In some cases the
overestimations were so high that a log scale had to be used to show the comparison with the optimal
policy. The results also showed that Double Deep Q Learning gave state-of-the-art results on the
Atari 2600 domain.
We have employed the 'decoupling of action selection and action evaluation' concept to build our
Double Deep Q Learning-based model and evaluated it on a real-world cloud dataset (ISOT-CID)
and a conventional network-based dataset (NSL-KDD). The evaluation suggests significant improvements
compared to the previously proposed DQN model and other simple classifiers, as discussed further.
3.2 Curiosity Driven Variational Autoencoder
VAE is a generative model that can learn unsupervised latent representations of complex
high-dimensional data [10]. The VAE model consists of two parts: an encoder $q_\phi(z|x)$ and a
decoder $p_\theta(x|z)$. The encoder takes the input sample x and yields its representation in the
latent space z. Then z is fed into the decoder to predict back the sample x. The main principle of the
VAE is to learn the marginal likelihood of a sample x from a distribution that is parametrized by
generative factors z. The marginal log likelihood of a data point x can take the following form:
$$\log p_\theta(x) = \lambda(x; \theta, \phi) + D_{KL}\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right)$$
Since the true data likelihood is usually intractable, the VAE instead optimizes an evidence lower
bound (ELBO), which is a valid lower bound of the true data log-likelihood, denoted as:
$$\lambda(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) \quad (4)$$
λ(x; θ, φ) consists of two terms: the first term can be interpreted as the (negative) reconstruction
loss, and the second term measures the difference between the approximate posterior qφ(z|x) and the
prior pθ(z) via the KL divergence. In general, qφ and pθ are implemented via deep neural networks, and
the prior pθ(z) follows the Gaussian distribution N(0, 1).
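As an illustration of the two ELBO terms, the sketch below evaluates them for a diagonal Gaussian encoder, for which the KL divergence to N(0, 1) has a closed form. The function names, the use of a squared-error reconstruction term (a Gaussian decoder log-likelihood up to constants) and the list-based vectors are assumptions made for this example.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), so sampling stays differentiable in mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def elbo(x, x_recon, mu, log_var):
    """ELBO estimate: reconstruction term minus the KL regulariser."""
    # Assumed reconstruction term: negative squared error between input and decoder output.
    reconstruction = -sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return reconstruction - kl_to_standard_normal(mu, log_var)
```

The KL term vanishes exactly when the encoder outputs zero mean and unit variance, so a perfect reconstruction with such an encoder attains an ELBO of zero under these assumptions.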
The CVAE model [8] uses the prediction error as an intrinsic reward to drive the agent towards
sufficient exploration, which can improve the quality of the generated training samples. It involves
another encoder that generates r'_t and s'_{t+1} given s_t and a_t. An error term e_t is then computed,
defined as follows:
$$e_t = D_{KL}((s_{t+1}, r_t) \,\|\, (s'_{t+1}, r'_t))$$
This error term is treated as an intrinsic reward and is added to the overall reward estimate. Such a
curiosity-driven estimator is used to improve the efficiency of our exploration. The loss function in
this case can be approximated as follows:
$$\lambda_{cvae} = e_t - D_{KL}(q_\phi(z_t|s_t) \,\|\, N(0, 1))$$
3.3 Proximal Policy Optimisation and Actor Critic methods
3.3.1 Vanilla Policy Gradient
Because Q-learning-based methods are less intuitive and exhibit high variance, simple policy
gradient methods (in which a line search is conducted along the gradient of the loss function) are
often preferred. A simple implementation of such algorithms is the Vanilla Policy Gradient. It was
found to have two major limitations: 1) it suffered from high variance due to the variance in the
neural network estimates of the value function; 2) it used a line-search gradient descent strategy,
which led to lower performance compared to trust-region-based methods.
3.3.2 Trust Region Policy Optimisation
Introduced in 2015, TRPO brought trust-region strategies to RL in place of the line-search strategy.
TRPO adds a KL-divergence constraint to enforce a trust region during optimisation. This ensures
that the updated policy does not move far from the old policy; in other words, the new policy
stays within the trust region of the old policy, so no single update deviates largely. However,
a major problem with this algorithm is its high computational complexity, which arises from second-order
gradient calculations. See [19] for details. The major equations are as follows:

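$$J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \hat{A}_{\theta_{old}}(s, a)\right]$$
$$\mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\left[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\right] \le \delta$$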
3.3.3 Proximal Policy Optimisation
Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The
motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while
using only first-order optimization.
Let r_t(θ) denote the probability ratio r_t(θ) = πθ(a_t|s_t) / πθold(a_t|s_t), so r(θold) = 1. TRPO
maximizes a "surrogate" objective:
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta) \hat{A}_t\right]$$
where CPI refers to conservative policy iteration. Without a constraint, maximization of L^CPI
would lead to an excessively large policy update; hence, PPO modifies the objective to penalize
changes to the policy that move r_t(θ) away from 1:
$$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right)\right]$$
where ε is a hyperparameter, say, ε = 0.2. The motivation for this objective is as follows. The
first term inside the min is L^CPI. The second term, clip(r_t(θ), 1 − ε, 1 + ε) Â_t, modifies the surrogate
objective by clipping the probability ratio, which removes the incentive for moving r_t outside of the
interval [1 − ε, 1 + ε]. Finally, we take the minimum of the clipped and unclipped objective, so the final
objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme,
we only ignore the change in probability ratio when it would make the objective improve, and we
include it when it makes the objective worse. See [20] for further details.
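The behaviour of the clipped objective can be checked with a small numerical sketch. The function names and the per-sample scalar interface are assumptions made for illustration; a real implementation would operate on batches of tensors.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_objective(ratios, advantages, epsilon: float = 0.2) -> float:
    """Empirical mean of the per-sample clipped surrogate over a batch."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)
```

For a positive advantage, a ratio of 1.5 is capped at 1.2 times the advantage, so there is no gain from moving further; for a negative advantage, the same ratio is not clipped, so the penalty for a harmful update is kept in full.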
3.3.4 Actor Critic Model
The Actor model learns which action to take in a particular observed state of the environment.
The job of the Critic model is to evaluate whether the action taken by the Actor led the
environment to a better state, and to give this feedback to the Actor, hence
its name. It outputs a real number indicating a rating (Q-value) of the action taken in the previous
state. Using this rating from the Critic, the Actor can compare its current policy
with a new policy and decide how to improve itself to take better actions. See [11] for
further details.
3.4 Smart Grids
3.4.1 Description
Smart grids have the potential to deploy millions or even billions of intelligent components into
the electric grid that will communicate in more diverse ways than previously thought of (Kundur et al.,
2011). In addition, a smart grid involves constant real-time transfer of information and has numerous
points of entry. A smart grid is required to create and allow access to a large amount of client energy
usage data (CEUD) (Han and Xiao, 2016a).
The primary objective of the smart grid is energy efficiency. Energy from traditional sources is
generated as alternating current (AC), which cannot be stored and is therefore normally
transmitted and used right away. Accordingly, smart meters have been developed and incorporated into the
network to strengthen client energy awareness, enabling clients to monitor and adjust their usage to
reduce energy expenses and thus manage the available energy efficiently.
Smart grid consists of a generation system, transmission system, distribution system, and control
and data centre. The generation system is sometimes distributed and is often operated as part of a
microgrid.
The control and data centres perform advanced control methods such as distributed automation
in real time through two-way communication with the substation. Furthermore, a smart grid system
contains intelligent grid systems that are intended to help implement self-healing capabilities in the
grid. The rationale behind self-healing is to detect problems in the grid early and address them
as soon as possible without human intervention. This makes the smart grid resilient to attacks and, as a
consequence, increases availability and enhances reliability. Self-healing is needed for the smart grid to be
able to redirect and adjust the flow of electricity through alternative paths if there is an interruption,
a task that can only be achieved through constant self-assessment of the state of the
power system (Baumeister, 2010). Distributed generation within the smart grid allows for the integration
of green energy sources such as solar or wind energy at various points in the grid, enabling customers
to generate their own power and also send excess energy back to the grid. Plug-in electric vehicles can
connect to the grid and charge or, in some cases, supply charge to the grid as deemed necessary (Han
and Xiao, 2015, 2016a, 2016d). Smart meters gather information on customer usage, which is later
used to monitor the grid efficiently and adapt supply to demand by anticipating usage peaks.
Several technologies play a vital role at the distribution level. A Smart Grid should
utilize these technologies in order to move the distribution system forward [10, 12]. These include:
advanced digital meters, distribution automation, low-cost communication systems, distributed
energy resources, broadband communications for distribution applications, closed-loop systems using
advanced protection, distributed storage and generation, real-time angle and voltage stability and
collapse detection, reactive power control based on intelligent coordination controls, and fault analysis
and reconfiguration schemes based on intelligent switching operations. A true Smart Grid will not
treat these technologies as separate issues. Instead, it will integrate them in order to maximize the benefits.
3.4.2 Goals
In order to design a Smart Grid, certain goals should be taken into account, such as: ensure observability;
create controllability of assets; enhance power system performance and security; and reduce the costs of
operations, maintenance, and system planning.
3.5 Benefits
If these goals are achieved, many benefits can be harvested, such as: improved system performance
meters; better customer satisfaction; improved ability to supply information for rate cases; visibility
of utility operation and asset management; availability of data for strategic planning, as well as better
support for digital summary; more reliable and economic delivery of power enhanced by information
flow and secure communication; life cycle management, cost containment, and end-to-end power
delivery; and improved access to historical data for strategic planning.
Figure 3.1: Benefits of smart grid
3.6 Types of attacks
The attack taxonomy includes traditional cyber-attacks as well as specific attacks that can be carried out
across the smart grid system domain. A novice or careless adversary can enter the network and
directly interrupt critical processes to cause a disaster. On the other hand, a more sophisticated
attacker may leave the normal operation of the system undisturbed in order to launch a distributed
attack [13]. This type of attack is more difficult to detect.
Table 3.1: Security threats for smart grids
Eavesdropping                Masquerade
EM/RF Interception           Repudiation
Media Scavenging             Bypassing Controls
Indiscretions by Personnel   Theft
Intercept/Alter              Authorization Violation
Replay                       Physical Intrusion
Virus/Worms                  Service Spoofing
Trojan Horse                 Man in the Middle
Trapdoor                     Integrity Violations
Cheating Customer            Resource Exhaustion
3.7 IDS specific to Smart Grids
The smart grid network is composed of a Home Area Network (HAN), a Neighborhood Area Network (NAN),
and a Wide Area Network (WAN).
Figure 3.2: Layer architecture of smart grid.
3.7.1 Intrusion Detection in Home Area Network
The home area network is the first layer of the smart grid and comprises a service component and a
metering component. The service module provides energy consumption and cost information. The metering
module provides the consumer's home energy consumption. There will be one IDS for every HAN. The IDS will
track inbound and outbound communication of the home area network to identify security breaches. When a
security breach occurs, the IDS will notify the home area network and send this information to the central
operational network administrator of the smart grid for further processing and necessary action.
3.7.2 Intrusion Detection in Neighborhood Area Network
The neighborhood area network (NAN) is the second layer of the smart grid. It is a large network
that collects service and metering information from multiple HANs that
are geographically adjacent to each other. The neighborhood area network consists of the smart meter
data collector (SMDC) and a central access controller (CAC). There will be one IDS for every NAN.
The CAC acts as an interface between the HANs and the energy supplier. The SMDC manages all
metering records for all the HANs in the neighborhood area network. All incoming and outgoing data
will be passed through the neighborhood area network IDS to check for possible security threats. When a
security breach occurs, the IDS will send a notification to the central operational network administrator
of the smart grid for further handling and necessary action.
3.7.3 Intrusion Detection in Wide Area Network
The wide area network provides wireless and wired communication among distributed network devices,
substations, and NANs. This layer also contains the SCADA controller and the energy distribution system
(EDS). There will be many IDS sensors at various locations in the wide area network that are part of the
smart grid but not included in the NANs. The EDS regulates metering data and energy distribution. SCADA
provides control to manage distribution grid elements. Data collected from the IDS sensors in the WAN will
be correlated to detect possible malicious activity or violations of any security policy. The network
administrator of the smart grid will be informed accordingly for additional processing and necessary action.
3.7.4 Deploying IDS in the Smart Grid
Our proposed framework includes several IDS sensors and a central IDS management unit. Each HAN
is protected by a separate IDS device. This IDS is configured by the smart grid operator and
is responsible for detecting attacks in one HAN. Similarly, every NAN should be secured by a separate
IDS. The NAN IDS can detect malicious activity in one NAN across all incoming and outgoing network
traffic. The WAN contains many IDS sensors that audit communication between the energy generation,
transmission, and distribution systems and the SCADA controller to maintain security in the smart grid
environment. IDS sensors in the HANs, NANs, and WAN send alerts to a central operational IDS management
system. The central IDS management system is responsible for further processing of these alerts and
provides functionality for identifying malicious activity in the smart grid.
Chapter 4
Dataset Description
4.1 NSLKDD Dataset
The NSL-KDD dataset is described as follows. Each record of the dataset [47,48] consists of
41 features divided into four categories: basic, content, traffic and host features. Each
record in the dataset is labelled as normal or as a specific class of attack. The training dataset consists
of 23 traffic classes, comprising 22 attack classes and one normal class. The test dataset consists of 38
traffic classes, of which 16 are novel attack classes and one is the normal class [23]. The attack classes
are grouped into four types, namely DoS, Probe, U2R, and R2L. The distribution of labelled data in
the training and testing datasets of NSL-KDD is shown in Figure 4.1.
Figure 4.1: Distribution of labelled data in the training and testing dataset in NSLKDD
4.2 ISOT-CID dataset
4.2.1 Overview
To evaluate our model we have used the ISOT Cloud Intrusion Dataset (ISOT-CID), which is the first
publicly available cloud-specific dataset [1]. The dataset was collected over the cloud infrastructure of
the 'Compute Canada' cloud service provider, which supports the computational needs of researchers [4].
The data was collected at various cloud layers of the OpenStack-based production environment, including
the hypervisor layer, the guest hosts layer and the network layer. The dataset consists
of data obtained from a variety of sources including network traffic, system logs, CPU performance,
memory dumps, system call traces, etc. For our purpose, we have used only the network traffic portion
of ISOT-CID. The dataset was collected in two phases, of which we have considered phase 2.
Aldribi et al. (2018) [29] describe the cloud platform, the data collection procedures, and the Phase 1
dataset. Data collection in both phases was carried out on the same cloud environment following the
same collection procedures. However, the second phase of collection occurred more recently and covered
a wider variety of newer attack vectors.
4.2.2 Cloud environment and Data collection outline
The ISOT-CID collection environment contained three hypervisor nodes (A, B, and C). The cloud
environment also consisted of 10 instances (VM1 to VM10), which were launched in three cloud zones
named A, B, and C. Five instances (VM2, VM3, VM4, VM5 and VM6) were launched in zone A, four
instances (VM7, VM8, VM9 and VM10) in zone B, and one instance (VM1) in zone C. Figure 4.2 shows
the structure of the cloud environment.
The data was collected in the cloud over several days, in time slots of 1–2 hours per day. Data
was collected with the help of various collector agents, which were classified and integrated into the
three cloud layers: VM- or instance-based agents, hypervisor-based agents and network-based agents.
The data collected through these collectors was forwarded to the ISOT lab log server for storage
and analysis. The dataset contains both normal and malicious activities. The malicious data consists
primarily of attacks executed by the tiger team. However, there were some unsolicited malicious sessions
undertaken by unknown external hackers, which were identified by the tiger team based on the source
IP addresses and attack timing. A wide variety of attack scenarios was covered, including simultaneous
attack scenarios, coordinated attack scenarios, etc. Moreover, the attacks were launched from a wide
variety of distinct geographical locations in Europe, North America, and Asia. The normal
data collected was also quite varied and complex, involving 160 legitimate visitors and ranging from data
involved in maintaining the status of VMs to rebooting, updating, creating files, SSHing to the machine,
etc. The types of attacks are listed in Table 4.1.
Figure 4.2: Cloud Environment Architecture used for ISOT Data collection
Table 4.1: ISOT dataset attack type distribution
Insider Attacks                          Outsider Attacks
Trojan Horse                             Unclassified (unsolicited traffic)
Backdoor (reverse shell)                 DNS amplification DoS
Unauthorized Crypto-mining               Ports and Network scanning
UDP Flood DoS                            Dictionary/Brute Force login attack
Stepping Stone Attack                    HTTP Flood DoS
Ports and Network scanning               Directory/Path Traversal
Synflood DoS                             Dictionary/Brute Force login attack
Revealing Users and Confidential Data    Fuzzers
Dictionary/Brute Force login attack      Synflood DoS
4.2.3 Attack categories
The ISOT-CID malicious activities were divided into outside and inside attacks, based on whether they
were performed by outsiders or insiders respectively. The outside attacks comprised those that were
made from the outside world (by the tiger team or as unsolicited activities). The inside malicious
activities were perpetrated either by an insider within the cloud environment who had high privileges
on the hypervisor nodes or by a compromised VM within the cloud environment that was later used as a
stepping stone for attacking other instances in the cloud or the outside world. Some of the inside
attacks were network scanning, password cracking, backdoor and Trojan horse, DoS attacks, etc. [29]
Figure 4.3: File Structure of ISOT CID

4.2.4 Network Traffic Data
The entire ISOT dataset, of size 8TB, included 55.2GB of network traffic data. The network
traffic data was composed of three levels of network communication:
• external traffic - traffic between the instances
• internal traffic or hypervisor traffic - traffic between the hypervisor nodes
• local traffic - traffic between two VMs on the same hypervisor node.
The collected network traffic data was stored in packet capture (pcap) format and made available
for public use. In phase 1, a total of 22,372,418 packets were captured, of which 15,649 (0.07%)
were malicious. In phase 2, a total of 11,509,254 packets were captured, of which
2,006,382 (17.43%) were malicious. The collected data was organised into folders based on the date
and hypervisor on which the data collection took place.
4.2.5 File Structure
The Logs directory contains two main directories, named after the two collection phases, Phase
1 and Phase 2. Under each directory, there are 3-5 levels of subdirectories and files. The fourth level
of directories contains the data collected from both hypervisors and VMs, stored separately for each
attack day. The fifth level contains separate directories for the different data types and sources,
named accordingly. The directory structure can be seen in Figure 4.3.
4.3 Preprocessing
4.3.1 Tranalyzer
Because packet payload processing involves a large volume and rate of data, flow-based analysis is
considered better suited to intrusion detection in high-speed networks
owing to its lower processing load [9]. However, the problem with such flow-based analysis is its high
false alarm rate [3]. To obtain flow-based data from packet-based data, we have used an open-source tool
called Tranalyzer, a lightweight flow generator and packet analyzer designed for practitioners
and researchers [2]. It was used to process the pcap files, and output files containing multiple JSON
objects were obtained. With the help of Tranalyzer, we were able to obtain about 1.8GB of flow-based
JSON output from about 32.2GB of packet-based input data in pcap format.
4.3.2 Processing of Tranalyzer output
All 37 JSON object files output by Tranalyzer were parsed. Each JSON object had a different number
of fields, depending on certain properties peculiar to the given flow. Such extra fields were identified
and removed while parsing the JSON objects. A list of dictionaries corresponding to the JSON objects
was obtained. This list was used to create the required CSV file. The CSV file was further processed
to deal with lists of numbers and lists of strings. Each list of numbers was replaced by the average of
its items. Similarly, each list of strings was replaced by its first string. All the strings and
hexadecimal values (representing particular characteristics of a flow) were then one-hot encoded to
further improve the training data. Values that were not integers or floating-point numbers were converted
to 'NaN' values. Chi-square feature selection was then performed. Finally, the rows and columns consisting
mostly of 'NaN' values were removed from the dataset, and the remaining 'NaN' values were replaced either
with the mean of all other values in the column or with zeroes, based on the characteristics of the
corresponding feature. We labelled the dataset based on the list of malicious IP addresses provided
along with the ISOT-CID documentation [3]. There were 272419 non-attack tuples and 9883
attack tuples. The data frame object was then converted to a NumPy array to be used by our
models. After preprocessing we found that our dataset was highly skewed, i.e., the number of
non-attack samples was much higher than the number of attack samples. Hence, to prevent
biased learning, we selected a portion of the dataset with a more balanced distribution, having
9883 attack samples and 14824 normal samples. Table 4.2 shows the distribution of the dataset in the
training and testing phases.
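Two of the steps described above, collapsing list-valued fields and selecting a more balanced subset, can be sketched as follows. This is a simplified illustration and not the preprocessing code used in this work: the function names, the flow dictionary layout, and the 0/1 label encoding (1 for attack) are assumptions.

```python
import random


def collapse_field(value):
    """Replace a list of numbers by its mean and a list of strings by its first item."""
    if not isinstance(value, list):
        return value
    if not value:
        return None  # treated as a missing value downstream
    if all(isinstance(v, (int, float)) for v in value):
        return sum(value) / len(value)
    return value[0]


def flatten_flow(flow, keep_fields):
    """Keep only the common fields of a flow record and collapse list-valued ones."""
    return {name: collapse_field(flow.get(name)) for name in keep_fields}


def undersample_normal(rows, labels, n_normal, seed=0):
    """Keep every attack row (label 1) and a random subset of normal rows (label 0)."""
    rng = random.Random(seed)
    attack_idx = [i for i, y in enumerate(labels) if y == 1]
    normal_idx = [i for i, y in enumerate(labels) if y == 0]
    kept = attack_idx + rng.sample(normal_idx, min(n_normal, len(normal_idx)))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]
```

With the counts reported above, calling `undersample_normal` with `n_normal=14824` would keep all 9883 attack tuples and 14824 of the 272419 non-attack tuples.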
Table 4.2: ISOT dataset distribution
Dataset    Total    Normal   Attack
Training   17296    10377    6919
Testing    7411     4447     2964
Chapter 5
Proposed Intrusion Detection System
In this chapter, we present our proposed intrusion detection system using deep reinforcement learning.
Before presenting the components of our system, we introduce the dataset and essential processing
elements related to our system.
5.1 Our Proposed Model
This section discusses the deployment architecture of our proposed cloud IDS. It broadly includes four
sub-components: the host network, the agent network, the administrator network, and the experience
replay unit. Figure 5.1 presents the DRL-based cloud IDS architecture. The host network contains the
running VMs, hypervisors, and host machines. The agent network connects to the host network via Virtual
Private Network (VPN) taps, which protect it from being compromised by an external attacker.
In addition, an intruder can alter the system call traces to make them appear normal, so that the
detection system fails to identify the intrusion. Hence, the VPN communication should be fast enough
to share all related information before the attacker can manipulate it.
The agent network obtains network packet information from the host network via VPN and performs
mainly flow-based analysis and the necessary preprocessing on this data to extract feature vectors. The
major preprocessing and flow analysis steps have already been explained in Section 4.3.
Figure 5.1: Cloud IDS Deployment Architecture
State: The state in RL is the input that the environment provides to the agent for taking actions.
In our case the state is described by the flow-based features of the network flow at a given point of time.
Action: An action refers to the agent's decision after monitoring the state of the cloud system
during a given time window. The agent applies a suitable policy to the output Q-values of the current
deep Q-network and obtains the agent result (refer to Figure 5.2).
Algorithm 1: CVAE DDQN Logic

Initialise replay memory D with capacity N, generated replay memory D_g with capacity N_g, minibatch size M,
    and proportion factor G;
Initialise the action-value function and the target action-value function with weights θ and θ⁻;
for episode = 1 to I do
    Observe state s_0;
    for t = 1 to T do
        Choose an action a_t based on the epsilon-greedy policy;
        Observe transition (s_t, a_t, r_t, s_{t+1});
        Store transition (s_t, a_t, r_t, s_{t+1}) in D;
        Sample a random minibatch of transitions (s_t, a_t, r_t, s_{t+1}) from D;
        Generate transition (s_t, a_t, r'_t, s'_{t+1});
        Compute the prediction error e_t;
        Store transition (s_t, a_t, r'_t + β e_t, s'_{t+1}) in D_g;
        Randomly sample M × (1 − G) transitions (s_j, a_j, r_j, s_{j+1}) from D;
        Randomly sample M × G transitions (s_j, a_j, r_j, s_{j+1}) from D_g;
        if episode terminates at step j+1 then
            y_j ← r_j;
        else
            y_j ← r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻);
        end
        Perform gradient descent on (y_j − Q(s_j, a_j; θ))² w.r.t. the network parameters θ;
        Every C steps, set θ⁻ ← θ;
    end
end
Reward: A reward indicates the feedback from the environment about the action by an agent. In
our case the reward is calculated by subtracting the Q2 value from the Q1 value and it is multiplied
by -1 if the policy based decision deviates from the ground truth.
Figure 5.2: Pictorial representation of agent result calculation using DDQN and prioritised learning.
CP1, CP2, ..., CPk represent the classifier predictions, in the form of 1 (for attack) and 0 (for normal),
of classifiers C1, C2, ..., Ck respectively
DDQN architecture: In DDQN, the major objective is to handle the overestimation of action
values that takes place in DQN. One of the major principles behind DDQN is the decoupling of the
action selection and action evaluation components. This is achieved in our case by using two different
neural networks. One of them implements the current Q function while the other implements the
target Q function. Backpropagation takes place in the current Q network, and its weights are
copied into the target Q network with delayed synchronization (copying is done at regular
intervals of a fixed number of epochs). In our experiments we have used an interval of 32 epochs, as it
was found to give the best results in most cases. The actions (raising appropriate alarms) are
taken as per the current Q function, but the new Q values are estimated using the target Q
function. This is done to avoid the moving target effect when performing gradient descent. A similar
approach has been used by Manuel Lopez-Martin, Belen Carro and Antonio Sanchez-Esguevillas in their
work [14]. This method of delayed synchronization between the two neural networks ensures the required
decoupling and thereby handles the moving target effect of DQN.
Here, we present our algorithms for intrusion detection. Algorithm 1 shows the working of the
CVAE along with DDQN in each target network update cycle. The function of the administrator
network is shown in Algorithm 2. It uses a voting system to identify the presence or absence of an
attack.
Algorithm 2: Administrator Network Logic

Get agent_result and feature_vector from the agents;
for each agent do
    p = number of bits set in agent_result;
    k = number of bits not set in agent_result;
    if p ≤ k then
        status ← "normal";
    else
        status ← "attack";
    end
    if (status == "attack") then
        Pre-process feature_vector for attack classification;
        Input the processed feature_vector to the classifier;
        attack_type = output of classifier;
    else
        attack_type = "normal";
    end
    Get the actual result from the environment;
    Send the actual result to the agent for use in the calculation of rewards;
end
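The voting step of Algorithm 2 can be sketched in a few lines. The function name and the representation of the agent result as a list of 0/1 classifier predictions are assumptions made for illustration.

```python
def vote(agent_result):
    """Majority vote over classifier bits: 1 means attack, 0 means normal."""
    p = sum(1 for bit in agent_result if bit == 1)  # number of bits set
    k = len(agent_result) - p                       # number of bits not set
    # Following Algorithm 2, a tie (p == k) is resolved as "normal".
    return "normal" if p <= k else "attack"
```

For example, `vote([1, 1, 0])` yields "attack", while a tie such as `vote([1, 0])` yields "normal" because the condition in Algorithm 2 is p ≤ k.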
Functional task of Cloud Administrator: The cloud administrator runs Algorithm 2, in which it
constantly monitors the activities of the cloud system and detects its state. On receiving the agent
results, it checks for an intrusion and accordingly shares the actual result with the agent. It also
identifies the location of the intrusion, including entry doors and target VMs.
Interplay between the components: The Agent Network gets its input from the Host Network. It conducts
flow-based analysis and preprocessing before feeding the input to the DDQN model. The DDQN model
predicts the output. The actual result is obtained from the administrator network. Based on the result,
the reward is calculated. The input state, along with the action, reward and output state, is stored in
an experience pool. The CVAE generates tuples that receive an additional intrinsic curiosity reward.
Following Algorithm 1, normal experience replay tuples are sampled with probability 1-G and the generated
experience tuples with probability G. These tuples are then used to retrain the DDQN model.
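The mixing of stored and generated experience can be sketched as below, following the proportions in Algorithm 1 (M × (1 − G) transitions from the replay memory and M × G from the generated memory). The function name, the plain-list buffers and the rounding of the generated share are assumptions made for illustration.

```python
import random


def sample_mixed_minibatch(real_buffer, generated_buffer, batch_size, g, rng=random):
    """Draw a minibatch with a fraction g of CVAE-generated transitions."""
    # Number of generated transitions, capped by what the generated pool holds.
    n_generated = min(int(round(batch_size * g)), len(generated_buffer))
    # The remainder comes from the ordinary replay memory.
    n_real = min(batch_size - n_generated, len(real_buffer))
    batch = rng.sample(real_buffer, n_real) + rng.sample(generated_buffer, n_generated)
    rng.shuffle(batch)
    return batch
```

With a minibatch size of 32 and G = 0.25, this draws 24 stored transitions and 8 generated ones.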
5.2 Architecture Modifications for PPO Powered Actor Critic Model
Figure 5.3: Pictorial representation of PPO based Actor Critic Model deployment
As can be seen in Figure 5.3, the architecture used for the DDQN CVAE algorithm can be used for the
PPO AC algorithm with minor modifications. We no longer need the experience replay pool or the CVAE
data pool. The resulting architecture is much simpler as a consequence. As can be seen in Figure 5.4,
we use two neural networks in this case as well. However, unlike in DDQN CVAE, here backpropagation
happens in both neural networks, one for the actor and the other for the critic. The output of the actor
model helps us decide which action to take, i.e., whether to raise an alarm or not. The critic estimates
the value of a given state, and its output is used in the Generalised Advantage Estimation (GAE)
algorithm. As can be seen in Algorithm 3, we train in episodes of 128 tuples. We first obtain the outputs
for the 128 tuples. We then calculate the advantage values using the GAE algorithm. These values are used
to calculate the PPO loss for training the actor model. The critic model is trained using the mean
squared error loss function. Although we have trained the actor and critic models sequentially, they
can be trained in parallel as well. The GAE algorithm can also be run in a pipelined fashion,
promoting further parallelism.
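The advantage computation over one episode can be sketched as follows. This is a minimal illustration of Generalised Advantage Estimation under stated assumptions: the function name, the default values of γ and λ, and the convention that the value list carries one extra entry for the state after the last step are all illustrative, and episode-termination masking is omitted.

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE pass; values must hold len(rewards) + 1 critic outputs."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for i in reversed(range(len(rewards))):
        # One-step temporal-difference error for step i.
        delta = rewards[i] + gamma * values[i + 1] - values[i]
        # Exponentially weighted sum of the current and later TD errors.
        gae = delta + gamma * lam * gae
        advantages[i] = gae
    # Advantage plus value gives the regression target for the critic (MSE loss).
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns
```

For an episode of 128 tuples, `rewards` would have 128 entries and `values` 129, the last being the critic output for the state reached after the final action.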
Figure 5.4: Pictorial representation of PPO based Actor Critic Model
Algorithm 3: Proximal Policy Optimised Actor Critic Model Logic

Initialise the actor NN and the critic NN with weights θ and θ_C;
for episode = 1 to I do
    Save the current policy as π_new;
    for i = 1 to 128 do
        Observe state s_i;
        Input the state vector to the actor and critic networks;
        Get the actor result a_i using the epsilon-greedy policy;
        Get the critic result v_i;
        Get the new state s_{i+1};
        Get the reward r_i;
        Store the transition (s_i, a_i, v_i, r_i, s_{i+1});
    end
    Initialise gae_129 = 0 (v_129 is the critic result for the final state s_129);
    for i = 128 down to 1 do
        δ_i = r_i + γ ∗ v_{i+1} − v_i;
        gae_i = δ_i + γ ∗ λ ∗ gae_{i+1};
    end
    ratio = π_new / π_old;
    Get the loss for each tuple;
    actor loss = −min(ratio ∗ gae, clip(ratio, 1 − ε, 1 + ε) ∗ gae), averaged over the tuples;
    Get the critic loss;
    Store the policy as π_old;
    Perform a gradient update on the actor and critic based on the losses;
end
Chapter 6
Experimental results and discussion
We implemented our models in Python and evaluated their performance on the ISOT-CID and NSL-KDD
datasets. The experimental results include four standard machine learning performance metrics,
i.e., FPR (False Positive Rate), TPR (True Positive Rate), ACC (Accuracy), and AUC (Area under
the ROC Curve). Although CVAE DDQN is an online learning model, in our implementation we have
set aside a part of each dataset for testing. We implemented the preprocessing
to create lists of JSON objects and later converted them into CSV files (in the case of the ISOT
dataset). The flow-based analysis was done before running CVAE-DDQN, but it can be done in parallel
with the CVAE-DDQN model. Similarly, we have run the CVAE in sequence
with the DDQN algorithm, but it can and will be run in parallel in a real-world scenario. Similar
parallelism can be applied to the PPO Actor Critic model as well. Upon testing with the datasets,
we obtain the following results:

Table 6.1: Model Performance on ISOT-CID and NSL-KDD Datasets
Model               Dataset    Accuracy (%)  FPR (%)  AUC
CVAE DDQN           ISOT-CID   98.16         1.56     0.896
CVAE DDQN           NSL-KDD    89.20         1.77     0.8812
PPO A-C             ISOT-CID   97.44         1.16     0.868
PPO A-C             NSL-KDD    87.08         1.64     0.872
DDQN [22]           ISOT-CID   96.87         1.57     0.886
DDQN [22]           NSL-KDD    83.40         1.48     0.8432
Anomaly Based [13]  NSL-KDD    82.1          -        -
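The metrics reported in Table 6.1 follow their standard definitions, which the sketch below spells out for a binary attack (1) versus normal (0) labelling. It is an illustration only: the function names and the toy data are assumptions, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than with whatever library routine was used for the thesis experiments.

```python
from typing import Dict, Sequence


def _safe_div(num: float, den: float) -> float:
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def classification_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, false positive rate and true positive rate for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "ACC": _safe_div(tp + tn, tp + tn + fp + fn),
        "FPR": _safe_div(fp, fp + tn),  # normal traffic flagged as attack
        "TPR": _safe_div(tp, tp + fn),  # attacks correctly detected
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random attack sample outscores a random normal one.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _safe_div(wins, len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0]            # toy data, not from the datasets
    predictions = [1, 0, 0, 0, 1]
    attack_scores = [0.9, 0.4, 0.3, 0.2, 0.6]
    print(classification_rates(labels, predictions))
    print(roc_auc(labels, attack_scores))
```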
6.1 Performance of model on continuously changing attack types
The ISOT dataset collects logs over a span of eighteen days, where each day has new attack types.
However, the majority of the volume belongs to the first six days, and each of these days has a new
attack type. To understand how well the models adapt to novel attacks, we ran experiments in which
a model constantly faces new attacks and recorded its accuracy, FPR and AUC (refer to
Tables 6.2 and 6.3). The performance on the i-th day is obtained by training the model on the data
from day 1 to day (i-1) and evaluating it on the data of day i. This is similar to a real-world
situation, where the model would face novel attacks on each passing day and its predictions would
depend on what it learned in the past. As can be seen from Tables 6.2 and 6.3, our models perform
fairly well even when they are trained on only a few days of data and tested on unknown attack types.
The steady improvement in accuracy and AUC over subsequent days, together with a generally
falling FPR, suggests high adaptability and robustness in long-term use.
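The day-wise protocol described above (train on days 1 to i-1, evaluate on day i) can be sketched as follows. Everything named here is hypothetical: `MajorityClassModel` is a trivial placeholder standing in for CVAE DDQN or the PPO Actor Critic model, and the sketch assumes a fresh model is fitted for each evaluation day, which is one possible reading of the protocol rather than a confirmed detail of the experiments.

```python
from typing import Callable, List, Tuple

Day = Tuple[List[List[float]], List[int]]  # (feature rows, 0/1 labels) for one day


class MajorityClassModel:
    """Placeholder learner that predicts the most common training label."""

    def __init__(self) -> None:
        self.label = 0

    def fit(self, features: List[List[float]], labels: List[int]) -> None:
        self.label = 1 if sum(labels) * 2 >= len(labels) else 0

    def predict(self, features: List[List[float]]) -> List[int]:
        return [self.label] * len(features)


def evaluate_day_by_day(days: List[Day],
                        make_model: Callable[[], MajorityClassModel]
                        ) -> List[Tuple[int, float]]:
    """Return (day number, accuracy) for every day from day 2 onwards."""
    results = []
    for i in range(1, len(days)):
        # Pool all data from the days before the evaluation day.
        train_x = [row for features, _ in days[:i] for row in features]
        train_y = [label for _, labels in days[:i] for label in labels]
        model = make_model()
        model.fit(train_x, train_y)
        test_x, test_y = days[i]
        predictions = model.predict(test_x)
        correct = sum(1 for p, t in zip(predictions, test_y) if p == t)
        results.append((i + 1, correct / len(test_y)))
    return results


if __name__ == "__main__":
    toy_days = [([[0.0], [1.0]], [1, 1]), ([[2.0], [3.0]], [1, 0]), ([[4.0]], [0])]
    print(evaluate_day_by_day(toy_days, MajorityClassModel))
```

Day 1 has no result because no earlier data exists to train on, which matches the empty first row of the day-wise tables.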

Table 6.2: Performance of DDQN CVAE model on daily changing attack type
Day  Attack Type             No. of samples  ACC     FPR    AUC
1    DTA and UCM             24622           -       -      -
2    NS                      66124           83.11%  3.31%  0.8421
3    SQLI, CSS, PT, S-DOS    36517           90.16%  2.24%  0.8801
4    BFLA (failed)           43489           93.11%  2.32%  0.9301
5    UCM, DNSADOS, HTTPFDOS  48716           94.10%  2.00%  0.9448
ACC: Accuracy; DTA: Dictionary Traversal Attack; UCM: Unauthorized Crypto-mining; NS: Network
scanning; SQLI: SQL Injection; CSS: Cross-site Scripting (XSS); PT: Path Traversal; S-DOS: Slowloris
DOS; BFLA: Brute Force login attack; DNSADOS: DNS amplification DOS; HTTPFDOS: HTTP flood DOS.
Table 6.3: Performance of PPO Actor Critic model on daily changing attack type
Day  Attack Type             No. of samples  ACC     FPR    AUC
1    DTA and UCM             24622           -       -      -
2    NS                      66124           81.24%  2.80%  0.8412
3    SQLI, CSS, PT, S-DOS    36517           89.99%  2.05%  0.8774
4    BFLA (failed)           43489           92.56%  1.99%  0.9211
5    UCM, DNSADOS, HTTPFDOS  48716           93.01%  1.81%  0.9312
ACC: Accuracy; DTA: Dictionary Traversal Attack; UCM: Unauthorized Crypto-mining; NS: Network
scanning; SQLI: SQL Injection; CSS: Cross-site Scripting (XSS); PT: Path Traversal; S-DOS: Slowloris
DOS; BFLA: Brute Force login attack; DNSADOS: DNS amplification DOS; HTTPFDOS: HTTP flood DOS.
6.2 Attack specific classification on NSL-KDD Dataset
The NSL-KDD dataset captures a variety of attacks. We train and test our models on attack-specific
selections from the dataset and present the results here.
Table 6.4: Performance of CVAE DDQN model on attack-specific selections of the NSL-KDD dataset
Sl No  Attack Type  ACC     FPR     AUC
1      DOS          98.80%  4.1%    0.974
2      Probe        86.01%  11.41%  0.8421
3      R2L          88.41%  0.33%   0.8408
4      L2R          90.88%  4.08%   0.8891
Table 6.5: Performance of PPO Actor Critic model on attack-specific selections of the NSL-KDD dataset
Sl No  Attack Type  ACC     FPR     AUC
1      DOS          96.10%  3.4%    0.945
2      Probe        88.01%  6.21%   0.8946
3      R2L          85.12%  0.16%   0.8025
4      L2R          88.84%  3.48%   0.8664
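One way to build an attack-specific selection of the kind used in this section is to keep the normal records together with the records of a single attack category. The sketch below shows that idea only. The thesis does not state how the selections were made, so the selection rule, the function names, and the field name `category` and label `normal` are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def select_attack_subset(records: List[Record],
                         attack: str,
                         label_key: str = "category",      # hypothetical field name
                         normal_label: str = "normal") -> List[Record]:
    """Keep normal records plus the records of one attack category."""
    return [r for r in records if r[label_key] in (normal_label, attack)]


def split_features_labels(records: List[Record],
                          label_key: str = "category",
                          normal_label: str = "normal"
                          ) -> Tuple[List[Record], List[int]]:
    """Separate the label column and map it to 0 (normal) or 1 (attack)."""
    features = [{k: v for k, v in r.items() if k != label_key} for r in records]
    labels = [0 if r[label_key] == normal_label else 1 for r in records]
    return features, labels


if __name__ == "__main__":
    # Toy records; real NSL-KDD rows carry many more fields.
    toy = [
        {"duration": "0", "category": "normal"},
        {"duration": "5", "category": "dos"},
        {"duration": "2", "category": "probe"},
    ]
    dos_only = select_attack_subset(toy, "dos")
    print(split_features_labels(dos_only))
```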
Chapter 7
Conclusion and Future Work
7.1 Future Work
In the future, I plan to build a distributed, highly scalable IDS that works well in real
time in real-world scenarios. The datasets and experiments can also be analysed further
to gain deeper insights. In addition, deployment and testing can be done on a real-world smart grid
to obtain a realistic performance evaluation of our models and to create a smart-grid-specific
Intrusion Detection System. I would also like to explore a GAN-based Reinforcement Learning model
and a Rainbow DQN-based model.
7.2 Conclusion
During my B.Tech project work under Dr. Padmalochan Bera, I worked on building advanced DRL-guided
NIDSs that could provide very high accuracy and low FPR when evaluated in a smart-grid-specific
environment. The major task was to find and preprocess the large volume of data that the
datasets carried. We extracted a few hundred MB of highly relevant data from about 8 TB
of data present in the ISOT-CID dataset. We also extracted the relevant data from the
NSL-KDD dataset. Our aim was to meet the real-world constraint of limited processing resources and
to ensure that our models adapt to novel attacks and changing attack patterns. For this,
we experimented on the dataset with a flow-based technique that is computationally lighter.
We introduced the Double Deep Q Network-based IDS to handle the overestimation of action values
in the Deep Q Learning-based model. We combined the model with a Curiosity Driven Variational
Autoencoder to generate meaningful experiences with which to retrain our model. We also proposed
a Proximal Policy Optimised Actor Critic Reinforcement Learning model that achieves similar accuracy
with a lower FPR and much lower computational and logical complexity. On ISOT-CID, CVAE DDQN
reaches 98.16% accuracy with 1.56% FPR, and PPO Actor Critic reaches 97.44% accuracy with 1.16% FPR
(Table 6.1). We also conclude that while both models give excellent overall results, CVAE DDQN is
more suitable when higher accuracy is required, while PPO Actor Critic is more suitable when a lower
FPR is required. Section 6.1 shows our system's ability to handle newer attack types (even with very
little training data). The experimental results show high usability and effectiveness of the models
for deployment on smart grid platforms. We intend to deploy the proposed architectures in a practical
smart grid environment and evaluate their performance in the future. I played an integral role in all
the processes that were part of the design and implementation of our models and take great pleasure
in the encouraging results.
Chapter 8
Publications
1. [Published] Kamalakanta Sethi, Dinesh Mohanty, Rahul Kumar, Padmalochan Bera, "Robust
Adaptive Cloud Intrusion Detection System Using Advanced Deep Reinforcement Learning,"
2020. In: Security, Privacy, and Applied Cryptography Engineering. doi: 10.1007/978-3-030-66626-2_4
2. [Accepted] Dinesh Mohanty, Kamalakanta Sethi, Sai Prasath, Padmalochan Bera, "Intelligent
Intrusion Detection System for Smart Grid Applications", 2021 International Conference
on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA).
Chapter 9
References
Bibliography
[1] ISOT-CID website.
[2] Tranalyzer documentation.
[3] Abdulaziz Aldribi, Issa Traore, Onyekachi Nwamuo, and Paulo Magella de Faria Quinan.
Documentation for the ISOT cloud intrusion detection benchmark dataset (ISOT-CID), 2020.
[4] Abdulaziz Aldribi, Issa Traoré, Belaid Moa, and Onyekachi Nwamuo. Hypervisor-based cloud
intrusion detection through online multivariate statistical change tracking. Computers & Security,
88:101646, 2020.
[5] Z. Chiba, N. Abghour, K. Moussaid, A. El Omri, and M. Rida. A cooperative and hybrid network
intrusion detection framework in cloud computing based on Snort and optimized back propagation
neural network. Procedia Computer Science, 83:1200–1206, 2016. The 7th International
Conference on Ambient Systems, Networks and Technologies (ANT 2016) / The 6th International
Conference on Sustainable Energy Information Technology (SEIT-2016) / Affiliated Workshops.
[6] Y. Yang et al. Intrusion detection system for network security in synchrophasor systems. In
IET Int. Conf. Inf. Commun. Technol. (IETICT), pages 246–252, 2013.
[7] N. Goldenberg and A. Wool. Accurate modeling of Modbus/TCP for intrusion detection in SCADA
systems. In Int. J. Critical Infrastructure Protection, volume 06, pages 63–75, 2016.
[8] Han GJ., Zhang XF., Wang H., and Mao CG. Curiosity-driven variational autoencoder for deep Q
network. In Lauw H., Wong RW., Ntoulas A., Lim EP., Ng SK., Pan S. (eds) Advances in
Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol
12084. Springer, Cham, pages 1–6, 2020.
[9] Hashem Alaidaros, Massudi Mahmuddin, and Ali Al Mazari. An overview of flow-based
and packet-based intrusion detection performance in high speed networks. In Proceedings
of the International Arab Conference on Information Technology, 2011.
[10] H. A. Kholidy and F. Baiardi. CIDD: A cloud intrusion detection dataset for cloud computing
and masquerade attacks. In 2012 Ninth International Conference on Information Technology -
New Generations, pages 397–402, 2012.
[11] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Society for Industrial and Applied
Mathematics, 42, 04 2001.
[12] Z. Li, W. Sun, and L. Wang. A neural network based distributed intrusion detection system on
cloud platform. In 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence
Systems, volume 01, pages 75–79, 2012.
[13] M. A. Faisal, Z. Aung, J. R. Williams, and A. Sanchez. Data-stream-based intrusion detection
system for advanced metering infrastructure in smart grid: A feasibility study. In IEEE Syst.
J., volume 09, pages 31–44, 2015.
[14] Manuel López Martín, Belén Carro, and Antonio Sánchez-Esguevillas. Application of deep
reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl., 141,
2020.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Prioritized experience replay. In Nature 518, pages
1–6, 2015.
[16] N. Boumkheld, M. Ghogho, and M. El Koutbi. Intrusion detection system for the detection
of blackhole attacks in a smart grid. In Proc. 4th Int. Symp. Comput. Bus. Intell. (ISCBI),
volume 01, pages 108–111, 2016.
[17] S. Parampottupadam and A. Moldovann. Cloud-based real-time network intrusion detection
using deep learning. In 2018 International Conference on Cyber Security and Protection of Digital
Services (Cyber Security), pages 1–8, 2018.
[18] S. Pan, T. Morris, and U. Adhikari. Developing a hybrid intrusion detection system using data
mining for power systems. In IEEE Trans. Smart Grid, volume 06, pages 3104–3113, 2015.
[19] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust
region policy optimization, 2017.
[20] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms, 2017.
[21] K. Sethi, R. Kumar, N. Prajapati, and P. Bera. Deep reinforcement learning based intrusion
detection system for cloud infrastructure. In 2020 International Conference on COMmunication
Systems & NETworkS (COMSNETS), pages 1–6, 2020.
[22] Kamalakanta Sethi, Rahul Kumar, Dinesh Mohanty, and Padmalochan Bera. Robust adaptive
cloud intrusion detection system using advanced deep reinforcement learning. In Lejla Batina,
Stjepan Picek, and Mainack Mondal, editors, Security, Privacy, and Applied Cryptography
Engineering, pages 66–85, Cham, 2020. Springer International Publishing.
[23] S. Thrun and A. Schwartz. Prioritized experience replay. In M. Mozer, P. Smolensky,
D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models
Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum, pages 1–6, 1993.
[24] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double
Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[25] Wei Xiong, Hanping Hu, Naixue Xiong, Laurence T. Yang, Wen-Chih Peng, Xiaofei Wang, and
Yanzhen Qu. Anomaly secure detection methods by analyzing dynamic characteristics of the
network traffic in cloud communications. Information Sciences, 258:403–415, 2014.
[26] Y. Kwon, H. K. Kim, Y. H. Lim, and J. I. Lim. A behavior-based intrusion detection technique
for smart grid infrastructure. In IEEE Eindhoven PowerTech, volume 0, pages 1–6, 2015.
[27] H. Yoo and T. Shon. Novel approach for detecting network anomalies for substation automation
based on IEC 61850. In Multimedia Tools Appl., volume 74, pages 303–318, 2015.
[28] Z. Feng, S. Qin, X. Huo, P. Pei, Y. Liang, and L. Wang. Snort improvement on PROFINET RT
for industrial control system intrusion detection. In 2nd IEEE Int. Conf. Comput. Commun.
(ICCC), pages 942–946, 2016.
[29] A. Aldribi, I. Traore, and B. Moa. Data sources and datasets for cloud intrusion detection
modeling and evaluation. 2018.