
Dinesh BTP Thesis 1


CVAE DDQN and PPO Actor Critic based IDS for Smart Grid Environment

Thesis Submitted in partial fulfillment of


the requirements for the degree of

Bachelor of Technology
in
Computer Science And Engineering

by
Dinesh Mohanty

Under the guidance of


Dr. Padmalochan Bera

SCHOOL OF ELECTRICAL SCIENCES


INDIAN INSTITUTE OF TECHNOLOGY
BHUBANESWAR
May 2021
©2021 Dinesh Mohanty. All rights reserved.
Certificate

Candidate’s Declaration

I hereby declare that the work presented in the thesis entitled “CVAE DDQN and PPO Actor
Critic based IDS for Smart Grid Environment” in fulfillment of the requirements for the award
of the Degree of Bachelor of Technology and submitted in the School of Electrical Sciences of the
Indian Institute of Technology Bhubaneswar is an authentic record of my own work carried out under
the supervision of Prof. Dr. Padmalochan Bera, School of Electrical Sciences, Indian Institute of
Technology Bhubaneswar.

The matter presented in this thesis has not been submitted by me for the award of any other degree
of this or any other Institute/University.

Dinesh Mohanty
(Roll no. 17CS01051)

This is to certify that the above statement made by the candidate is true to the best of our knowledge
and belief.

Dr. Padmalochan Bera


Place: School of Electrical Sciences
Date: Indian Institute of Technology Bhubaneswar

Acknowledgement

It is a great honor to express my most profound respect and sense of gratitude to my B. Tech Project
supervisor Dr. Padmalochan Bera for his knowledge, insights, expertise, guidance, enthusiastic
involvement, and persistent encouragement during the planning and development of this thesis work.
I also gratefully acknowledge his meticulous efforts in thoroughly going through and improving many
of my research manuscripts without which this work could not have been completed.

I am highly obliged to all the professors of the Computer Science Department, including Dr.
Manoranjan Satpathy, Dr. D. P. Dogra, Dr. Joy Chandra Mukherjee, Dr. Srinivas
Pinisetty, and Dr. Sudipta Saha for providing all the guidance, help, and encouragement during
my last four years at college.

I am extremely grateful to my parents and grandparents for their moral support, love,
encouragement, and blessings to complete this task. I am especially thankful to Kamalakanta Sethi
for all the support and mentoring, as well as providing excellent exposure in the field of research.

I thank all of my co-authors: Sai Prasath, Kamalakanta Sethi, Rahul Kumar, and Dr.
Padmalochan Bera for correcting and inspiring most of the contents of this thesis.

I would like to express my deep and sincere thanks to my friends and all other persons whose names
do not appear here, for helping me either directly or indirectly in all even and odd times.

Finally, I am indebted and grateful to the Almighty for helping me in this endeavor.

Abstract

The smart grid is a cyber-physical system in which hardware, software, and physical components are
integrated and interact to sense the fluctuating state of the physical world. It provides on-demand
electricity to customers from centralized and distributed generation stations using information and
communication technologies. Power companies can deliver reliable power at reduced cost and can
control the power demand. Security is considered one of the most significant concerns in a smart
grid system. Smart grids improve and enhance the capabilities of conventional power networks, but
they also make these networks more prone to cyber-attacks, which may compromise the integrity and
confidentiality of the network. An Intrusion Detection System (IDS) has proven to be one of the
significant ways of providing safe and robust services in a smart grid environment. In this work, I
propose two advanced Reinforcement Learning based intrusion detection system frameworks for the
smart grid that utilise the three-layer architecture of the smart grid system. The proposed
framework has an IDS in each HAN and NAN and many IDS sensors in the WAN. All malicious activities
are reported to the central management unit, which then correlates and investigates the alerts
produced by the various distributed sensors using an anomaly-based detection methodology. We use
robust IDSs with low false alarm rates, based on Deep Reinforcement Learning along with generative
models and Proximal Policy Optimisation methods. Unfortunately, there is a lack of smart grid-based
intrusion datasets; we compensate for this with the well-known conventional network-based NSL-KDD
dataset and the cloud-based ISOT dataset. Upon testing, the proposed models showed excellent novel
attack detection capabilities and strong performance metrics. Finally, the real-world adaptiveness
of our models was evaluated using day-wise changes in the attack pattern and by checking our models
against specific attack types.

Contents

Certificate
Acknowledgement
Abstract
List of Figures
List of Tables
1 Introduction
2 Related work based on literature study
3 Background
3.1 Double Deep Q Learning (DDQN)
3.1.1 Q Learning
3.1.2 Deep Q Learning
3.1.3 Double Deep Q Learning
3.2 Curiosity Driven Variational Autoencoder
3.3 Proximal Policy Optimisation and Actor Critic methods
3.3.1 Vanilla Policy Gradient
3.3.2 Trust Region Policy Optimisation
3.3.3 Proximal Policy Optimisation
3.3.4 Actor Critic Model
3.4 Smart Grids
3.4.1 Description
3.4.2 Goals
3.5 Benefits
3.6 Types of attacks
3.7 IDS specific to Smart Grids
3.7.1 Intrusion Detection in Home Area Network
3.7.2 Intrusion Detection in Neighborhood Area Network
3.7.3 Intrusion Detection in Wide Area Network
3.7.4 Deploying IDS in the Smart Grid
4 Dataset Description
4.1 NSLKDD Dataset
4.2 ISOT-CID dataset
4.2.1 overview
4.2.2 Cloud environment and Data collection outline
4.2.3 Attack categories
4.2.4 Network Traffic Data
4.2.5 File Structure
4.3 Preprocessing
4.3.1 Tranalyzer
4.3.2 Processing of Tranalyzer output
5 Proposed Intrusion Detection System
5.1 Our Proposed Model
5.2 Architecture Modifications for PPO Powered Actor Critic Model
6 Experimental results and discussion
6.1 Performance of model on continuously changing attack types
6.2 Attack specific classification on NSL-KDD Dataset
7 Conclusion and Future Work
7.1 Future Work
7.2 Conclusion
8 Publications
9 References

List of Figures

3.1 Benefits of smart grid
3.2 Layer architecture of smart grid
4.1 Distribution of labelled data in the training and testing dataset in NSLKDD
4.2 Cloud Environment Architecture used for ISOT Data collection
4.3 File Structure of ISOT CID
5.1 Cloud IDS Deployment Architecture
5.2 Pictorial representation for agent result calculation using DDQN and prioritised learning. CP1, CP2, ..., CPk represent the classifier predictions in the form of 1 (for attack) and 0 (for normal) for classifiers C1, C2, ..., Ck respectively
5.3 Pictorial representation of PPO based Actor Critic Model deployment
5.4 Pictorial representation of PPO based Actor Critic Model

List of Tables

3.1 security threats for smart grids
4.1 ISOT dataset attack type distribution
4.2 ISOT dataset distribution
6.1 Model Performance on ISOT CID and NSLKDD Dataset
6.2 Performance of DDQN CVAE model on daily changing attack type
6.3 Performance of PPO Actor Critic model on daily changing attack type
6.4 Performance of CVAE DDQN model on daily changing attack type
6.5 Performance of PPO Actor Critic model on daily changing attack type

Chapter 1

Introduction

Today, electric power distribution is made possible by the power distribution grid, a system of
transmission mediums that allows electricity to be transferred from the point of generation to
consumers such as homes, offices, or industries. The electrical grid is expected to evolve into a
new grid paradigm: the smart grid, which uses two-way flows of electricity and information to create
an automated and distributed advanced energy delivery network. A smart grid is an electricity
network that can intelligently integrate the actions of all users connected to it (generators,
consumers, and those that do both) in order to optimize the production, supply, and consumption of
electricity and provide several features to its customers [1]. Security is considered one of the
most significant concerns in a smart grid system. The smart grid domain can be divided into two main
components: system and network. The system component includes service providers, the electricity
utility operation center, smart meters, electrical household appliances, and renewable energy
resources. The network component consists of three types of communication networks that are
incorporated in smart grid operation: the Home Area Network (HAN) or Business Area Network (BAN),
the Neighborhood Area Network (NAN), and the Wide Area Network (WAN). HANs connect smart devices in
the home with a smart meter and can communicate using Bluetooth, ZigBee, wireless, or wired
Ethernet. The NAN is a widespread network that collects service and metering information from many
HANs that are geographically adjacent to each other. The WAN provides wireless and wired
communication between the distributed grid, NANs, and the utility. WAN communication can be
achieved using fiber optics, WiMAX, or 4G/3G/GSM/LTE [3].

The smart grid improves and enhances the capabilities of the conventional power network, but at the
same time makes it more vulnerable to different types of attack. These weaknesses permit an attacker
to break the integrity and confidentiality of the network and to gain access to it. The more severe
vulnerabilities are as follows: physical security; customer security; a greater number of
intelligent devices; power system lifetime; trust between traditional power equipment; differing
team backgrounds; and inherited IP-based vulnerabilities [2].

Cyber security becomes more critical as society becomes progressively dependent on computerized
systems for industry, medicine, finance, and so on. Cyber security can be improved by utilizing
intrusion detection to look for unseen attack patterns and actions. Intrusion detection also plays a
significant role in computer forensics to identify successful breaches. Accidents and undesired
conditions can also be detected using an intrusion detection strategy. Earlier cyber-physical
systems were considered safe, but the vast diversity of cyber-physical domains and the fact that
everything is connected to the internet have made them unsafe. Many attacks have been successfully
launched against cyber-physical systems, including many malicious attacks on power networks, and
security weaknesses exist in all kinds of cyber-physical systems. A lot of research is ongoing on
smart grid system implementation, but most of it does not focus on the security requirements of the
smart grid. Reliability is one of the important design issues in the development of the smart grid
system, and IDS has gained importance in this setting. Much of the literature measures performance
by the detection ratio of intrusion detection; none of it covers the latency and delay of intrusion
detection, even though some smart grid systems cannot afford a delay beyond a certain deadline.

Researchers also classify IDSs as (1) signature-based systems and (2) anomaly-based systems [17].
Signature-based systems use a repository of signatures of already identified malicious patterns to
detect an attack. They are efficient in detecting known attacks but fail in the case of unforeseen
attacks. In contrast, an anomaly-based IDS detects an intrusion when the activity of a user deviates
from its normal behaviour. Although these systems can detect zero-day attacks, they tend to generate
many false alerts, leading to a high false-positive rate (FPR).
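
The contrast between the two IDS families can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the signature strings, the request-rate profile, and the three-standard-deviation rule are invented and are not part of the proposed system.

```python
import statistics
from typing import Sequence

# Hypothetical signature repository: payload fragments of already known attacks.
KNOWN_SIGNATURES = {"' OR 1=1 --", "/etc/passwd"}


def signature_alert(payload: str) -> bool:
    """Flag a payload only if it contains an already known malicious pattern."""
    return any(sig in payload for sig in KNOWN_SIGNATURES)


def anomaly_alert(value: float, baseline: Sequence[float], threshold: float = 3.0) -> bool:
    """Flag a measurement that lies more than `threshold` sample standard
    deviations away from the mean of the normal profile (a z-score rule)."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)
    return abs(value - mean) > threshold * std


# Made-up normal profile: requests per minute observed for one smart meter.
normal_rates = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 10.0]

known_attack = signature_alert("GET /index.php?id=' OR 1=1 --")  # known pattern
novel_by_signature = signature_alert("GET /new-exploit")          # unseen pattern
novel_by_anomaly = anomaly_alert(250.0, normal_rates)             # large deviation
benign_burst = anomaly_alert(15.0, normal_rates)                  # harmless but unusual
```

In this toy setting the signature check misses the unseen pattern, while the anomaly rule catches the large deviation but also flags the harmless burst, which is the kind of false alert that raises the FPR.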

In the last decade, many researchers have proposed traditional machine learning and deep
learning-based IDSs that show excellent performance [5,12,25]. However, they also have several
limitations: 1) a lack of proper adaptivity towards novel attacks and changes in attack patterns,
and of the ability to identify them with high accuracy and a low false-positive rate (FPR); 2) a
need for frequent human intervention for training, which introduces more vulnerability and thereby
affects the model's performance; and 3) a low speed of adaptation due to the lack of
curiosity-based learning.

My research aims at dealing with all such issues through the use of Deep Reinforcement Learning
algorithms combined with intelligent flow analysis and smart experience replay powered by the newly
introduced generative autoencoders. I also propose the use of a Proximal Policy Optimised Actor
Critic based Intrusion Detection System for the same purpose. Our proposed IDSs can detect novel
attacks and adapt to changes in attack patterns in a smart grid environment with very little human
interaction. We evaluated our proposed system using ISOT-CID, a real-time cloud intrusion dataset,
and NSL-KDD, a very popular conventional network dataset. The results show that our proposed model
achieves the right balance between accuracy and false-positive rate.

Chapter 2

Related work based on literature study

In this chapter, I present the related work on network intrusion detection for cloud environments
using machine learning techniques.

Chiba et al. [5] proposed a NIDS framework for the cloud environment. Their framework uses a
cooperative and hybrid approach to detect network intrusions in the cloud. It uses SNORT (a
signature-based technique) at the front end and a back-propagation neural network (an anomaly-based
technique) at the back end in order to detect external and internal attacks respectively. The NIDSs
deployed on all processing servers work cooperatively to detect coordinated attacks by sharing
alerts with each other. However, the authors did not validate the efficacy of their model using any
network intrusion dataset.

Li et al. [12] proposed a scalable and distributed NIDS architecture for the cloud platform. The
proposed NIDS consisted of various nodes, each running a back-propagation based ANN. It was
evaluated with the KDD dataset on a physical cloud testbed. The experimental results show an average
detection rate of 99% and an average detection time of 37.1 seconds. The major limitation of this
work lies in the ANN, which takes a long time to train on large datasets. Also, the simulated
dataset does not represent a true cloud dataset.

K. Sethi et al. [21] presented a cloud NIDS using reinforcement learning. Their IDS can detect new
attacks in the cloud and is also adaptive to changes in attack patterns. Their model maintained a
balance between accuracy and FPR. The main problem in their work is that they validated their model
with a conventional network dataset (UNSW) instead of a cloud network dataset, so the evaluation
does not reflect a real cloud environment.

Kholidy et al. [10] created a new dataset called the cloud intrusion detection dataset (CIDD) for
cloud IDS evaluation. The dataset includes both knowledge-based and behavior-based audit data. To
build the dataset, the authors implemented a log analyzer and correlator system (LACS) that extracts
and correlates user audits from a set of log files from the DARPA dataset. The main issue with this
dataset is that it focuses on detecting masquerade attacks. Also, it does not consider network flows
involving the hypervisor. Moreover, the dataset is not publicly available.

In summary, state-of-the-art works do not apply Deep Reinforcement Learning to cloud intrusion
detection systems, though a few recent attempts exist for conventional network applications. Also,
the existing works do not use cloud-specific datasets and are therefore not capable of representing
a real cloud environment. Aldribi et al. [4] introduced the first publicly available cloud dataset,
called ISOT-CID, which was collected from a real cloud computing environment. The dataset consists
of a wide variety of traditional network attacks as well as cloud-specific attacks. The authors
discuss a hypervisor-based cloud IDS involving novel feature extraction, which obtains a best
accuracy of 95.95% with an FPR of 5.77% for the Phase-2 and hypervisor-B portion of the dataset. We
have used a CVAE-DDQN based model for detecting intrusions with very high accuracy and low FPR. Our
model also shows highly robust traits.

Boumkheld et al. [16] proposed an anomaly-based IDS for Advanced Metering Infrastructure (AMI) using
the AODV protocol to detect blackhole attacks. It achieved 100% TPR, 99% accuracy, and 66% precision
on simulated data. Faisal et al. [13] also proposed an anomaly-based IDS for AMI, using the MOA
software to detect DoS, R2L, U2R, and probing attacks. It achieved 94.67% accuracy and 3.31% FPR on
the KDD CUP 1999 and NSL-KDD datasets.
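
Since the works surveyed here report different combinations of accuracy, precision, TPR, FPR, and FNR, a small helper that derives all of them from confusion-matrix counts may make the comparison easier to follow. The function name `ids_metrics` and the counts below are hypothetical; the counts are not taken from any of the cited papers.

```python
def ids_metrics(tp, fp, tn, fn):
    """Return common IDS metrics from confusion-matrix counts.

    tp/fn count attacks that were detected/missed, fp/tn count normal
    samples that were flagged/passed. Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),   # detection rate (recall)
        "fpr": fp / (fp + tn),   # false alarms among normal traffic
        "fnr": fn / (tp + fn),   # missed attacks
    }


# Made-up counts: 20 attacks among 1000 flows, all caught, plus 10 false alerts.
rare_attack_case = ids_metrics(tp=20, fp=10, tn=970, fn=0)
```

With these invented counts the TPR is 100% and the accuracy is 99%, yet the precision is only about 67%, which shows how a high accuracy and a modest precision can coexist when attacks are rare.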

Goldberg et al. [7] designed an anomaly-based IDS for the SCADA component of the smart grid using
the Modbus protocol and software tools such as Wireshark, Pcapy, and Impacket to detect several
attacks. They achieved 100% precision, 0% FNR, 100% accuracy, and 0% FPR on a self-generated
real-world dataset. Feng et al. [28] also designed a hybrid IDS for the SCADA component of the smart
grid using the Profinet protocol and the Snort software tool to detect reconnaissance, protocol
anomalies, and DoS attacks.

Kwon et al. [26] targeted the smart grid substation using the MMS and IEC 61850 protocols and
software tools such as Wireshark to detect DoS, port scanning, portable executable, GOOSE, MMS, and
SNMP attacks. They achieved 100% precision, 1.1% FNR, 98.9% TPR, and 0% FPR on real data from a
substation in South Korea. The proposed IDS was specification-based. Yoo et al. [27] also designed
an anomaly-based IDS for the smart grid substation using the MMS and GOOSE protocols and software
tools such as the WEKA framework to detect several attacks. They achieved an average FPR of 3.5%,
also on real data from a substation.

Pan et al. [18] proposed a hybrid IDS for the synchrophasor component of the smart grid using the
Snort and OpenPDC software tools to detect single-line-to-ground faults, replay, command injection,
and disable relay attacks. They achieved 90.4% accuracy on a simulated dataset. Yang et al. [6] also
targeted the synchrophasor component of the smart grid using the IEEE C37.118 protocol and software
tools such as ITACA, Nmap, Metasploit, and hping to detect reconnaissance and DoS attacks. They
achieved 0% FPR. Their IDS was also specification-based.

Rupam Kr. Sharma et al. [4] analyzed various machine learning techniques for intrusion detection
systems using the KDD'99 dataset. High detection accuracy can be accomplished by using machine
learning techniques; however, poisoned learning in machine learning algorithms may introduce
weaknesses that cause misclassification of network data. Intrusion detection approaches in advanced
metering infrastructure (AMI) were explained in [5]. The proposed state-based approach calculates
security metrics using attack steps to achieve a high degree of confidence in intrusion detection in
AMI.

A two-tier intrusion detection framework for advanced metering infrastructure was suggested in [6].
This structure achieved a high detection rate with a low false alarm rate.

The intrusion tolerance techniques used in [7] improve the availability of the smart grid. The
proposed intrusion tolerance system was evaluated under DoS attacks, and the results were compared
with two existing intrusion tolerance systems.

Yong Wang et al. [8] proposed intrusion detection in SCADA to identify false data injection attacks;
their intrusion detection improves accuracy using a new graph model.

Several security objectives are required to assure the safety of the smart grid system. To ensure
confidentiality, the smart grid should be able to avoid exposure to unauthorized systems or
individuals. The smart grid enforces confidentiality by encrypting data while in transit and
restricting access to the places where these data are stored. A system breach occurs if user data is
revealed in any way. The smart grid should protect the channel between sensors, actuators, and
controllers [9]. Sensors, actuators, and controllers send and receive information in a smart grid
system. The smart grid system ensures integrity by preserving the information sent and received by
these devices and detecting changes to it. To ensure integrity, the smart grid system is required to
have the capability to identify any changes introduced in a message being transferred [10].
Availability is more critical in smart grid automation but less significant in smart metering
applications. The smart grid system aims to provide high-availability services by preventing denial
of service attacks, power outages, and hardware or system failures. Leon Wu et al. [11] described a
reliability framework for the smart grid system. The framework works in parallel to evaluate the
reliability of several stages and also provides concurrent feedback for the enhancement of safety.
Robert Mitchell et al. [12] developed a probability-based model to analyze the effect of intrusion
detection and response on the reliability of a cyber-physical system. Robustness defines the extent
to which a system is capable of working appropriately in the event of a disruption. Matthias Rungger
et al. [13] introduced dynamic stability systems based on bounded disturbance and sporadic
disturbance.

Trustworthiness describes the degree to which the system can be relied upon to achieve system tasks
appropriately under well-defined environmental circumstances in a specified period. Bjorn Stelte et
al. [14] proposed a malicious node detection and protection mechanism to assure the trustworthiness
of sensor data.

Chapter 3

Background

3.1 Double Deep Q Learning (DDQN)

In order to improve upon the architecture of the existing model, I had to learn about the detailed
working of Deep Q Learning and its variants. In this section, I present a high-level overview of the
various algorithms, starting from Q Learning, which led to the development of the Double Deep Q
Learning algorithm.

3.1.1 Q Learning

Q Learning is one of the most famous Reinforcement Learning algorithms. It uses a Q (Quality)
function to estimate reward values, which are used to provide the reinforcement. For any finite
Markov decision process (FMDP), Q-learning identifies an optimal action selection policy with the
objective of maximizing the expected value of the total reward that can be obtained in the
successive steps (provided that it is given infinite exploration time and a partly random
policy) [23]. It uses Temporal Difference (T.D.) learning for this purpose. The T.D. value can be
understood as the gap between the current Q estimate and an improved estimate built from the
observed reward and the expected future reward. If the T.D. value is very small, the agent has
understood the environment well and there is little scope for further improvement. Hence, the major
goal is to minimize the T.D. values. The equation for updating the Q value is as follows:

$$Q^{new}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)$$

where $s_t$ is the state at time $t$, $a_t$ is the action taken at time $t$, $r_t$ is the reward
obtained at time $t$, $\alpha$ is the learning rate, $\gamma$ is the discount factor,
$Q(s_t, a_t)$ is the old Q value, $\max_{a} Q(s_{t+1}, a)$ is the estimate of the optimal future
value, $r_t + \gamma \max_{a} Q(s_{t+1}, a)$ is the temporal difference target, and
$r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)$ is the temporal difference (T.D.) value.
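
As a minimal sketch of this update rule, the following Python snippet applies one tabular Q-learning step. The function name `q_update`, the 2x2 table, and the values of the learning rate and discount factor are assumptions chosen only for illustration.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the T.D. value."""
    td_target = r + gamma * np.max(Q[s_next])  # r_t + gamma * max_a Q(s_{t+1}, a)
    td_value = td_target - Q[s, a]             # gap to the old estimate Q(s_t, a_t)
    Q[s, a] += alpha * td_value                # move the old estimate towards the target
    return td_value


# Toy table with 2 states and 2 actions, initialised to zero.
Q = np.zeros((2, 2))
td = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Starting from an all-zero table, a reward of 1 produces a T.D. value of 1 and moves Q(0, 1) to 0.1, that is, one learning-rate-sized step towards the target.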

3.1.2 Deep Q Learning

A major limitation of Q-learning is that it works only in environments that have discrete and
finite state-action spaces. In order to extend Q-learning to richer environments (where storing the
full state-action table is often infeasible), we use Deep Neural Networks as function approximators
that can learn the value function by taking just the states as inputs. Deep Q Learning is one such
solution for applying the concept of Q Learning in more complex environments. It uses Deep Neural
Networks to estimate the Q values of all possible actions at a given state. The loss function is
generally modeled on the Temporal Difference equation of Q Learning (as the objective of reducing
the T.D. value holds true here as well).

$$\text{loss} = \left( r + \gamma \max_{a'} \hat{Q}(s', a') - Q(s, a) \right)^2$$

where $s$ is the state, $a$ is the action taken, $r$ is the reward obtained, $s'$ is the next
state, $\gamma$ is the discount factor, and $\hat{Q}(s', a')$ is the delayed Q function evaluated
at the next state. As we can see, the loss function is very similar to the temporal difference
function in the case of Q Learning.
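
A hedged NumPy sketch of this loss on a small batch is given below. It assumes that the Q values have already been produced by two networks (the current one and the delayed one) and that the delayed network is evaluated at the next state, in line with the Q-learning update; the function name `dqn_loss` and all numbers are illustrative, and terminal-state handling is left out.

```python
import numpy as np


def dqn_loss(q_online, q_delayed_next, actions, rewards, gamma=0.99):
    """Mean squared T.D. value over a batch.

    q_online:       (B, A) array, Q(s, .) from the current network
    q_delayed_next: (B, A) array, Q_hat(s', .) from the delayed network
    actions:        (B,) integer array of the actions actually taken
    rewards:        (B,) array of observed rewards
    """
    batch = np.arange(len(actions))
    td_target = rewards + gamma * q_delayed_next.max(axis=1)
    td_value = td_target - q_online[batch, actions]
    return float(np.mean(td_value ** 2))


# One-sample toy batch with two actions.
example_loss = dqn_loss(
    q_online=np.array([[1.0, 2.0]]),
    q_delayed_next=np.array([[0.5, 3.0]]),
    actions=np.array([0]),
    rewards=np.array([1.0]),
    gamma=0.5,
)
```

For this toy batch the target is 1 + 0.5 * 3 = 2.5, the T.D. value is 1.5, and the loss is 2.25. In a real implementation the gradient would flow only through `q_online`.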

Deep Q Learning was also tested against classic Atari 2600 games [15], where it outperformed other
Machine Learning methods in most of the games and performed at a level comparable with or superior
to a professional human games tester.

3.1.3 Double Deep Q Learning

However, in their paper [24], van Hasselt et al. explain the frequent overestimation problem found
in Deep Q Learning due to the inherent estimation errors of learning. Such overestimation errors are
also seen in the Q Learning algorithm and were first investigated by Thrun and Schwartz [23]. They
showed that if the action values contain random errors uniformly distributed in an interval
$[-\epsilon, \epsilon]$, then each target is overestimated by up to
$\gamma \epsilon \frac{m-1}{m+1}$, where $m$ is the number of actions. They also gave an example in
which such errors led to sub-optimal policies. Later, van Hasselt (2010) showed how noise from the
environment could lead to overestimations even when using a tabular representation. He proposed
Double Q-learning as a solution, in which action selection is decoupled from action evaluation,
which lessens the overestimation problem significantly. Later, van Hasselt et al. [24] proposed a
Double Deep Q Learning architecture that uses the existing architecture and deep neural network of
the DQN algorithm but finds better policies, thereby improving the performance. They use the target
network and the current network to replicate the decoupling that was proposed in the Double Q
learning procedure.

$$Y_t^{DoubleDQN} \Leftarrow r_t + \gamma \, Q\left(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_t), \theta_t^{-}\right)$$

where $Y_t^{DoubleDQN}$ is the temporal difference target at time $t$, $s_t$ is the state at time
$t$, $a_t$ is the action taken at time $t$, $r_t$ is the reward obtained at time $t$, $\gamma$ is
the discount factor, $\theta_t$ is the weight matrix of the current Q network at time $t$, and
$\theta_t^{-}$ is the weight matrix of the target Q network at time $t$.
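
The decoupling can be sketched in a few lines of NumPy by comparing the plain DQN target with the Double DQN target. The arrays stand in for the outputs of the current and target networks at the next state; the function names and values are assumptions for illustration only.

```python
import numpy as np


def dqn_target(rewards, q_next_target, gamma):
    """Plain DQN: the target network both selects and evaluates the action."""
    return rewards + gamma * q_next_target.max(axis=1)


def double_dqn_target(rewards, q_next_current, q_next_target, gamma):
    """Double DQN: the current network selects, the target network evaluates."""
    best_action = q_next_current.argmax(axis=1)  # selection with theta_t
    batch = np.arange(len(best_action))
    return rewards + gamma * q_next_target[batch, best_action]  # evaluation with theta_t^-


rewards = np.array([0.0])
q_next_current = np.array([[1.0, 2.0]])  # current network prefers action 1
q_next_target = np.array([[5.0, 3.0]])   # target network has an inflated value for action 0

y_dqn = dqn_target(rewards, q_next_target, gamma=1.0)
y_double = double_dqn_target(rewards, q_next_current, q_next_target, gamma=1.0)
```

In this toy case the plain DQN target picks up the inflated value 5.0, while the Double DQN target evaluates the action chosen by the current network and yields 3.0.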

They tested their model on six Atari games by running DQN and Double DQN with six different random
seeds. The results showed that overoptimistic estimation in Deep Q Learning was much more common
and severe than previously acknowledged. In some cases the overestimations were so high that a log
scale had to be used to show the comparison with the optimal policy. The results also showed that
Double Deep Q Learning gave state-of-the-art results on the Atari 2600 domain.
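
The source of this overoptimism can be reproduced with a short simulation. It assumes the simplest possible setting, in which all true action values are zero and each estimate carries independent uniform noise in $[-\epsilon, \epsilon]$; in that case the expected maximum of $m$ estimates is $\epsilon (m-1)/(m+1)$. The values of `m`, `eps`, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, n_trials = 4, 1.0, 200_000

# True action values are all zero; every estimate carries uniform noise.
noisy_q = rng.uniform(-eps, eps, size=(n_trials, m))

single_estimate_bias = noisy_q[:, 0].mean()      # one estimate alone is unbiased
max_estimate_bias = noisy_q.max(axis=1).mean()   # taking the max is biased upwards
expected_max_bias = eps * (m - 1) / (m + 1)      # closed form for this setting
```

The maximum over noisy estimates is systematically positive even though every true value is zero, which is the bias that decoupling action selection from action evaluation is meant to reduce.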

We have employed the "decoupling of action selection and action evaluation" concept to build our
Double Deep Q Learning-based model and evaluated it on a real-world cloud dataset (ISOT-CID) and the
conventional network-based NSL-KDD dataset. The evaluation suggests significant improvements
compared to the previously proposed DQN model and other simple classifiers, as discussed further.

3.2 Curiosity Driven Variational Autoencoder

VAE is a generative model that can learn unsupervised latent representations of complex
high-dimensional data [10]. The VAE model consists of two parts: an encoder $q_\phi(z|x)$ and a
decoder $p_\theta(x|z)$. The encoder takes the input sample $x$ and maps it to a point $z$ in the
latent space. Then $z$ is fed into the decoder to reconstruct the sample $x$. The main principle of
the VAE is to learn the marginal likelihood of a sample $x$ from a distribution that is parametrized
by generative factors $z$. The marginal log likelihood of a data point $x$ can take the following
form:

$$\log p_\theta(x) = \lambda(x; \theta, \phi) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big)$$


Since the true data likelihood is usually intractable, instead, the VAE optimizes an evidence lower

bound (ELBO) which is a valid lower bound of the true data log likelihood, denoted as:

$$\lambda(x; \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) \qquad (4)$$

$\lambda(x; \theta, \phi)$ consists of two terms: the first term can be regarded as the reconstruction loss, and the

second term measures how far the approximate posterior $q_\phi(z|x)$ is from the prior $p_\theta(z)$ via the

KL-divergence. In general, $q_\phi$ and $p_\theta$ are implemented via deep neural networks, and the prior $p_\theta(z)$ follows

the Gaussian distribution $N(0, 1)$.
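
The two terms of the evidence lower bound can be sketched numerically. The snippet below is a minimal illustration under two assumptions that are not stated in the text: the encoder outputs a diagonal Gaussian given by a mean and a log-variance per latent dimension, and the decoder is a unit-variance Gaussian, so the reconstruction term reduces to a negative half squared error up to an additive constant. The function names are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def elbo_estimate(x, x_reconstructed, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term.

    Assumption: unit-variance Gaussian decoder, so log p(x|z) equals
    -0.5 * squared error plus a constant that is dropped here.
    """
    reconstruction = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return reconstruction - kl_to_standard_normal(mu, log_var)


# A posterior equal to the prior contributes no KL penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Shifting the posterior mean away from zero is penalised.
print(kl_to_standard_normal([1.0], [0.0]))
```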

The CVAE model [8] uses the prediction error as an intrinsic reward to drive the agent to explore

sufficiently, which can improve the quality of the generated training samples. It involves

another encoder that predicts $r'_t$ and $s'_{t+1}$ given $s_t$ and $a_t$. An error term $e_t$ is then computed,

which is defined as follows:

$$e_t = D_{KL}\big((s_{t+1}, r_t) \,\|\, (s'_{t+1}, r'_t)\big)$$

This error term is considered as an intrinsic reward and is added to the overall reward estimate. Such

a curiosity-driven estimator is used to improve the efficiency of our exploration. The loss function in

this case can be approximated as follows:-

$$\lambda_{cvae} = e_t - D_{KL}\big(q_\phi(z_t|s_t) \,\|\, N(0, 1)\big)$$

3.3 Proximal Policy Optimisation and Actor Critic methods

3.3.1 Vanilla Policy Gradient

Due to the low intuitiveness and high variance found in Q-learning-based methods, simple policy

gradient methods (where a line search is conducted over the gradient of the loss function) are often preferred.

A simple implementation of such algorithms is the Vanilla Policy Gradient. It was found to have two

major limitations: 1) it suffered from high variance due to variance in the neural network estimates

of the value function, and 2) it used a line-search gradient descent strategy, which led to lower performance as

compared to trust-region-based methods.


3.3.2 Trust Region Policy Optimisation

In 2015, TRPO [19] introduced trust region strategies to RL in place of the line search strategy. TRPO

adds a KL divergence constraint to enforce a trust region during the optimisation process. It makes

sure that the updated policy is not far away from the old policy, that is, the new policy

lies within the trust region of the old policy, so a single policy update cannot deviate largely. However,

a major problem with this algorithm is its high computational complexity, arising from second-order

gradient calculations. The major equations are as follows:

 
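$$J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \hat{A}_{\theta_{old}}(s, a)\right]$$

subject to

$$\mathbb{E}_{s \sim \rho_{\pi_{\theta_{old}}}}\big[D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\big] \le \delta$$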

3.3.3 Proximal Policy Optimisation

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The

motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while

using only first-order optimization.

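Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$,

so $r_t(\theta_{old}) = 1$. TRPO maximizes a “surrogate” objective:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t\right] = \hat{\mathbb{E}}_t\big[r_t(\theta) \hat{A}_t\big]$$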

where CPI refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$

would lead to an excessively large policy update; hence, PPO modifies the objective to penalize

changes to the policy that move $r_t(\theta)$ away from 1:

$$J^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(r_t(\theta) \hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\big)\big]$$

where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The

first term inside the min is $L^{CPI}$. The second term, $\operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t$, modifies the surrogate

objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the

interval $[1 - \epsilon, 1 + \epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final


objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme,

we only ignore the change in probability ratio when it would make the objective improve, and we

include it when it makes the objective worse. See [20] for further details.
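
The effect of the clipping can be checked with a small sketch. It assumes that the probability ratios and advantage estimates for a batch are already available as lists of floats; the function name and the sample values are illustrative only.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Positive advantage: a ratio above 1 + epsilon earns no extra objective.
print(clipped_surrogate([1.5], [1.0]))
# Negative advantage: the unclipped (worse) value is kept, giving a pessimistic bound.
print(clipped_surrogate([1.5], [-1.0]))
```

The first call is capped at the clipped value, while the second keeps the unclipped value, which matches the lower-bound behaviour described above.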

3.3.4 Actor Critic Model

The Actor model performs the task of learning what action to take under a particular observed state

of the environment. The job of the Critic model is to learn to evaluate if the action taken by the

Actor led our environment to be in a better state or not and give its feedback to the Actor, hence

its name. It outputs a real number indicating a rating (Q-value) of the action taken in the previous

state. By comparing this rating obtained from the Critic, the Actor can compare its current policy

with a new policy and decide how it wants to improve itself to take better actions. See [11] for

further details.

3.4 Smart Grids

3.4.1 Description

Smart grids have the potential to deploy millions or even billions of intelligent components into

the electric grid that will communicate in more diverse ways than previously thought of (Kundur et al.,

2011). Besides, a smart grid involves real-time transfer of a constant stream of information and has numerous

entry points. A smart grid is required to create and allow access to a large amount of client energy usage data

(CEUD) (Han and Xiao, 2016a).

The primary objective of the smart grid is energy efficiency. Energy from traditional sources is

generated as alternating current (AC), which cannot be stored and is therefore normally

transmitted and used right away. Accordingly, smart meters have been developed and integrated into the

network to strengthen client energy awareness by enabling the client to monitor and adjust usage, which

decreases energy expenses and in turn helps manage the available energy efficiently.

Smart grid consists of a generation system, transmission system, distribution system, and control

and data centre. The generation system is sometimes distributed and is often operated as part of a

microgrid.

The control and data centres perform advanced control methods such as distributed automation

in real time through two-way communication with the substation. Furthermore, a smart grid system


contains intelligent grid systems that are intended to help implement self-healing capabilities into the

grid. The rationale behind self-healing is to detect problems in the grid early and address them

as soon as possible without human intervention. This makes smart grid resilient to attacks and as a

consequence increases availability and enhances reliability. Self-healing is needed for smart grid to be

able to redirect and adjust the flow of electricity through alternative paths if there is an interruption,

a task that can only be achieved through constant self-assessment of the state of the

power system (Baumeister, 2010). Distributed generation within smart grid allows for the integration

of green energy sources such as solar or wind energy at various points in the grid enabling customers

to generate their power and also send excess energy back to the grid. Plug-in electric vehicles can

connect to the grid and charge or in some cases supply charge to the grid as deemed necessary (Han

and Xiao, 2015, 2016a, 2016d). Smart meters gather information on customer usage which is later

used to monitor efficiently and adapt to supply needs by anticipating peak usage.

There are different technologies that play a vital role at the distribution level. A Smart Grid should

utilize these technologies in order to move the distribution system forward [10, 12]. These include:

Advanced digital meters, Distribution automation, Low-cost communication systems, Distributed

energy resources, Broadband communications for distribution applications, Closed loop systems using

advanced protection, Distributed storage and generation, Real-time angle and voltage stability and

collapse detection, Reactive power control based on intelligent coordination controls, and Fault analysis

and reconfiguration schemes based on intelligent switching operations. A true Smart Grid will not

utilize these technologies as separate issues. Instead, it will integrate them in order to maximize the benefits.

3.4.2 Goals

In order to design a Smart Grid, certain goals should be taken into account, such as: ensure observability;

create controllability of assets; enhance power system performance and security; and reduce costs of

operations, maintenance, and system planning.

3.5 Benefits

If these goals are achieved, many benefits can be harvested such as: Improved system performance

meters, Better customer satisfaction, Improved ability to supply information for rate cases; visibility

of utility operation / asset management, Availability of data for strategic planning, as well as better

support for digital summary, More reliable and economic delivery of power enhanced by information

flow and secure communication, Life cycle management, cost containment, and end-to-end power


Figure 3.1: Benefits of smart grid


delivery, and Impact access to historical data for strategic planning.

3.6 Types of attacks

Attack taxonomy includes traditional cyber-attacks and specific attacks that can be accomplished

across the smart grid system domain. A novice or careless adversary can enter the network and

directly interrupt sensitive processes to cause a disaster. On the other hand, a more sophisticated

attacker may launch a distributed attack without disrupting normal system operation [13]. This

type of attack is more difficult to detect.

Table 3.1: Security threats for smart grids

Eavesdropping                  Masquerade
EM/RF Interception             Repudiation
Media Scavenging               Bypassing Controls
Indiscretions by Personnel     Theft
Intercept/Alter                Authorization Violation
Replay                         Physical Intrusion
Virus/Worms                    Service Spoofing
Trojan Horse                   Man in the Middle
Trapdoor                       Integrity Violations
Cheating Customer              Resource Exhaustion


3.7 IDS specific to Smart Grids

The smart grid network is composed of a Home Area Network (HAN), a Neighborhood Area Network (NAN),

and Wide Area Network (WAN).

Figure 3.2: Layer architecture of smart grid.

3.7.1 Intrusion Detection in Home Area Network

The home area network is the first layer of the smart grid and comprises the service component and the

metering component. The service module provides energy consumption and cost information. The metering

module provides the consumer's home energy consumption. There will be one IDS for every HAN. The IDS will

track inbound and outbound communication of the home area network to identify security breaches. When a

security breach occurs, the IDS will notify the home area network and send this information to the central

operational network administrator of the smart grid for further processing and necessary action.

3.7.2 Intrusion Detection in Neighborhood Area Network

The neighborhood area network (NAN) is the second layer of the smart grid. The neighborhood area

network is a large network, which collects service and metering information of multiple HANs that

are geographically adjacent to each other. The neighborhood area network consists of the smart meter

data collector (SMDC) and a central access controller (CAC). There will be one IDS for every NAN.

The CAC acts as an interface between the HANs and energy supplier communication. The SMDC controls all

metering records for all the HANs in the neighborhood area network. All incoming and outgoing data


will be passed through the neighborhood area network IDS to check for possible security threats. When a

security breach occurs, the IDS will send a notification to the central operational network administrator of

the smart grid for further handling and necessary action.

3.7.3 Intrusion Detection in Wide Area Network

The wide area network provides wireless and wired communication among distributed network devices,

substations, and NANs. This layer also contains the SCADA controller and the energy distribution system (EDS).

There will be many IDS sensors at various locations in the wide area network which are part of the smart grid

but not included in NANs. The EDS regulates metering data and energy distribution. SCADA provides control

to manage distribution grid elements. Data collected from IDS sensors in the WAN will be correlated to detect

possible malicious activity or violation of any security policy. The network administrator of the smart

grid will be informed accordingly for additional processing and necessary action.

3.7.4 Deploying IDS in the Smart Grid

Our proposed framework includes several IDS sensors and a central IDS management unit. Each HAN

would be protected by a separate IDS device. This IDS is configured by the smart grid operator and

is responsible for detecting attacks in one HAN. Similarly, every NAN should be secured by a separate

IDS. The NAN IDS can detect any malicious activity in one NAN for all incoming and outgoing network

traffic. WAN contains many IDS sensors that audit communication between the energy generation,

transmission, distribution, and SCADA controller to maintain security in a smart grid environment.

IDS sensors in HANs, NANs, and WAN send alerts to a central operational IDS management system.

The central IDS management system is responsible for further processing of these alerts and provides

functionality for identifying malicious activity in the smart grid.

Chapter 4

Dataset Description

4.1 NSLKDD Dataset

The description of NSL-KDD dataset is given as follows. Each record of the dataset [47,48] consists of

41 features divided into four different categories having basic, content, traffic and host features. Each

record in the dataset is labelled as normal or a specific class of attack. The training dataset consists

of 23 traffic classes that include 22 attack classes and one normal class. The test dataset includes 38

attack classes out of which 16 are novel attack classes and one normal class [23]. The attack classes

are divided into four types, namely DoS, Probe, U2R, and R2L. The distribution of labelled data in

the training and testing datasets in NSL-KDD is shown in Figure 4.1.

Figure 4.1: Distribution of labelled data in the training and testing dataset in NSLKDD


4.2 ISOT-CID dataset

4.2.1 Overview

To evaluate our model we have used the ISOT Cloud Intrusion Dataset (ISOT-CID), which is the first

publicly available cloud-specific dataset [1]. The dataset was collected over the cloud infrastructure of

the ‘Compute Canada’ cloud service provider, which provides its services for supporting the

computational needs of researchers [4]. The data was collected at various cloud layers of the OpenStack-based

production environment, including the hypervisor layer, the guest hosts layer and the network layer. For our

purpose, we have used only the network traffic data portion of the ISOT-CID. The dataset consists

of data obtained from a variety of sources including network traffic, system logs, CPU performance,

memory dumps, system call traces, etc. The dataset was collected in two phases, of which we have

considered Phase 2. Aldribi et al. (2018) [29] describe the cloud platform, the data

collection procedures, and the Phase 1 dataset. Data collection in both phases was made on the same

cloud environment and the same collection procedures were followed. However, the second phase of

collection occurred more recently and covered a wider variety of newer attack vectors.

4.2.2 Cloud environment and Data collection outline

The ISOT-CID collection environment contained three hypervisor nodes (A, B, and C). The cloud

environment also consisted of 10 instances (VM1 to VM10), which were launched in three cloud zones

named A, B, and C. Five instances (VM2, VM3, VM4, VM5

and VM6) were launched in zone A, four instances (VM7, VM8, VM9 and VM10) were launched in

zone B, and one instance (VM1) was launched in zone C. Figure 4.2 shows the structure of the cloud environment.

The data was collected in the cloud for several days with time slots of 1–2 hours per day. Data

was collected with the help of various collector agents, which were classified and integrated into the

three cloud layers: VM or instance-based agents, hypervisor-based agents and network-based agents.

The data collected through these collectors were forwarded to the ISOT lab log server for storage

and analysis. The dataset contains both normal and malicious activities. The malicious data consists

primarily of attacks executed. However, there were some unsolicited malicious sessions that were

undertaken by unknown external hackers, which were identified by the tiger team based on the source

IP addresses and attack timing. A wide variety of attack scenarios were covered, including simultaneous

attack scenarios, coordinated attack scenarios, etc. Moreover, a wide variety of distinct geographical

locations in Europe, North America, and Asia were used for launching the attacks. The normal

data collected was also quite varied and complex, involving 160 legitimate visitors, ranging from data


Figure 4.2: Cloud Environment Architecture used for ISOT Data collection
involved in maintaining the status of VMs, rebooting, updating, creating files, SSHing to the machine,

etc. The types of attacks are listed in Table 4.1.

Table 4.1: ISOT dataset attack type distribution

Insider attacks                          Outsider attacks

Trojan Horse                             Unclassified (unsolicited traffic)
Backdoor (reverse shell)                 DNS amplification DoS
Unauthorized Crypto-mining               Ports and Network scanning
UDP Flood DoS                            Dictionary/Brute Force login attack
Stepping Stone Attack                    HTTP Flood DoS
Ports and Network scanning               Directory/Path Traversal
Synflood DoS                             Dictionary/Brute Force login attack
Revealing Users and Confidential Data    Fuzzers
Dictionary/Brute Force login attack      Synflood DoS

4.2.3 Attack categories

The ISOT-CID malicious activities were divided into outside and inside attacks, based on whether they

were performed by outsiders or insiders respectively. The outside attacks comprised those that were

made from the outside world (by the tiger team or the unsolicited activities). The inside malicious

activities were perpetrated by either an insider within the cloud environment who had high privileges

on the hypervisor nodes or by a compromised VM within the cloud environment that was later used as a

stepping stone for attacking other instances in the cloud or the outside world. Some of the inside

attacks were network scanning, password cracking, backdoor and Trojan horse, DoS attacks, etc. [29]


Figure 4.3: File Structure of ISOT CID


4.2.4 Network Traffic Data

The entire ISOT dataset, of size 8TB, included 55.2GB of network traffic data. The network

traffic data was composed of three levels of network communication:

• external traffic - traffic between the instances

• internal traffic or hypervisor traffic - traffic between the hypervisor nodes

• local traffic - traffic between two VMs on the same hypervisor node.

The collected network traffic data was stored in packet capture (pcap) format and made available

for public use. In Phase 1, a total of 22,372,418 packets were captured, out of which 15,649 (0.07%)

were malicious, whereas in Phase 2, a total of 11,509,254 packets were captured, out of which

2,006,382 (17.43%) were malicious. The data collected were organised into folders based on the date

and hypervisor on which the data collection took place.
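
The malicious-packet percentages quoted above follow directly from the packet counts. A short check, using only the counts given in the text (the helper name is hypothetical):

```python
def malicious_share(malicious_packets, total_packets):
    """Percentage of captured packets that were labelled malicious."""
    return 100.0 * malicious_packets / total_packets


phase1 = malicious_share(15_649, 22_372_418)
phase2 = malicious_share(2_006_382, 11_509_254)
print(f"Phase 1: {phase1:.2f}%  Phase 2: {phase2:.2f}%")
```

The much larger malicious share in Phase 2 is one practical reason a model trained on that phase sees far more attack traffic than one trained on Phase 1.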

4.2.5 File Structure

The Logs directory contains two main directories, named based on the two collection phases, Phase

1 and Phase 2. Under each directory, there are 3-5 levels of subdirectories and files. The fourth level

of directories contains the dataset collected from both hypervisors and VMs stored under each attack

day separately. The fifth level of directories contains different directories for data types and sources

and named accordingly. The directory structure can be seen in Figure 4.3.


4.3 Preprocessing

4.3.1 Tranalyzer

Because packet payload processing involves a huge amount and rate of data that has to be

processed, flow-based analysis for intrusion detection is considered better for high-speed networks

owing to its lower processing load [9]. However, the problem with such flow-based analysis is its high false

alarm rate [3]. To obtain flow-based data from packet-based data, we have used an open-source tool

called Tranalyzer, which is a lightweight flow generator and packet analyzer designed for practitioners

and researchers [2]. It was used to process the pcap files, and files containing multiple JSON objects were

obtained as output. With the help of Tranalyzer, we were able to get about 1.8GB of flow-based

JSON output from about 32.2GB of packet-based input data in pcap format.

4.3.2 Processing of Tranalyzer output

All 37 JSON object files output by Tranalyzer were parsed. Each JSON object had a different

number of fields, depending on properties peculiar to the given flow; such extra fields were

removed while parsing through the JSON objects. A list of dictionaries corresponding to the JSON

objects was obtained and used to create the required CSV file. This CSV file was further processed

to deal with lists of numbers and lists of strings: each list of numbers was replaced by the average

of its items, and each list of strings was replaced by its first string. All the strings and hexadecimal

values (representing particular characteristics of the flow) were then one-hot encoded. The remaining

values that were not integers or floating point numbers were converted to ‘NaN’ values. Chi-square

feature selection was then performed. Finally, the rows and columns consisting mostly of ‘NaN’ values

were removed from the dataset, and the remaining ‘NaN’ values were replaced either with the mean

of all other values in the column or with zeroes, based on the characteristics of the corresponding

feature. We labelled the dataset based on the list of malicious IP addresses that was provided

along with the ISOT-CID documentation [3]. There were 272419 non-attack tuples and 9883

attack tuples. The data frame object was then converted to a NumPy array to be used by our

models. After preprocessing we found that our dataset was highly skewed, i.e., the number of

non-attack samples was much higher than the number of attack samples. Hence, to prevent

biased learning, we selected a portion of the dataset with a more balanced distribution, having

9883 attack samples and 14824 normal samples. Table 4.2 shows the distribution of the dataset in the

training and testing phases.
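
Some of the cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names in the sample flow are hypothetical and do not come from the Tranalyzer output, one-hot encoding and feature selection are omitted, and the choice between mean and zero imputation is passed in by the caller.

```python
import math


def collapse_lists(value):
    """Replace a list of numbers by its average and a list of strings by its first item."""
    if isinstance(value, list) and value:
        if all(isinstance(item, (int, float)) for item in value):
            return sum(value) / len(value)
        return value[0]
    return value


def to_number(value):
    """Return the value as a float, or NaN when it is not an integer or a float."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return math.nan
    return float(value)


def impute_column(column, strategy="mean"):
    """Fill NaN entries with the mean of the other values, or with zero."""
    present = [v for v in column if not math.isnan(v)]
    fill = sum(present) / len(present) if (strategy == "mean" and present) else 0.0
    return [fill if math.isnan(v) else v for v in column]


# Hypothetical flow record with list-valued fields.
flow = {"pkt_sizes": [40, 60], "flags": ["0x02", "0x10"], "duration": 1.5}
print({key: collapse_lists(value) for key, value in flow.items()})
print(impute_column([1.0, math.nan, 3.0]))
print(impute_column([1.0, math.nan, 3.0], strategy="zero"))
```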


Table 4.2: ISOT dataset distribution

Dataset Total Normal Attack


Training 17296 10377 6919
Testing 7411 4447 2964

Chapter 5

Proposed Intrusion Detection

System

In this section, we present our proposed intrusion detection system using deep reinforcement learn-

ing. Before presenting the components of our system, we introduce dataset and essential processing

elements related to our system.

5.1 Our Proposed Model

This section discusses the deployment architecture of our proposed cloud IDS. It broadly includes four

sub-components: the host network, agent network, administrator network, and experience replay unit.

Figure 5.1 presents the DRL-based cloud IDS architecture. The host network contains the running

VMs, hypervisors, and host machines. The agent network connects to the host network via Virtual

Private Network (VPN) taps that protect it from being compromised by an external attacker.

Also, the intruder can alter the system call traces to appear as normal so that the detection system

fails to identify the intrusion. Hence, the VPN communication should be quick to share all related

information before the attacker makes any manipulation.

The agent network obtains network packet information via VPN from the host network and performs

mainly flow-based analysis and the necessary preprocessing on these data to extract feature vectors. The

major preprocessing and flow analysis steps have already been explained in Section 4.3.

State: State in RL describes the input by environment to agent for taking actions. In our case the


Figure 5.1: Cloud IDS Deployment Architecture


state is described by the flow based features of the network flow at a given point of time.

Action: An action refers to the agent’s decision after monitoring the state of the cloud system

during a given time window. It applies a suitable policy on the output Q-values of the current deep

Q-network and obtains the agent result (refer Fig 5.2).

Algorithm 1: CVAE DDQN Logic

Initialise replay memory D with capacity N, generated replay memory D_g with capacity N_g, minibatch size M,
    and proportion factor G;
Initialise the action-value function and the target value function with weights θ and θ⁻;
for episode = 1 to I do
    Observe state s_0;
    for t = 1 to I do
        Choose an action a_t based on the ε-greedy policy;
        Observe transition (s_t, a_t, r_t, s_{t+1});
        Store transition (s_t, a_t, r_t, s_{t+1}) in D;
        Sample a random minibatch of transitions (s_t, a_t, r_t, s_{t+1}) from D;
        Generate transition (s_t, a_t, r'_t, s'_{t+1});
        Compute the prediction error e_t;
        Store transition (s_t, a_t, r'_t + β e_t, s'_{t+1}) in D_g;
        Randomly sample M × (1 − G) transitions (s_j, a_j, r_j, s_{j+1}) from D;
        Randomly sample M × G transitions (s_j, a_j, r_j, s_{j+1}) from D_g;
        if episode terminates at step j+1 then
            y_j ← r_j;
        else
            y_j ← r_j + γ Q(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ⁻);
        end
        Perform gradient descent on (y_j − Q(s_j, a_j; θ))² w.r.t. network parameters θ;
        Every C steps set θ⁻ ← θ;
    end
end
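
The mixed sampling step of Algorithm 1 can be sketched as follows. The sketch follows the proportions used in the algorithm (a share G of each minibatch from the generated memory and the rest from the real memory). The function names, the value of β and the tuple layout are illustrative assumptions rather than the thesis implementation.

```python
import random


def curiosity_reward(predicted_reward, prediction_error, beta=0.1):
    """Reward stored with a generated transition: r' + beta * e_t."""
    return predicted_reward + beta * prediction_error


def sample_mixed_minibatch(real_memory, generated_memory, batch_size, g, rng=random):
    """Draw about batch_size * g generated transitions and fill the rest with real ones."""
    n_generated = min(int(round(batch_size * g)), len(generated_memory))
    n_real = min(batch_size - n_generated, len(real_memory))
    batch = rng.sample(real_memory, n_real) + rng.sample(generated_memory, n_generated)
    rng.shuffle(batch)
    return batch


# Toy memories whose first element marks the origin of each transition.
real = [("real", i) for i in range(100)]
generated = [("generated", i) for i in range(50)]
batch = sample_mixed_minibatch(real, generated, batch_size=32, g=0.25, rng=random.Random(0))
print(len(batch), sum(1 for origin, _ in batch if origin == "generated"))
```

Capping each draw by the current memory size keeps the sketch usable early in training, when the generated memory may still hold fewer tuples than requested.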

Reward: A reward indicates the feedback from the environment about the action by an agent. In

our case the reward is calculated by subtracting the Q2 value from the Q1 value and it is multiplied

by -1 if the policy based decision deviates from the ground truth.


Figure 5.2: Pictorial representation for agent result calculation using DDQN and prioritised learning.
CP1 , CP2 ,...CPk represents the classifier prediction in form of 1(for attack) and 0(for normal) for
classifier C1 , C2 ,...Ck respectively

DDQN architecture: In DDQN, the major objective is to handle the overestimation of action

values that takes place in DQN. One of the major principles behind the DDQN is the decoupling of

action selection and action valuation components. This is achieved in our case by using two different

Neural Networks. One of them implements the current Q function while the other implements the

target Q function. Here, back propagation takes place in current Q Neural Network and its weights are

copied into the target Q Neural Network with delayed synchronization (copying is done after regular

intervals of a fixed number of epochs). In experimentation we have used epoch interval as 32 as it

was found to give optimal results in most of the cases. The actions (raising appropriate alarms) are

taken as per the current Q function but the current Qnew values are estimated using the target Q

function. This is done to avoid the moving target effect when doing gradient descent. A similar

approach has been used by Manuel Lopez-Martin, Belen Carro, Antonio Sanchez-Esguevillas in their

work [14]. This method of delayed synchronization between two Neural Networks ensures the required

decoupling and thereby handles the moving target effect of DQN.

Here, we present our algorithms for intrusion detection. Algorithm 1 shows the working of the

CVAE along with DDQN in each target network update cycle. The function of the administrator


Algorithm 2: Administrator Network Logic

Get agent_result and feature_vector from the agents;
for each agent do
    p = number of bits set in agent_result;
    k = number of bits not set in agent_result;
    if p ≤ k then
        status ← "normal";
    else
        status ← "attack";
    end
    if status == "attack" then
        Pre-process feature_vector for attack classification;
        Input the processed feature_vector to the classifier;
        attack_type = output of classifier;
    else
        attack_type = "normal";
    end
    Get the actual result from the environment;
    Send the actual result to the agent for use in calculation of rewards;
end

network is shown in Algorithm 2. It uses a voting system to identify the presence or absence of an

attack.
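
The voting rule of Algorithm 2 can be written as a minimal sketch. It assumes that the agent result is a list of 0/1 classifier predictions and that the attack classifier is any callable supplied by the caller; both the function names and the stand-in classifier are hypothetical.

```python
def vote(agent_result):
    """Majority vote over classifier bits: ties and minorities count as normal."""
    set_bits = sum(1 for bit in agent_result if bit == 1)
    unset_bits = len(agent_result) - set_bits
    return "normal" if set_bits <= unset_bits else "attack"


def administrator_step(agent_result, feature_vector, classify_attack):
    """Return (status, attack_type) for the result reported by one agent."""
    status = vote(agent_result)
    if status == "attack":
        return status, classify_attack(feature_vector)
    return status, "normal"


# Stand-in classifier that always reports the same attack type.
print(administrator_step([1, 1, 0], [0.3, 0.7], lambda features: "DoS"))
print(administrator_step([1, 0], [0.3, 0.7], lambda features: "DoS"))
```

Note that a tie is resolved as "normal", because the algorithm raises an attack only when the set bits strictly outnumber the unset bits.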

Functional task of Cloud Administrator: The cloud administrator runs Algorithm 2, in which it

constantly monitors the activities of the cloud system and detects its state. On receiving agent results, it

checks for intrusion and accordingly shares the actual result with the agent. It also identifies the location

of the intrusion, including entry doors and target VMs.

Interplay between the components: The Agent Network gets its input from the Host Network. It

conducts flow-based analysis and preprocessing before feeding the input to the DDQN model. The DDQN

model predicts the output. The actual result is obtained from the administrator network. Based on the

result, the reward is calculated. The input state, along with the action, reward and output state, is stored

in an experience pool. The CVAE generates tuples that get an additional intrinsic curiosity reward. Normal

experience replay tuples are considered with probability G and the generated experience tuples are considered

with probability 1-G. These tuples are then used to retrain the DDQN model.


5.2 Architecture Modifications for PPO Powered Actor Critic

Model

Figure 5.3: Pictorial representation of PPO based Actor Critic Model deployment

As can be seen in Figure 5.3, the architecture used for the DDQN CVAE algorithm can be used for the PPO AC

algorithm with minor modifications. We would no longer need the experience replay pool or the CVAE

data pool. The resulting architecture is much simpler as a consequence. As can be seen in Figure 5.4, we use

two neural networks in this case as well. However, unlike in DDQN CVAE, here back propagation

happens in both the neural networks, one for the actor and the other for the critic. The output of the actor model

helps us decide the action to take, i.e., whether to raise an alarm or not. The critic helps find the value of


a given state and its output is used in the Generalised Advantage Estimation (GAE) algorithm. As can be seen

in Algorithm 3, we train in episodes of 128 tuples. We get the output for 128 tuples first. We then calculate

the advantage values using the GAE algorithm. These values help us

calculate the PPO loss used for training the actor model. The critic model is trained using the Mean

Squared Error loss function. Although we have trained the actor and critic models sequentially, they

can be trained in parallel as well. We can also run the GAE algorithm in a pipelined fashion,

promoting further parallelism.

Figure 5.4: Pictorial representation of PPO based Actor Critic Model


Algorithm 3: Proximal Policy Optimised Actor Critic Model Logic

Initialise the actor NN and critic NN functions with weights θ and θ_C;
for episode = 1 to I do
    save policy π_new;
    for i = 1 to 128 do
        observe state s_i;
        feed the input vector to the actor and critic networks;
        get the actor result a_i using the epsilon-greedy policy;
        get the critic result v_i;
        get the new state s_{i+1};
        get the reward r_i;
        store the transition (s_i, a_i, v_i, r_i, s_{i+1});
    end
    get the critic result v_129 for the final state s_129;
    initialise gae_129 = 0;
    for i = 128 down to 1 do
        δ_i = r_i + γ ∗ v_{i+1} − v_i;
        gae_i = δ_i + γ ∗ λ ∗ gae_{i+1};
    end
    ratio = π_new / π_old;
    get the loss for each tuple;
    actor loss = −min(ratio ∗ gae_i, clip(ratio, 1 − ε, 1 + ε) ∗ gae_i);
    get the critic loss;
    store policy π_old;
    perform a gradient update on the actor and critic based on the losses;
end
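The actor-loss step of Algorithm 3 can be illustrated with the short sketch below. It is a hedged example, not the thesis implementation: the function name `ppo_actor_loss`, the use of log-probabilities to form the ratio, and the default `clip_eps=0.2` are assumptions. The clipped surrogate is negated and averaged over the tuples so that minimising the returned value maximises the surrogate objective.

```python
import math


def ppo_actor_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate, returned as a loss to be minimised.

    new_log_probs / old_log_probs: log-probability of the taken action
    under the current and the previous policy, one entry per tuple.
    advantages: GAE values for the same tuples.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# With identical policies the ratio is 1 and the loss is minus the mean advantage.
loss = ppo_actor_loss([-0.7, -0.7], [-0.7, -0.7], [2.0, 4.0])
```

The clipping bounds how far a single update can move the policy away from the one that collected the data, which is the reason PPO can reuse an episode for a gradient step without a separate trust-region solver.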

Chapter 6

Experimental results and discussion

We implemented our models in Python and evaluated their performance on the ISOT-CID and NSL-KDD
datasets. The experimental results include four standard machine learning performance metrics,
i.e., FPR (False Positive Rate), TPR (True Positive Rate), ACC (Accuracy), and AUC (Area Under the
ROC Curve). Although CVAE DDQN is an online learning model, we held out a part of each dataset
for testing. For preprocessing, we created lists of JSON objects and later converted them into CSV
files (in the case of the ISOT dataset). The flow-based analysis was done before running CVAE-DDQN,
but it can be done in parallel with the CVAE-DDQN model. Similarly, we ran the CVAE in sequence
with the DDQN algorithm, but the two can, and in a real-world deployment would, run in parallel. Similar
parallelism is possible with the PPO Actor Critic model as well. Upon testing with the datasets,
we obtain the following results:


Table 6.1: Model Performance on ISOT-CID and NSL-KDD Datasets

Model                Dataset    Accuracy (%)   FPR (%)   AUC
CVAE DDQN            ISOT-CID   98.16          1.56      0.896
CVAE DDQN            NSL-KDD    89.20          1.77      0.8812
PPO A-C              ISOT-CID   97.44          1.16      0.868
PPO A-C              NSL-KDD    87.08          1.64      0.872
DDQN [22]            ISOT-CID   96.87          1.57      0.886
DDQN [22]            NSL-KDD    83.40          1.48      0.8432
Anomaly Based [13]   NSL-KDD    82.1           -         -
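For reference, the reported metrics can be computed from binary labels as in the sketch below. This is an illustration under assumptions, not the evaluation code used for the tables: the function names are hypothetical, labels are assumed to be 1 for attack and 0 for normal traffic, and the AUC is computed with the pairwise-ranking definition, which is simple but quadratic in the number of samples.

```python
def detection_metrics(y_true, y_pred):
    """Return ACC, TPR and FPR for binary labels (1 = attack, 0 = normal)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total if total else 0.0,
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,
    }


def roc_auc(y_true, scores):
    """AUC as the fraction of (attack, normal) pairs ranked correctly.

    Ties between an attack score and a normal score count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


metrics = detection_metrics([1, 1, 0, 0], [1, 0, 1, 0])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

Multiplying ACC and FPR by 100 gives percentages of the kind listed in the table.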

6.1 Performance of model on continuously changing attack types

The ISOT dataset collects logs over a span of eighteen days, where each day has new attack types. However,
the majority of the volume belongs to the first six days, and each of these days has a new attack type.
To understand how well the models adapt to novel attacks, we ran an experiment
in which the model constantly faces new attacks and recorded its accuracy, FPR and AUC (refer to
Tables 6.2 and 6.3). The performance for the i-th day is obtained by training the model on the data
from day 1 to day (i-1) and evaluating it on the data of the i-th day. This is similar to a
real-world situation in which the model faces novel attacks on each passing day and its predictions
depend on what it has learned from the past. As can be seen from Tables 6.2 and 6.3, our models perform fairly well even when
they are trained for only a few days and tested on unknown attack types. The largely consistent improvement in
accuracy, false positive rate and area under the curve on subsequent days suggests high
adaptability and robustness in long-term use.
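The day-wise protocol described above can be sketched as follows. This is an illustration under assumptions, not the experimental code: `daywise_evaluation`, `train_majority` and `accuracy` are hypothetical names, the model is retrained from scratch on the accumulated history for each day, and a trivial majority-label predictor stands in for the DRL models.

```python
from collections import Counter


def daywise_evaluation(days, train_fn, eval_fn):
    """Train on days 1..i-1 and evaluate on day i, for every i >= 2.

    days: list of per-day sample lists in chronological order.
    Returns a dict mapping the (1-based) day number to its score.
    """
    results = {}
    for i in range(1, len(days)):
        history = [sample for day in days[:i] for sample in day]
        model = train_fn(history)
        results[i + 1] = eval_fn(model, days[i])
    return results


def train_majority(samples):
    """Toy stand-in for training: remember the most common label."""
    return Counter(label for _, label in samples).most_common(1)[0][0]


def accuracy(model, samples):
    """Fraction of samples whose label equals the toy model's prediction."""
    return sum(1 for _, label in samples if label == model) / len(samples)


# Hypothetical (features, label) samples grouped by day.
days = [
    [("f1", 0), ("f2", 0), ("f3", 1)],
    [("f4", 0), ("f5", 1)],
    [("f6", 1), ("f7", 1)],
]
scores = daywise_evaluation(days, train_majority, accuracy)
```

Day 1 has no score because there is no earlier data to train on, which matches the empty first row of the day-wise tables.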


Table 6.2: Performance of DDQN CVAE model on daily changing attack type

Day   Attack Type               No. of samples   ACC      FPR     AUC
1     DTA and UCM               24622            -        -       -
2     NS                        66124            83.11%   3.31%   0.8421
3     SQLI, CSS, PT, S-DOS      36517            90.16%   2.24%   0.8801
4     BFLA (failed)             43489            93.11%   2.32%   0.9301
5     UCM, DNSADOS, HTTPFDOS    48716            94.10%   2.00%   0.9448

ACC: Accuracy; DTA: Dictionary Traversal Attack; UCM: Unauthorized Crypto-mining; NS: Network Scanning;
SQLI: SQL Injection; CSS: Cross-site Scripting (XSS); PT: Path Traversal; S-DOS: Slowloris DOS;
BFLA: Brute Force Login Attack; DNSADOS: DNS Amplification DOS; HTTPFDOS: HTTP Flood DOS

Table 6.3: Performance of PPO Actor Critic model on daily changing attack type

Day   Attack Type               No. of samples   ACC      FPR     AUC
1     DTA and UCM               24622            -        -       -
2     NS                        66124            81.24%   2.80%   0.8412
3     SQLI, CSS, PT, S-DOS      36517            89.99%   2.05%   0.8774
4     BFLA (failed)             43489            92.56%   1.99%   0.9211
5     UCM, DNSADOS, HTTPFDOS    48716            93.01%   1.81%   0.9312

ACC: Accuracy; DTA: Dictionary Traversal Attack; UCM: Unauthorized Crypto-mining; NS: Network Scanning;
SQLI: SQL Injection; CSS: Cross-site Scripting (XSS); PT: Path Traversal; S-DOS: Slowloris DOS;
BFLA: Brute Force Login Attack; DNSADOS: DNS Amplification DOS; HTTPFDOS: HTTP Flood DOS

6.2 Attack specific classification on NSL-KDD Dataset

A variety of attacks are captured in the NSL-KDD dataset. We train and
test our models on attack-specific selections from the dataset and present our results here.

Table 6.4: Performance of CVAE DDQN model on attack-specific selections from the NSL-KDD dataset

Sl No   Attack Type   ACC      FPR      AUC
1       DOS           98.80%   4.1%     0.974
2       Probe         86.01%   11.41%   0.8421
3       R2L           88.41%   0.33%    0.8408
4       L2R           90.88%   4.08%    0.8891

Table 6.5: Performance of PPO Actor Critic model on attack-specific selections from the NSL-KDD dataset

Sl No   Attack Type   ACC      FPR     AUC
1       DOS           96.10%   3.4%    0.945
2       Probe         88.01%   6.21%   0.8946
3       R2L           85.12%   0.16%   0.8025
4       L2R           88.84%   3.48%   0.8664

Chapter 7

Conclusion and Future Work

7.1 Future Work

In the future, I plan to build a distributed, highly scalable IDS that works well in real
time in real-world scenarios. The datasets and experiments can also be exploited further
to gain deeper insights. In addition, deployment and testing can be done on a real-world smart grid
to obtain a realistic performance evaluation of our models and to create a smart-grid-specific Intrusion
Detection System. I would also like to make use of a GAN-based Reinforcement Learning model and
a Rainbow DQN based model in the future.

7.2 Conclusion

During my B.Tech project work under Dr. Padmalochan Bera, I worked on building advanced DRL-guided
NIDSs that could provide very high accuracy and low FPR when evaluated in a smart-grid-specific
environment. The major task was to find and preprocess the large volume of data that the
datasets carried. We obtained a few hundred MB of highly relevant data from about 8 TB
of data present in the ISOT-CID dataset. We also worked to extract relevant data from the
NSL-KDD dataset. Our aim was to meet the real-world constraints of limited processing resources and
to ensure that our models adapt to novel attacks and changing attack patterns. For this,
we experimented on the dataset with a flow-based technique that is computationally lighter.
We introduced the Double Deep Q Network-based IDS to handle the overestimation of action values
in the Deep Q Learning-based model. We combined the model with Curiosity Driven Autoencoders
to generate meaningful experiences with which to retrain our model. We also proposed the Proximal Policy


Optimised Actor Critic Reinforcement Learning Model, which achieves similar accuracy with a lower FPR
and much lower computational and logical complexity. The experiments show highly desirable results.
We also conclude that, while both models give excellent overall results, CVAE DDQN is more suitable
when higher accuracy is required, while PPO Actor Critic is more suitable when a lower FPR is required.
Section 6.1 shows our system's ability to handle newer attack types (even with very little training data). The
experimental results show the high usability and effectiveness of the models for deployment in smart grid
platforms. We intend to deploy the proposed architectures in a practical smart grid environment and
evaluate their performance in the future. I played an integral role in all the processes that were part
of the design and implementation of our models and take great pleasure in the encouraging
results.

Chapter 8

Publications

1. [Published] Kamalakanta Sethi, Dinesh Mohanty, Rahul Kumar, Padmalochan Bera, "Robust
Adaptive Cloud Intrusion Detection System Using Advanced Deep Reinforcement Learning,"
2020. In book: Security, Privacy, and Applied Cryptography Engineering. doi: 10.1007/978-3-030-66626-2_4

2. [Accepted] Dinesh Mohanty, Kamalakanta Sethi, Sai Prasath, Padmalochan Bera, "Intelligent
Intrusion Detection System for Smart Grid Applications", 2021 International Conference
on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA).

Chapter 9

References

Bibliography

[1] ISOT-CID website.

[2] Tranalyzer documentation.

[3] Abdulaziz Aldribi, Issa Traore, Onyekachi Nwamuo, and Paulo Magella de Faria Quinan. Documentation for the ISOT cloud intrusion detection benchmark dataset (ISOT-CID), 2020.

[4] Abdulaziz Aldribi, Issa Traoré, Belaid Moa, and Onyekachi Nwamuo. Hypervisor-based cloud intrusion detection through online multivariate statistical change tracking. Computers & Security, 88:101646, 2020.

[5] Z. Chiba, N. Abghour, K. Moussaid, A. El omri, and M. Rida. A cooperative and hybrid network intrusion detection framework in cloud computing based on Snort and optimized back propagation neural network. Procedia Computer Science, 83:1200–1206, 2016. The 7th International Conference on Ambient Systems, Networks and Technologies (ANT 2016) / The 6th International Conference on Sustainable Energy Information Technology (SEIT-2016) / Affiliated Workshops.

[6] Y. Yang et al. Intrusion detection system for network security in synchrophasor systems. In IET Int. Conf. Inf. Commun. Technol. (IETICT), pages 246–252, 2013.

[7] N. Goldenberg and A. Wool. Accurate modeling of Modbus/TCP for intrusion detection in SCADA systems. In Int. J. Critical Infrastructure Protection, volume 06, pages 63–75, 2016.

[8] Han GJ., Zhang XF., Wang H., Mao CG. Curiosity-driven variational autoencoder for deep Q network. In Lauw H., Wong RW., Ntoulas A., Lim EP., Ng SK., Pan S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol 12084. Springer, Cham, pages 1–6, 2020.

[9] Hashem Alaidaros, Massudi Mahmuddin, and Ali Al Mazari. An overview of flow-based and packet-based intrusion detection performance in high speed networks. In Proceedings of the International Arab Conference on Information Technology, 2011.

[10] H. A. Kholidy and F. Baiardi. CIDD: A cloud intrusion detection dataset for cloud computing and masquerade attacks. In 2012 Ninth International Conference on Information Technology - New Generations, pages 397–402, 2012.

[11] Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Society for Industrial and Applied Mathematics, 42, 04 2001.

[12] Z. Li, W. Sun, and L. Wang. A neural network based distributed intrusion detection system on cloud platform. In 2012 IEEE 2nd International Conference on Cloud Computing and Intelligence Systems, volume 01, pages 75–79, 2012.

[13] M. A. Faisal, Z. Aung, J. R. Williams, and A. Sanchez. Data-stream-based intrusion detection system for advanced metering infrastructure in smart grid: A feasibility study. In IEEE Syst. J., volume 09, pages 31–44, 2015.

[14] Manuel López Martín, Belén Carro, and Antonio Sánchez-Esguevillas. Application of deep reinforcement learning to intrusion detection for supervised problems. Expert Syst. Appl., 141, 2020.

[15] Mnih V., Kavukcuoglu K., Silver D., et al. Prioritized experience replay. In Nature 518, pages 1–6, 2015.

[16] N. Boumkheld, M. Ghogho, and M. El Koutbi. Intrusion detection system for the detection of blackhole attacks in a smart grid. In Proc. 4th Int. Symp. Comput. Bus. Intell. (ISCBI), volume 01, pages 108–111, 2016.

[17] S. Parampottupadam and A. Moldovann. Cloud-based real-time network intrusion detection using deep learning. In 2018 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), pages 1–8, 2018.

[18] S. Pan, T. Morris, and U. Adhikari. Developing a hybrid intrusion detection system using data mining for power systems. In IEEE Trans. Smart Grid, volume 06, pages 3104–3113, 2015.

[19] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization, 2017.

[20] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.


[21] K. Sethi, R. Kumar, N. Prajapati, and P. Bera. Deep reinforcement learning based intrusion detection system for cloud infrastructure. In 2020 International Conference on COMmunication Systems & NETworkS (COMSNETS), pages 1–6, 2020.

[22] Kamalakanta Sethi, Rahul Kumar, Dinesh Mohanty, and Padmalochan Bera. Robust adaptive cloud intrusion detection system using advanced deep reinforcement learning. In Lejla Batina, Stjepan Picek, and Mainack Mondal, editors, Security, Privacy, and Applied Cryptography Engineering, pages 66–85, Cham, 2020. Springer International Publishing.

[23] S. Thrun and A. Schwartz. Prioritized experience replay. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum, pages 1–6, 1993.

[24] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[25] Wei Xiong, Hanping Hu, Naixue Xiong, Laurence T. Yang, Wen-Chih Peng, Xiaofei Wang, and Yanzhen Qu. Anomaly secure detection methods by analyzing dynamic characteristics of the network traffic in cloud communications. Information Sciences, 258:403–415, 2014.

[26] Y. Kwon, H. K. Kim, Y. H. Lim, and J. I. Lim. A behavior-based intrusion detection technique for smart grid infrastructure. In IEEE Eindhoven PowerTech, volume 0, pages 1–6, 2015.

[27] H. Yoo and T. Shon. Novel approach for detecting network anomalies for substation automation based on IEC 61850. In Multimedia Tools Appl., volume 74, pages 303–318, 2015.

[28] Z. Feng, S. Qin, X. Huo, P. Pei, Y. Liang, and L. Wang. Snort improvement on PROFINET RT for industrial control system intrusion detection. In 2nd IEEE Int. Conf. Comput. Commun. (ICCC), pages 942–946, 2016.

[29] A. Aldribi, I. Traore, and B. Moa. Data sources and datasets for cloud intrusion detection modeling and evaluation. 2018.
