
Ftawu Tekola Proposal After DEfeince2 - For Merge


TECHNICAL VOCATIONAL TRAINING INSTITUTE

SCHOOL OF POSTGRADUATE STUDIES


FACULTY OF ELECTRICAL ELECTRONICS AND INFORMATION
COMMUNICATION TECHNOLOGY
Department of Information Communication Technology (ICT)

Prepared by: Fitawu Tekola


ETHIOPIAN TECHNICAL UNIVERSITY
Faculty of Electrical Electronics and Information Communication Technology

Department of Information Communication Technology (ICT)

Proposal title: “Improve Intrusion detection model using machine learning: the case of
Ethiopian WoredaNet network”

By: Fitawu Tekola

Advisor : Prof. Babu


Signature___

Addis Ababa, Ethiopia

March 2022
Declaration
I, Fitawu Tekola, the undersigned, declare that this thesis entitled: “Improve Intrusion
detection model using machine learning: the case of Ethiopian WoredaNet network” is
my original work. I have undertaken the research work independently with the guidance and
support of the research advisor. This study has not been submitted for any degree or diploma
program in this or any other institution, and all sources of materials used for the thesis
have been duly acknowledged.
Declared by

Name___________________________
Signature: ____________________
Department ____________________
Date __________________________
Name: <Fitawu Tekola Mekuria>

Title: <Improve Intrusion detection model using machine learning: the case of
Ethiopian WoredaNet network>

Proposal defense comments, with the updates/corrections made and the page number of each correction:

No. 1
Proposal defense comment: The problem statement is a general problem, not related to the case area.
Updates/corrections made (page 3): With the rapid growth of the Internet and the ever-increasing security problems associated with its popularity, the need for protection against unwanted intruders has become very important. So intrusion detection systems are widely applied in different organizations.
• By using unauthorized access, hackers and intruders can make many successful attempts to crash the networks and web services.
• Legitimate users are unable to access information systems, devices, or other network resources due to the actions of a malicious cyber threat actor.
• Termination, misuse and denial-of-service problems arise. Some of the reasons are the absence of proper, continual, and consistent security mechanisms, which allows attacks and threats to occur.

No. 2
Proposal defense comment: The proposed solution needs modification.
Updates/corrections made (pages 22-23): The selected model implementation has the following steps:

No. 3
Proposal defense comment: Sample attributes are not mentioned.
Updates/corrections made (page 18): The dataset is checked for the consistency of individual attribute values and types, quantity and missing values, using the WEKA preprocessing panel and Microsoft Excel.

Local Examiner Name& Signature: _________________________ _________________

Expatriate Examiner Name& Signature: Dr Vikrant

Name of the Advisor: _________________________________ ________________


Certification
This is to certify that the thesis proposal prepared by Fitawu Tekola, entitled “Improve
security model using machine learning: the case of Ethiopian woredanet network” and
submitted in partial fulfillment of the requirements for the Degree of M.Sc. in the Department of
ICT, complies with the regulations of the University and meets the accepted standards with respect to
originality and quality.

Name of Candidate: _____________; Signature: _______________Date: _____________.


Name of Advisor: _____________. Signature: _______________Date: _____________.

Signature of Board of Examiners:


External examiner: ____________________Signature: ____________Date: _____________.
Internal examiner: ____________________Signature: ____________Date: _____________.
Dean, SGS: __________________________ Signature: ____________Date: _____________.
Abstract
Network and system security is of paramount importance in the present data communication
environment and has become an essential component of Information and Communication
Technology (ICT) infrastructure, protecting intangible as well as tangible assets (i.e.
software and hardware) in the public and private domains. With the widespread use of
information technology applications, organizations are becoming more prone to security
threats against their resources, no matter how strictly security policies and mechanisms are
implemented. As networks play a more important role in modern society, crimes using
computer networks are also becoming more common and are accelerating. Through unauthorized access,
hackers and intruders can make many successful attempts to crash networks
and web services. Intrusion detection systems are the ‘burglar
alarms’ (or rather ‘intrusion alarms’) of the computer security field. The aim is to defend a
system by using an alarm that sounds whenever the site’s security has been
compromised. So, establishing a safe and strong network system that guarantees the security
of information requires a research focus. Machine learning and data mining
techniques have been widely used in recent years to improve network intrusion detection.
These techniques make it possible to automate anomaly detection in network traffic.
In this thesis, four supervised algorithms (Naive Bayes, Decision Tree, Support Vector
Machine and Instance-Based Learning) will be compared on the NSL-KDD dataset in order to
enhance network traffic security in terms of prediction accuracy and detection rate and to
reduce the false alarm rate at run time.

Keywords: Wrapper, Machine Learning, Accuracy, Detection Rate, False alarm rate
Table of Contents
Declaration
Certification
Abstract
List of Figures
CHAPTER ONE
1. INTRODUCTION
1.1. Background
1.2. Problem Statement
1.3. Objective
1.3.1. General Objective
1.3.2. Specific Objectives
1.4. Research Questions
1.5. Motivation
1.6. Hypothesis of the Study
1.7. Scope and Limitation of the Study
1.8. Limitations
1.9. Significance of the Study
1.10. Organization of the Thesis
CHAPTER TWO
2. Review of Related Literature
2.1. Theoretical Literature Review
2.2. Empirical Literature Review
2.3. Intrusion Detection System Technologies
2.4. Classification of IDS based on Analysis type
2.4.1. Signature-Based Detection
2.4.2. Anomaly-Based Detection
2.5. Data mining and Intrusion Detection
2.5.1. Clustering
2.6. Classification
2.7. Conceptual Framework
CHAPTER THREE
3. Research methodology (materials and methods)
3.1. Methodology
3.2. Materials
3.2.1. Software used
3.2.2. Hardware used
3.3. Data Collection
3.3.1. Data Preparation
3.3.2. NSL-KDD Data Set
3.4. Methods of Training and Testing
3.5. Methods of Analysis and Evaluation of System Performance
3.5.1. Confusion Matrix
3.6. Work Flow
3.7. Work break down (WBD)
3.7.1. Gantt chart for proposed thesis
3.8. Budget estimating for proposed thesis
Ethical Considerations
References

List of Figures
Figure 2: Feature selection method
Figure 3: 10-fold cross validation
Figure 4: Work Flow of Model Selection
Figure 5: Gantt chart

List of Tables
Table 1: Work break down
Table 2: Budget estimating for proposed thesis
CHAPTER ONE

1. INTRODUCTION

1.1. Background
Network and system security is of paramount importance in the present data
communication environment and has become an essential component of Information and
Communication Technology (ICT) infrastructure that protects intangible as well as
tangible assets (i.e. software and hardware) in the public and private domains. With the
widespread use of information technology applications, organizations are
becoming more prone to security threats against their resources, no matter how
strictly security policies and mechanisms are implemented. As networks play a more
important role in modern society, crimes using computer networks are also becoming
more common and are accelerating. Through unauthorized access, hackers and intruders can make
many successful attempts to crash networks and web services. So,
establishing a safe and strong network system that guarantees the security of information
requires a research focus [1], and new threats and the associated solutions to prevent
them are emerging together with the evolution of secured systems [2].

Nowadays, network security devices are equipped with one or more security services,
including firewalls, Intrusion Prevention/Detection Systems (IPS/IDS) and Data Leak
Prevention (DLP), and also provide content security filtering functions such as anti-spam,
anti-virus or URL filtering. These services and functions have increasingly been
integrated into Unified Threat Management (UTM) systems or Next Generation (NG)
firewalls. Security devices need continuous testing to guarantee that they are
efficient, precise and useful, while simultaneously maintaining acceptable performance.
Network security is a critical concern for enterprises, government agencies and
organizations of all sizes [3].

WoredaNet is an Ethiopian Government administrative ICT network, like HealthNet
and AgriNet, which aims to deliver an interconnected communication network for
different regional and federal government organizations as a service provider.

WoredaNet is a Wide Area Network intended to link all woreda administrative units in
Ethiopia. It aims to build a terrestrial and satellite-based network connecting the lowest
levels of government.

The country is divided into nine ethnically based administrative regions (kililoch,
sing. kilil), which function as autonomous entities. The overall objective of the
WoredaNet ICT network is to deliver IP-based services such as video conferencing,
directory, messaging, VoIP and Internet through the use of a broadband terrestrial and
satellite-based network.

Figure 1: Ethiopian Government WoredaNet Network Architecture [4]

The WoredaNet ICT network is shown in Figure 1. All public sectors in the
country are to be connected through it to get e-services from the National Data Center. Based on
this initiative, there is one National Data Center and nine Regional Data Centers in
the country. The National Data Center is connected to the Regional Data Centers and
to all woredas that have the WoredaNet ICT network in the country. Over 700 woredas
(districts) are interconnected [5].

According to [4,5], several government applications for e-services are already
in place (or planned) through the Woreda Information Systems (WIS).
Thirteen ministries and nineteen agencies/offices are to be connected to
WoredaNet. They will have independent servers locally and at the national data center,
which can be used for video conferencing, hosting, applications, directory service, email,
voice over IP, internet service, electricity bill information, exchange rate
information, national exam student result information, etc.

According to [6], a framework is a real or conceptual structure intended to serve as a guide
for building the architecture of the entire system. Security frameworks come from a
variety of sources in order to address a number of objectives. Hence, the proposed
work aims to provide a framework for identifying security problems in the data
centers of the WoredaNet network and to give solutions and guidance for preparing corrective
measures based on network traffic analysis.

With the rapid growth of the Internet and the ever-increasing security problems
associated with its popularity, the need for protection against unwanted intruders has
become very important [7]. So intrusion detection systems are widely applied in different
organizations.

• By using unauthorized access, hackers and intruders can make many successful
attempts to crash the networks and web services.

• Legitimate users are unable to access information systems, devices, or other
network resources due to the actions of a malicious cyber threat actor.

• Termination, misuse and denial-of-service problems arise. Some of the reasons are
the absence of proper, continual, and consistent security mechanisms, which allows
attacks and threats to occur.

1.2. Objective

1.2.1. General Objective


The general objective of this research is to improve the security model of the Ethiopian
WoredaNet network using machine learning.

1.2.2. Specific Objectives


This research will have the following specific objectives:

• To assess the existing security trends and address security holes inside the data centers
on the WoredaNet network.

• To improve the security of the Ethiopian WoredaNet network.

• To review literature on the concept of intrusion detection systems in the area of data
mining.

• To improve the security model used to analyze illegitimate activities in the WoredaNet network.

• To define system and user requirements.

• To improve the IDS using machine learning capabilities.

• To analyze network traffic by considering different security parameters, which
enables organizations to have an overall view of their data center security status.

• To recognize the security systems organizations need in order to protect their computer
networks and reduce the risk of compromising their information.

1.3. Research Questions


Security is one of the issues that need to be addressed in order to gain the intended
benefits of data centers. This research will address the following questions:

• What is a possible intrusion detection model for the WoredaNet network regional
state data centers?

• Which detection technique is best to use as a model to enhance data center
security?

• How can machine learning algorithms be customized to improve the IDS?

• What are the problems of the intrusion detection model currently available in the
WoredaNet data centers?

1.4. Motivation
Security issues have been addressed with standard software and hardware
implementations. But despite all these security efforts, security remains an issue in
the WoredaNet network data centers, including the national data center, which is the
gateway of the WoredaNet network. These circumstances motivated me to work on and
contribute to the WoredaNet network security issue.

1.5. Hypothesis of the Study


The hypothesis of this thesis is that applying feature selection techniques to machine
learning algorithms will enhance intrusion detection accuracy while keeping the false
alarm rate acceptable.
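
The three quantities named in the hypothesis and the abstract (accuracy, detection rate and false alarm rate) can be computed from the four counts of a binary confusion matrix. The following is a minimal sketch, assuming that "attack" is treated as the positive class; the function name `ids_metrics` and the counts in the usage lines are hypothetical and are not results of this study.

```python
def ids_metrics(tp, tn, fp, fn):
    """Return (accuracy, detection_rate, false_alarm_rate) from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal traffic passed as normal fp: normal traffic flagged as attack
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Detection rate: share of real attacks that were detected.
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    # False alarm rate: share of normal traffic wrongly flagged as an attack.
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, detection_rate, false_alarm_rate


# Hypothetical counts, for illustration only.
acc, dr, far = ids_metrics(tp=90, tn=880, fp=20, fn=10)
print(f"accuracy={acc:.3f} detection_rate={dr:.3f} false_alarm_rate={far:.3f}")
```

A model can score well on accuracy while still missing many attacks when normal traffic dominates the data, which is why the detection rate and false alarm rate are reported alongside it.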

1.6. Scope and Limitation of the Study


The survey is applied to the WoredaNet ICT infrastructure to suggest security
requirements and a model for dealing with security threats within it. This work uses structured
questionnaires, interviews and observation at the WoredaNet network data centers to identify
and suggest security requirements, as well as to suggest an intrusion detection system (IDS)
security model as a solution.

This work does not look at other components of an Intrusion Detection System (IDS), such as
data collection and response, since these form a very wide area of research.

The NSL-KDD dataset was collected for research purposes and is used in a simulated
environment, so it might not be a perfect representation of real threats. However, it shows
the effectiveness of anomaly-based intrusion detection systems in overcoming the
weaknesses of signature-based intrusion detection systems.

1.7. Limitations
Limitations are boundaries that interfere with what is possible beyond a certain extent.

• The tools used for implementation are not fully free.

• Literature related to government services in Ethiopia, especially WoredaNet, is not
available on the internet.

• Lack of locally adapted technology standards for information technology systems is a
major concern.

• Several major local languages are widely used in Ethiopia, and they vary across the
regions.

• This work uses structured questionnaires, interviews and observation at the WoredaNet
network data centers to identify and suggest security requirements, as well as to suggest an
intrusion detection system (IDS) security model as a solution. This requires more time to
complete the thesis and to achieve the expected objectives.

1.8. Significance of the Study


This study is significant for the following:

• Companies and home users

Network security is a high-priority concern for companies and home users. Because they
keep sensitive information on their computers, there is a great need to protect that
information from unauthorized access. Even though the conventional approach to securing
computer systems against cyber threats (which uses firewalls, authentication tools and
virtual private networks) creates a protective shield, it almost always has vulnerabilities.
Therefore, something must be done to improve information security efficiency.

• Researcher

The research work has explicit significance in developing the researcher's knowledge
and can be used as a benchmark by interested researchers exploring issues in
the area.

• Regional office

Network security is a high-priority concern for the Ethiopian WoredaNet network. Because
sensitive information is kept in its data centers, there is a great need to protect that
information from unauthorized access. Even though the conventional approach to securing
data center systems against cyber threats creates a protective shield, it almost always
has vulnerabilities. Therefore, something must be done to improve information security
efficiency, so this work is used to improve the intrusion detection system for the
Ethiopian WoredaNet network.

• Government

If WoredaNet network security is poor, users do not have confidence in the
agencies it serves. By improving the security of WoredaNet, this study can help the federal
and regional governments gain the confidence of WoredaNet users, so the government
is one of the beneficiaries of this study.

1.9. Organization of the Thesis


This thesis consists of five chapters. The first chapter deals with the general
introduction of the study, including the background of the study, statement of the problem,
objectives, scope and significance of the research. The second chapter is devoted to
a literature review of published works, journals and books, gathering information
relevant to the research work. It gives background knowledge about network security,
intrusion detection, intrusion datasets for machine learning algorithms, and
related work. The third chapter presents materials and methods, including the methodology
of the study, software used, hardware used, data collection, and methods of training and
testing. The fourth chapter covers results and discussion; the results of the experiments are
analyzed and interpreted in this chapter. Finally, chapter five presents the conclusion, which
summarizes the major points of the research, and puts forward recommendations for
further research.

CHAPTER TWO

2. Review of Related Literature

2.1. Theoretical Literature Review


IDS has been the subject of many research studies over the years, aimed at identifying
various malicious activities on the network. These studies include detection and
prevention through novel IDS/IPS methods. Nowadays, machine learning has also
become a popular technique among researchers for detecting anomalies in the
network. Some of the similar, related and recent studies have been reviewed
by the researcher.

Intrusion detection systems (IDS) are network security appliances that monitor network
and/or system activities for malicious activity. An IDS can be any device or software that
exercises access control to protect computers from illegal exploitation. "Intrusion
prevention" technology is considered by some to be an extension of intrusion detection
(ID) technology, but it is actually another form of access control, like an application
layer firewall [7].

2.2. Empirical Literature Review


S. Revathi and A. Malathi [10] compared various machine learning techniques for
intrusion detection using the NSL-KDD dataset. The authors selected 15 of the 41
features and then analyzed and compared the Random Forest, SVM, CART, J48 and Naive
Bayes algorithms. The authors argued that some of the features in the dataset are
redundant and irrelevant for the process. Correlation-based Feature Selection (CFS)
subset evaluation was used to reduce the dimensionality of the dataset. The experiment
showed that the Random Forest algorithm has higher test accuracy than all the other
algorithms, with and without feature reduction of the dataset. An average
accuracy of 98.88% was observed for the classes Normal, DoS, Probe, U2R and R2L
with the 15 selected features, and an average accuracy of 97.94% was observed
for the same classes with all 41 features.

R. S. Bapi et al. [12] proposed an intrusion detection system using mutual information and a
multi-layer neural network. In this work, a simplified mutual information-based feature
selection algorithm selects the next feature by considering only the most recently selected
feature. The multi-layer neural network is trained on the selected features, and it
achieved 99.34% accuracy when the selected feature count reached 19. The authors
claimed that the accuracy of the proposed model is higher than that of the existing proposals
they considered. They also stated that the proposed work reduced the computational
resources and time needed to train the model.
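
The idea of selecting the next feature by looking only at the most recently selected one can be illustrated with a small greedy procedure. The sketch below is not the algorithm of [12]; it assumes, for illustration only, that each candidate is scored as its mutual information with the class label minus its mutual information with the last selected feature, and that all features are discrete. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def select_features(columns, labels, k):
    """Greedily pick k feature names from `columns` (name -> list of values).

    Assumed scoring rule: relevance to the label minus redundancy with the
    most recently selected feature only.
    """
    remaining = dict(columns)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(remaining[name], labels)
            if not selected:
                return relevance
            redundancy = mutual_information(remaining[name], columns[selected[-1]])
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical toy data: 8 records, label 1 = attack, 0 = normal.
labels = [0, 0, 0, 0, 0, 1, 1, 1]
columns = {
    "f_relevant":      [0, 0, 0, 0, 1, 1, 1, 1],  # strongly related to the label
    "f_redundant":     [0, 0, 0, 1, 1, 1, 1, 1],  # near-copy of f_relevant
    "f_complementary": [0, 0, 1, 1, 0, 0, 1, 1],  # weaker, but independent of f_relevant
}
print(select_features(columns, labels, k=2))
```

In this toy example the near-copy is passed over in the second round because most of what it says about the label is already carried by the first selected feature.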

Pragyan D. and R. C. Jain [11] designed a combined approach to an Intrusion Detection
System based on machine learning techniques. The authors proposed a hybrid
approach that combines K-Medoids clustering and Naive Bayes
classification. The proposed work explores Naive Bayes classification and K-Medoids
methods for intrusion detection and how they are useful for IDS. The authors argued that the
reason for introducing Naive Bayes classification is the involvement of many features for which
there is no deviation between normal operations and anomalies. An experiment was carried out
to evaluate the performance of the proposed approach using a dataset the authors created themselves.
They also claim that the proposed approach performed better in terms of accuracy and
detection rate, with a reasonable false alarm rate.

Tigabu Dagne Akal [29] constructed a supervised model for network intrusion
detection based on a comparison of classification algorithms.

The IDS models in that study were developed on the full training NSL-KDD (Network
Security Laboratory - Knowledge Discovery in Databases) dataset using WEKA, a powerful
machine learning and data mining tool. The data mining model used in the
study is the KDD process. The KDD process refers to the whole process of changing
low-level data into high-level knowledge through the automated discovery of patterns and
relationships in large databases, and data mining is one of the core steps in the KDD
process. The goal of KDD and DM is to find interesting patterns and/or models that
exist in databases but are hidden among the volumes of data (Fayyad et al., 1996). The
KDD process as described by Fayyad et al. (1996) consists of five major phases. Data
were collected and mined using appropriate algorithms, and the mined patterns were then modeled.

The proposed model would offer the advantage of considering unlabeled records.
In this case, only the top few most confident data points were labeled, leaving
the class of the remaining records empty. Semi-supervised learning is described as more suitable
for intrusion detection because it requires only a small quantity of labeled data while still taking
advantage of large quantities of unlabeled data. Both the J48 decision tree algorithm
and the Naïve Bayes simple algorithm were tested as classification approaches for
building a predictive model for intrusion detection.
The result of the study showed that the J48 decision tree algorithm with the cross-
validation test mode and other default values is appropriate in the area of intrusion
detection.
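
The cross-validation test mode mentioned above splits the data into k folds, trains on k-1 of them and tests on the remaining one, so that every record is used for testing exactly once. The sketch below shows only the index bookkeeping for 10 folds; it is a simplified illustration with a hypothetical function name and, unlike stratified cross-validation, it does not balance the class distribution across folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Usage on a hypothetical dataset of 25 records.
splits = list(k_fold_indices(25, k=10))
for fold_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

The performance figures of the k runs are then averaged to give a single estimate for the model.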

An intrusion attempt or intrusion can be defined as the potential possibility of a
deliberate unauthorized attempt or action to access information, manipulate information,
or render a system unreliable or unusable. Recent Intrusion Detection Systems (IDSs),
which are used to monitor real-time attacks on computer and network systems, are still
faced with problems of low detection rates, high false positive rates, high false negative rates and
alert flooding. The author of [30] presented a neural network-based approach that combined
supervised and unsupervised learning techniques.

2.3. Intrusion Detection System Technologies


As discussed in [14], IDSs focus on identifying possible incidents. For example, an IDS
could detect when an attacker has successfully compromised a system by exploiting
a vulnerability in the system. The IDS could then report the incident to security
administrators, who could quickly initiate incident response actions to minimize the
damage caused by the incident. The IDS could also log information that could be used
by the incident handlers. As discussed in [15], many IDSs can also be configured to
recognize violations of security policies. For example, some IDSs can be configured
with firewall-rule-like settings, allowing them to identify network traffic that violates
the organization’s security or acceptable use policies. Also, some IDSs can monitor file
transfers and identify ones that might be suspicious, such as copying a large database
onto a user’s laptop.

Many IDSs can also identify reconnaissance activity, which may indicate that an attack is
forthcoming [16]. For example, some attack tools and forms of malware, particularly
worms, perform reconnaissance activities such as host and port scans to identify targets
for subsequent attacks. An IDS might be able to block reconnaissance and notify
security administrators, who can take action if needed to alter other security controls to
prevent related incidents. Because reconnaissance activity is so frequent on the Internet,
reconnaissance detection is often performed primarily on protected internal networks. In
addition to identifying incidents and supporting incident response efforts, organizations
have found other uses for IDSs, including identifying security policy
problems, documenting the existing threat to an organization and deterring individuals
from violating security policies. Because of the increasing dependence on information
systems and the prevalence and potential impact of intrusions against those systems,
IDSs have become a necessary addition to the security infrastructure of nearly every
organization [15].

2.4. Classification of IDS based on Analysis type


There are two main approaches to analyzing events in order to detect attacks: signature
detection and anomaly detection. Some machine learning approaches are also included
here. Signature detection is the technique used by most commercial systems. Anomaly
detection, in which the analysis looks for unusual patterns of activity, has been and
remains under investigation [17] [18].

2.4.1. Signature-Based Detection


Signature-based detectors analyze system activities, looking for events that match a
predefined pattern or signature describing a well-known attack. They collect network
traffic and then proceed to analyze it [17].

The analysis is based on a comparison of patterns (pattern matching). The system
contains a database of attack patterns and looks for similarities with them; when a
match is detected, a warning is sent. The proper operation of such a system depends not
only on a good installation and configuration, but also on keeping the database of attack
patterns up to date.

The advantages of signature-based detection are that signature detectors are very
effective in detecting attacks without generating a large number of false alarms, and
that they can quickly and accurately diagnose the use of a specific attack technique.
This helps those responsible for security to follow security problems easily and to
prioritize corrective actions.

The disadvantages of signature-based detection are that signature detectors only detect
the attacks they already know, so they must be constantly updated with signatures of
new attacks, and that many signature detectors are designed to use very tight patterns
that prevent them from detecting variants of common attacks [17].
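
The pattern-matching idea described above can be illustrated with a minimal sketch. The
signature names and byte patterns below are hypothetical examples chosen for
illustration and are not taken from any real IDS rule set; a production system would use
a far richer rule language.

```python
# Hypothetical signature database: name -> byte pattern seen in a known attack.
SIGNATURES = {
    "directory-traversal": b"../../",
    "sql-injection": b"' OR '1'='1",
}


def match_signatures(payload: bytes, signatures=SIGNATURES):
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


# A request that contains the known pattern raises an alert.
alerts = match_signatures(b"GET /../../etc/passwd HTTP/1.1")

# A URL-encoded variant of the same attack slips past the tight pattern.
variant = match_signatures(b"GET /..%2f..%2fetc/passwd HTTP/1.1")
```

The second call returns no alert, which illustrates the disadvantage noted above: a
tightly written signature misses variants of a common attack.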

2.4.2. Anomaly-Based Detection


Anomaly detection focuses on identifying unusual behavior in a host or a network.
Anomaly detectors operate on the assumption that attacks differ from normal activity.
They construct profiles representing the normal behavior of users, hosts or network
connections [18]. These profiles are constructed from historical data collected during
normal operation. The detectors then collect event data and use a variety of measures
to determine when the monitored activity deviates from normal activity. The measures
and techniques used in anomaly detection include threshold detection on certain
attributes of user behavior. Such attributes may include the number of files accessed by
a user in a given period of time, the number of unsuccessful attempts to enter the
system, the amount of CPU used by a process, and so on. The threshold can be static or
heuristic. They also include statistical measures, which can be parametric, where it is
assumed that the distribution of the profiled attributes fits a certain pattern, or
non-parametric, where the distribution of the profiled attributes is learnt from
historical values observed over time [18].

The advantages of anomaly-based detection are that IDSs based on anomaly recognition
detect unusual behavior, so they are able to detect attacks for which they have no
specific knowledge, and that anomaly detectors produce information that is very useful
for defining new patterns for signature detection.

The disadvantages of anomaly-based detection are that it produces a high number of
false alarms due to the unpredictable behavior of users and networks, and that it
requires extensive training to characterize patterns of normal behavior [18].
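
As a rough illustration of a parametric statistical measure, the sketch below builds a
profile (mean and standard deviation) from historical values of one attribute and flags
values that deviate strongly from it. The attribute, the sample history and the
three-standard-deviation rule are assumptions made for this example only.

```python
from statistics import mean, stdev


def build_profile(history):
    """Summarize historical values of one attribute as (mean, standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value, profile, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma


# Hypothetical history: failed login attempts per hour during normal operation.
failed_logins_per_hour = [0, 1, 0, 2, 1, 0, 1, 1, 0, 2]
profile = build_profile(failed_logins_per_hour)

print(is_anomalous(1, profile))   # a typical hour
print(is_anomalous(25, profile))  # a burst of failed logins
```

Choosing k trades detection rate against false alarms: a smaller k catches more
attacks but also flags more legitimate but unusual behavior.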

2.5. Data mining and Intrusion Detection


Data mining supports data analysis in a wide range of applications and has recently
become an important component of intrusion detection systems. Data mining approaches
such as classification, clustering and association rules are frequently used to analyze
network data and gain intrusion-related knowledge. The main advantage of using machine
learning is that, once an algorithm has learnt what to do with the data, it can do its
work automatically.

2.5.1. Clustering
Because the number of available network audit data instances is large, human labeling is
time-consuming and expensive. Clustering helps to process and label the data by
assigning instances to groups. Clustering algorithms can group new data instances into
similar groups, and these groups can be used to increase the performance of existing
classifiers. High-quality clusters can also assist human experts with labeling.
Clustering can discover complex intrusions that occur over extended periods of time and
in different places by correlating independent network events [19]. The data belonging
to a cluster are modeled according to predefined metrics and their common features [20].
Clustering is also used to detect hybrid attacks within a cluster.

Clustering is an unsupervised machine learning mechanism for finding patterns in
unlabeled data with many dimensions. K-means clustering is used to find natural
groupings of similar alarm records. Records that are far from any of these clusters
indicate unusual activity that may be part of a new attack. The network data available
for intrusion detection is primarily categorical, with attributes having a small number
of unordered values [21].

The steps involved in identifying intrusions using clustering techniques are as follows:
1. Find the largest cluster, i.e., the one with the largest number of instances, and
label it normal.
2. Sort the remaining clusters in ascending order of their distances to the largest
cluster.
3. Select the first K1 clusters so that the number of data instances in these clusters
sums up to approximately ηN, and label them as normal, where η is the percentage of
normal instances and N is the total number of instances.
4. Label all the other clusters as attacks.
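
The labeling steps above can be sketched as follows. This is a simplified illustration,
not the implementation from [19]: it assumes that the clusters are already given as
centroids and sizes, that distance means Euclidean distance between centroids, and that
the largest cluster counts towards the target number of normal instances.

```python
import math


def label_clusters(centroids, sizes, eta):
    """Label clusters 'normal' or 'attack' given the expected normal fraction eta."""
    target = eta * sum(sizes)  # expected number of normal instances

    # Step 1: the largest cluster is labeled normal.
    largest = max(range(len(sizes)), key=lambda i: sizes[i])
    labels = {largest: "normal"}
    covered = sizes[largest]

    # Step 2: sort the remaining clusters by distance to the largest cluster.
    others = sorted(
        (i for i in range(len(sizes)) if i != largest),
        key=lambda i: math.dist(centroids[i], centroids[largest]),
    )

    # Steps 3 and 4: the nearest clusters are normal until the target is reached;
    # every cluster after that point is labeled as an attack.
    still_normal = True
    for i in others:
        if still_normal and covered + sizes[i] <= target:
            labels[i] = "normal"
            covered += sizes[i]
        else:
            still_normal = False
            labels[i] = "attack"
    return labels


# Hypothetical clustering result: four clusters in a two-dimensional feature space.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
sizes = [600, 200, 150, 50]
labels = label_clusters(centroids, sizes, eta=0.85)
```

With these example values the two clusters closest to the origin are labeled normal and
the two distant ones are labeled as attacks.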

Unlike traditional anomaly detection methods, the approach in [19] clusters data
instances that contain both normal behavior and attacks, using a modified incremental
k-means algorithm. After clustering, heuristics are used to automatically label each
cluster as either normal or attack. The self-labeled clusters are then used to detect
attacks in a separate test dataset [19].

2.6. Classification
An intrusion detection system that classifies audit data as normal or anomalous based
on a set of rules, patterns or other affiliated techniques can be broadly defined as a
classification-based intrusion detection system.

Classification is similar to clustering in that it also partitions records into distinct
segments, called classes. However, unlike clustering, classification analysis requires
that the end user or analyst know ahead of time how the classes are defined. Each record
in the dataset used to build the classifier must already have a value for the attribute
used to define the classes. The objective of a classifier is to decide how new records
should be classified, not to explore the data to discover interesting segments.
Classification is used to assign examples to predefined categories. Machine learning
software performs this task by extracting or learning discrimination rules from
examples of correctly classified data [26].

Data classification for intrusion detection can be achieved by the following basic steps
[27]:

1. For a machine learning program to learn classification models of normal and
abnormal system call sequences, it must be supplied with a set of training data
containing pre-labeled normal and abnormal sequences. Different mechanisms
based on linear discrimination, decision trees or rule-based methods can be used
to scan the normal traces and create a list of unique sequences of system calls.
This list is generally called the normal list.

2. The next step is to scan each of the intrusion traces. Each sequence of system
calls is first looked up in the normal list. If an exact match is found, the
sequence is labeled as normal; otherwise it is labeled as abnormal.

3. Then ensure that the normal traces include nearly all possible normal short
sequences of system calls. An intrusion trace contains many normal sequences in
addition to the abnormal sequences, since the illegal activities only occur in
some places within a trace.
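
A minimal sketch of the normal-list procedure is given below. The system call names,
the traces and the sliding-window length of three are hypothetical values chosen for
illustration; the source does not specify a sequence length.

```python
def windows(trace, k):
    """Return all contiguous sequences of k system calls in a trace."""
    return [tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)]


def build_normal_list(normal_traces, k):
    """Step 1: collect the unique short sequences seen in normal traces."""
    return {w for trace in normal_traces for w in windows(trace, k)}


def label_trace(trace, normal_list, k):
    """Step 2: label each sequence of a trace by exact lookup in the normal list."""
    return [
        (w, "normal" if w in normal_list else "abnormal")
        for w in windows(trace, k)
    ]


# Hypothetical traces of system calls.
normal_traces = [
    ["open", "read", "mmap", "close"],
    ["open", "read", "write", "close"],
]
normal_list = build_normal_list(normal_traces, k=3)

intrusion_trace = ["open", "read", "mmap", "execve", "close"]
labeled = label_trace(intrusion_trace, normal_list, k=3)
for sequence, label in labeled:
    print(sequence, label)
```

The first sequence of the intrusion trace is labeled normal and the remaining two are
labeled abnormal, which reflects step 3: illegal activity only occurs in some places
within a trace.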

2.7. Conceptual Framework

Figure 2. feature selection method

CHAPTER THREE

3. Research methodology (materials and methods)

3.1. Methodology
Both qualitative and quantitative methods will be used to conduct the study. The
National Data Center (NDC) and the Regional Data Centers will be covered by the study,
together with the application software and system software that run on the WoredaNet
VPN. Some regional bureaus and sample woredas that are connected to the WoredaNet VPN
are also part of the study.

The methodology of this research work is divided into two phases in order to achieve
the general and specific objectives.

1. An issue-based analysis technique is used to survey the existing security issues in
the WoredaNet network, using structured questionnaires, structured open
interviews and observation.

2. The proposed technical solution is modeled and implemented. Relevant features are
selected by the Wrapper-Best First technique for four algorithms (Naive Bayes,
Instance-Based Learning, Decision Tree and Support Vector Machine) and
evaluated by 10-fold cross-validation. In the final phase, the same four
algorithms are trained on the training dataset and then evaluated on the test
dataset using the selected features.

The primary information is obtained through structured questionnaires, structured open
interviews and observation. Secondary data sources such as books, reports, journals,
theses, conference articles and white papers from the websites of reliable authors and
organizations are used to obtain information about the Ethiopian WoredaNet
infrastructure and intrusion detection systems.

3.2. Materials

3.2.1. Software used


In this thesis work, the Waikato Environment for Knowledge Analysis (WEKA) V.3.8
software, a data mining tool written in Java, is used. It is intended to aid in the
application of machine learning techniques to a variety of real-world problems. It
contains a collection of tools and algorithms for data preprocessing, clustering,
classification, regression, visualization and feature selection, for data analysis
and predictive modeling, together with graphical user interfaces for easy access to
these functions.

3.2.2. Hardware used


A personal computer (PC) with an Intel Core i7 CPU at 2.7 GHz, 8 GB of RAM and the
64-bit Windows 10 operating system is used for data analysis. External hard disks
(1 TB) and rewritable compact disks are used for backup.

3.3. Data Collection


To assess the existing WoredaNet security issues, survey data will be collected from
three regional data centers and one national data center.

3.3.1. Data Preparation


The dataset is checked for the consistency of individual attribute values and types,
quantity and missing values, using the WEKA preprocessing panel and Microsoft Excel.
Then, the data is converted into the .arff file format that can be understood by the
WEKA tool. For the feature selection analysis, the ten-fold cross-validation evaluation
technique is used.
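
To illustrate the target format, the sketch below writes a few records as ARFF text:
a relation name, one @attribute line per column (numeric, or a set of nominal values in
braces) and a @data section in which missing values are written as "?". The relation
name and the two sample records are made up for illustration; the helper function is
hypothetical and is not part of WEKA.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text. Nominal attributes are given as lists of values."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        # A missing value (None) is written as "?".
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("duration", "numeric"),
    ("protocol_type", ["tcp", "udp", "icmp"]),
    ("src_bytes", "numeric"),
    ("class", ["normal", "anomaly"]),
]
rows = [
    (0, "tcp", 491, "normal"),
    (0, "udp", None, "anomaly"),
]
text = to_arff("nsl_kdd_sample", attributes, rows)
print(text)
```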

3.3.2. NSL-KDD Data Set


This thesis will use the NSL-KDD intrusion dataset. The NSL-KDD dataset has two
important advantages that motivate researchers to use it. First, it does not include
redundant records in the training set, so the classifiers will not be biased towards
more frequent records. Second, the number of selected records from each difficulty
level group is inversely proportional to the percentage of records in the original KDD
data set.

The NSL-KDD data set has four categories of attack classes. These are:

1. DoS: The attacker denies legitimate users access to computing resources or
overloads them so that requests cannot be processed in real time. The result of
this attack is the unavailability of resources, i.e. resources are too busy or
too full to serve legitimate networking requests, which denies users access to a
machine.
2. Probing: The objective of surveillance and other probing attacks is to gain
information about the remote victim, e.g. port scanning. Relevant features:
“duration of connection” and “source bytes”.
3. U2R: Unauthorized access to local super user (root) privileges is an attack
type in which an attacker uses a normal account to log in to a victim system
and tries to gain root/administrator privileges by exploiting some vulnerability
in the victim, e.g. buffer overflow attacks. Relevant features: “number of file
creations” and “number of shell prompts invoked”.
4. R2L: Unauthorized access from a remote machine, in which the attacker intrudes
into a remote machine and gains local access to the victim machine, e.g.
password guessing. Relevant features: network-level features (“duration of
connection” and “service requested”) and host-level features (“number of failed
login attempts”) [21] [22].
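
In practice, each NSL-KDD record carries a specific attack label that has to be mapped
to one of these four categories. The sketch below shows one way to do this; the mapping
is deliberately partial and covers only a few well-known labels, and the function name
is hypothetical.

```python
# Partial mapping from NSL-KDD attack labels to the four attack categories.
ATTACK_CATEGORY = {
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "portsweep": "Probe", "ipsweep": "Probe", "nmap": "Probe", "satan": "Probe",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "warezclient": "R2L",
}


def categorize(label):
    """Map a record label to 'Normal', one of the four categories, or 'Unknown'."""
    if label == "normal":
        return "Normal"
    return ATTACK_CATEGORY.get(label, "Unknown")


for label in ["normal", "neptune", "nmap", "rootkit", "guess_passwd"]:
    print(label, "->", categorize(label))
```

Labels that are missing from the partial table are reported as "Unknown" rather than
being silently assigned to a category.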

3.4. Methods of Training and Testing


For feature selection, the selected algorithms are evaluated using the 10-fold
cross-validation technique. In 10-fold cross-validation, the original sample is randomly
partitioned into 10 equal-sized sub-samples. Of the 10 sub-samples, a single sub-sample
is retained as the validation data for testing the model, and the remaining 10 - 1
sub-samples are used as training data. The cross-validation process is then repeated 10
times, with each of the 10 sub-samples used exactly once as the validation data. The 10
results can then be averaged to produce a single estimate. The advantage of this method
over repeated random sub-sampling is that all observations are used for both training
and validation, and each observation is used for validation exactly once. Therefore,
k-fold cross-validation minimizes the bias introduced by random sampling of the training
and holdout data samples by repeating the experiment k times [23]. Figure 3 shows k-fold
cross-validation with k=10.

Figure 3: 10-fold cross validation
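
The partitioning described above can be sketched in a few lines. This is an
illustrative stand-alone version, not WEKA's implementation; the fixed random seed is
an assumption made so that the example is reproducible.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(100, k=10))
for training, validation in splits:
    print(len(training), len(validation))
```

Each of the 100 example records appears in exactly one validation fold and in the
training data of the other nine rounds.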

For performance evaluation during model development, re-sampling is used as a
preprocessing step to prepare a dataset of reasonable size for training and testing.
Thirty percent of the NSL-KDD dataset is used for model development. Separate training
and testing datasets are used, in proportions of seventy-five percent and twenty-five
percent respectively.
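
The proportions above can be illustrated with a short sketch. It assumes simple random
sampling without replacement and a fixed seed; the exact re-sampling filter and its
settings are not specified in the text, so this is only one possible reading.

```python
import random


def resample_and_split(records, sample_fraction=0.30, train_fraction=0.75, seed=0):
    """Draw a random sample of the records and split it into training and test sets."""
    rng = random.Random(seed)
    sample = rng.sample(records, round(len(records) * sample_fraction))
    cut = round(len(sample) * train_fraction)
    return sample[:cut], sample[cut:]


# Stand-in for the dataset: 1000 record identifiers.
records = list(range(1000))
train, test = resample_and_split(records)
print(len(train), len(test))
```

For 1000 records this yields 300 sampled records, of which 225 are used for training
and 75 for testing.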

3.5. Methods of Analysis and Evaluation of System Performance


Once features are selected using the Wrapper-BestFirst-J48 feature selection technique,
the selected classifiers are compared in terms of overall accuracy, detection rate and
false alarm rate. These performance measures tell us how frequently instances of
particular classes are correctly classified or misclassified (for example, normal
traffic misclassified as anomalous) in the NSL-KDD network traffic dataset.
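
The wrapper idea, in which candidate feature subsets are scored by the classifier
itself, can be sketched as below. Two simplifications are assumed: greedy forward
selection stands in for the best-first search (which can also backtrack), and a toy
scoring function with made-up weights stands in for the cross-validated accuracy of the
J48 classifier.

```python
def forward_select(features, score):
    """Greedily add the feature that improves the score most, until none does."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        candidate = None
        for f in features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best:
                best, candidate, improved = s, f, True
        if improved:
            selected.append(candidate)
    return selected, best


# Hypothetical usefulness of each feature; a real wrapper would train and
# cross-validate the classifier on every candidate subset instead.
USEFULNESS = {"duration": 0.06, "src_bytes": 0.20, "flag": 0.15, "land": 0.00}


def toy_score(subset):
    """Toy stand-in for accuracy: useful features help, every feature has a cost."""
    return 0.5 + sum(USEFULNESS[f] for f in subset) - 0.02 * len(subset)


selected, best = forward_select(list(USEFULNESS), toy_score)
print(selected, round(best, 2))
```

In this toy setting the uninformative feature "land" is never added, because adding it
lowers the score of every subset.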

3.5.1. Confusion Matrix


The confusion matrix is a useful tool for analyzing how well a classifier recognizes the
classes. It is an m-by-m table that cross-tabulates the actual (correct) classes against
the predicted classes. For a classifier to have good accuracy, ideally most of the
tuples would be represented along the diagonal of the confusion matrix, with the rest of
the entries being close to zero [24]. The confusion matrix is the basis for classifier
evaluation metrics such as accuracy, detection rate, false alarm rate, precision, recall
and F-measure. Table 7 shows a simple confusion matrix for a two-class classification
result that contains both the predicted and actual classes of normal and anomalous
traffic in the dataset.

Predicted Class      Actual Class
                     Normal      Anomalous
Normal               TP          FP
Anomalous            FN          TN

Confusion matrix with two-class classification result

Key: TN = True Negative, TP = True Positive, FN = False Negative, and FP = False
Positive

Supervised Machine Learning (ML) has several ways of evaluating the performance of
classifiers [25]. The following performance measures, computed from the confusion
matrix, are used in this study. Accuracy is the first one and is widely used to check
the performance of a model. It is the percentage of test set tuples that are correctly
classified [42].

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Detection Rate (DR): The detection rate, also referred to as the true positive rate, is
the proportion of positive tuples that are correctly identified.

DR = TP / (TP + FN)    (2)

False Alarm Rate (FAR): The false alarm rate is the false positive rate, that is, the
proportion of negative tuples that are incorrectly identified as positive.

FAR = FP / (FP + TN)    (3)
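
Equations (1) to (3) can be checked with a small worked example. The counts below are
invented purely for illustration and are not results of this study.

```python
def evaluate(tp, tn, fp, fn):
    """Compute accuracy, detection rate and false alarm rate from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),   # equation (1)
        "detection_rate": tp / (tp + fn),              # equation (2)
        "false_alarm_rate": fp / (fp + tn),            # equation (3)
    }


# Hypothetical confusion matrix counts.
metrics = evaluate(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For these counts the accuracy is 170/200 = 0.85, the detection rate is 90/100 = 0.90
and the false alarm rate is 20/100 = 0.20.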

3.6. Work Flow
This subsection describes the overall work flow and the work flow of model selection.
First, the existing WoredaNet network security system is surveyed using structured
questionnaires, and security problems are identified from the existing WoredaNet data
centers. Finally, a technical solution is selected using machine learning.

The model consists of four phases: the data collection phase, the data preprocessing
phase, the feature selection phase and the analysis phase. In the first phase data is
collected, and in the second phase it is preprocessed so that it becomes usable. In the
third phase the relevant features are selected by the Wrapper-Best First technique and
evaluated by 10-fold cross-validation. In the final phase, the four algorithms (Naive
Bayes, Instance-Based Learning, Decision Tree and Support Vector Machine) are trained
on the training dataset and then evaluated on the test dataset using the selected
features.

The selected model implementation has the following steps:

1. Input dataset D
2. Preprocess the dataset D
   - Re-sample the dataset
   - Discretization
3. Select features with optimum information using a selected method
   - Subset generation
   - Subset evaluation
   - Subset validation using 10-fold cross-validation
   - Set the dataset with selected features
4. Apply the selected algorithm on the dataset with selected features
5. Evaluate the model performance
6. Output: classify the dataset as ‘Normal’ or ‘Attack’
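
The discretization sub-step of step 2 converts numeric attributes into a small number
of intervals. The text does not state which discretization method is used, so the
sketch below assumes simple equal-width binning, and the sample values are made up.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    # The maximum value is placed in the last bin rather than one past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical values of a numeric attribute such as connection duration.
durations = [0, 5, 10, 50, 100]
bins = equal_width_bins(durations, n_bins=4)
print(bins)
```

With four bins over the range 0 to 100, each interval is 25 units wide, so the three
smallest values share the first bin.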

Figure 4: Work Flow of Model Selection

3.7. Work break down (WBD)

WBS     Task Name                        Starting date   Ending date    Duration    WBS Predecessors
0       ISM - Research Project           Dec 01/2021     Jul 10/2022    220 days
1       Research Objective Framing       Dec 01/2021     Dec 20/2021    20 days
2       Literature Review                Dec 21/2021     Feb 30/2022    70 days
2.1     Collecting previous literature   Dec 21/2021     Jan 15/2022    25 days     1
2.2     Review of literature             Jan 16/2022     Feb 05/2022    20 days     2.1
2.3     Key Points Analysis              Feb 06/2022     Feb 20/2022    15 days     2.2
2.4     Hypothesis Planning              Feb 21/2022     Feb 30/2022    10 days     2.3
3       Research Methodology             Mar 01/2022     May 20/2022    80 days
3.1     Qualitative Analysis             Mar 01/2022     Mar 22/2022    22 days     2.4
3.2     Quantitative Analysis            Mar 23/2022     Apr 12/2022    20 days     3.1
3.2.1   Questionnaire Preparation        Apr 13/2022     Apr 27/2022    15 days
3.2.2   Conducting Survey                Apr 28/2022     May 10/2022    13 days     3.2.1
3.2.3   Collating Survey results         May 11/2022     May 20/2022    10 days     3.2.2
4       Data Analysis                    May 21/2022     Jun 05/2022    15 days     3.2.3
5       System Proposal                  Jun 06/2022     Jun 25/2022    20 days     4
6       New System Evaluation            Jun 26/2022     Jul 10/2022    15 days     5

Table 1: work break down

3.7.1. Gantt chart for proposed thesis

Figure5: Gantt chart

3.8. Budget estimating for proposed thesis
Financial (budget) requirements of the proposed research

No   Item                        Estimated    Estimated unit    Total price   Remark
                                 quantity     price in Birr     in Birr
1    Local travel                             5,000             5,000
     Data collecting                          3,000             3,000
2    Transportation                           3,200             3,200
3    For internet access                                        2,000
4    Field subsistence                                          4,500
5    External HDD 1TB            1            4,000             4,000         To store file & for backup
6    MATLAB software training    1            120               1,800
7    USB flash 64 GB             1            400               400           For transferring file
8    Consumable                  Different type                 2,000
9    Printing                    2            400               800
11   Total                                                      26,700
12   Contingency (5%)                                           1,335
13   Grand Total                                                28,035

Table 2: Budget estimating for proposed thesis

Ethical Considerations
This research work guarantees the quality and integrity of the study. Security issues
are also addressed with standard software and hardware implementations, with due regard
for privacy. The security audit of organizations is conducted based on their security
solutions, policies, standards, processes and procedures that exist in the different
security management domains as defined by the International Organization for
Standardization (ISO).

Reference
[1] C. M. and W. C, “A Study of Intrusion Detection System Based on Data Mining,”
IEEE, 978-1-4244-6943-7/10, 2010.

[2] V. Das, V. Pathak, S. Sharma, Sreevathsan, M. Srikanth, and T. Gireesh Kumar,
“Network Intrusion Detection System Based On Machine Learning Algorithms,”
Int. J. Comput. Sci. Inf. Technol., vol. 2, no. 6, pp. 138–151, 2010, doi:
10.5121/ijcsit.2010.2613.

[3] D. Schneider, “The state of network security,” Netw. Secur., vol. 2012, no. 2, pp.
14–20, 2012, doi: 10.1016/S1353-4858(12)70016-8.

[4] L. Lessa, M. Belachew, and S. Anteneh, “Sustainability of E-Government project
Success: Cases from Ethiopia,” 2011.

[5] L. Lessa et al., “Acceptance of WoredaNet E-Government Services in Ethiopia:
Applying the UTAUT Model,” 2011.

[6] H. J. Schumacher and S. Ghosh, “Network Security Framework,” Wiley Encycl.
Electr. Electron. Eng., vol. 6, no. 7, pp. 151–157, 1999, doi:
10.1002/047134608x.w5338.

[7] Wikipedia, the free encyclopedia, “Intrusion Prevention System.”

[8] S. R. and A. Malathi, “A Detailed Analysis on NSL-KDD dataset with various
Machine Learning Techniques for Intrusion Detection.”

[9] L. D. and S. P. Shantharajah, “A Study on NSL-KDD Dataset for Intrusion
Detection System Based on Classification Algorithms.”

[10] R. S. B. et al., “Mutual Information-Based Intrusion Detection System Using
Multilayer Neural Network,” 2019.

[11] P. Diwan and R. C. Jain, “A Combined Approach for Intrusion Detection System
Based on the Data Mining Techniques,” Int. J. Comput. Eng. Res., vol. 04, no. 6,
pp. 2250–3005, 2014, [Online]. Available: www.ijceronline.com.

[12] T. D. Akal, “Constructing a Supervised Model for Network Intrusion Detection,”
vol. 9, no. 1, pp. 50–67, 2016.

[13] Z. S. Stefanova, “Machine Learning Methods for Network Intrusion Detection
and Intrusion Prevention Systems,” ProQuest Diss. Theses, p. 106, 2018,
[Online]. Available:

[14] C. Endorf et al., “Intrusion Detection & Prevention,” 2004.

[15] J. Kizza and F. Migga Kizza, “Intrusion Detection and Prevention Systems,”
Secur. Inf. Infrastruct., pp. 239–258, 2011, doi: 10.4018/978-1-59904-379-1.ch012.

[16] C. Douligeris and A. Mitrokotsa, “DDoS attacks and defense mechanisms: A
classification,” Proc. 3rd IEEE Int. Symp. Signal Process. Inf. Technol. ISSPIT
2003, no. June 2014, pp. 190–193, 2003, doi: 10.1109/ISSPIT.2003.1341092.

[17] J. Burton, “Cisco security professional’s guide to secure intrusion detection
systems,” 2003.

[18] O. S. Guide, ( ISC ). .

[19] วัน ชัย วัฒ นศัพ ท์, “No Title การมีส ่ว นร่ว มกับ การพัฒ นาองค์ก ร บรรยายในการ
สัมมนานายจ้างและลูกจ้า งภาครัฐวิสาหกิจ เรื่อง ระบบทวิภาคีกับการแก้ไขปั ญหา
แรงงานในรัฐวิสาหกิจ : กอรัฐวิสาหกิจสัมพันธ์ กรมสวัสดิการและคุ้มครองแรงงาน
กระทรวงแรงงาน,” 2546.

[20] M. A. M. Hasan, M. Nasser, B. Pal, and S. Ahmad, “Support Vector Machine
and Random Forest Modeling for Intrusion Detection System (IDS),” J. Intell.
Learn. Syst. Appl., vol. 06, no. 01, pp. 45–52, 2014, doi:
10.4236/jilsa.2014.61005.

[21] E. M. Knox and R. T. Ng, “Die Arbeiterfürsorge und die Aerzte,” Dtsch.
Medizinische Wochenschrift, vol. 17, no. 49, pp. 1341–1342, 1891, doi:
10.1055/s-0029-1206900.

[22] A. Sakha, “Cyber-Forensic Log Analysis,” unpublished, 2008.

[23] N. Kiatwonghong and S. Maneewongvatana, “Intelli-Log : A Real-time Log
Analyzer,” pp. 0–5, 2010.

[24] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “Proceedings of the 2014
7th IEEE Symposium on Computational Intelligence for Security and Defense
Applications, CISDA 2014,” Proc. 2014 7th IEEE Symp. Comput. Intell. Secur.
Def. Appl. CISDA 2014, no. Cisda, p. 164p, 2015.

[25] S. S. Kaushik and P. R. Deshmukh, “Detection of Attacks in an Intrusion
Detection System,” Int. J. Comput. Sci. Inf. Technol., vol. 2, no. 3, pp. 982–986,
2011, [Online]. Available:
https://pdfs.semanticscholar.org/20f0/adc524e835d921c631e8d778f656e6cdeb6b

[26] J. Han, M. Kamber, F. Berzal, and N. Marín, “Data Mining: Concepts and
Techniques,” SIGMOD Rec., vol. 31, no. 2, pp. 66–68, 2002, doi:
10.1145/565117.565130.

[27] T. Santhanam, “An Empirical Comparison of Ensemble and Hybrid
Classification,” 2014.

[28] A. Sakha, “Cyber-Forensic Log Analysis,” Unpublished Doctoral dissertation,
Concordia University, 2008.

[29] N. Kiatwonghong and S. Maneewongvatana, “Intelli-log: A Real-time Log
Analyzer,” 2nd International Conference on Education Technology and Computer
(ICETC), pp. 22-24, June 2010.

[30] T. D. Akal, “Constructing a Supervised Model for Network Intrusion
Detection,” vol. 9, no. 1, March 2016.

[31] Z. S. Stefanova, “Machine Learning Methods for Network Intrusion Detection
and Intrusion Prevention Systems,” International Journal of Computer
Applications (0975 – 8887), vol. 106, no. 18, November 2014.
