Ftawu Tekola Proposal After DEfeince2 - For Merge
Proposal title: “Improve Intrusion detection model using machine learning: the case of
Ethiopian WoredaNet network”
March 2022
Declaration
I, Fitawu Tekola, the undersigned, declare that this thesis entitled “Improve Intrusion
detection model using machine learning: the case of Ethiopian WoredaNet network” is
my original work. I have undertaken the research work independently with the guidance and
support of the research advisor. This study has not been submitted for any degree or diploma
program in this or any other institution, and all sources of material used for the thesis
have been duly acknowledged.
Declared by
Name___________________________
Signature: ____________________
Department ____________________
Date __________________________
Name: <Fitawu Tekola Mekuria>
Title: <Improve Intrusion detection model using machine learning: the case of
Ethiopian WoredaNet network>
1 | Problem statement: general problem, not related to case area | With the rapid growth of the Internet and the ever-increasing security problems associated with its popularity, the need for protection against unwanted intruders has become very important. So intrusion detection system widely applied in different organization. | 3
22_23
3 | Sample attribute not mentioned | The dataset is checked for the consistency of individual attribute values and types, quantity and missing values, using WEKA preprocessing panel and Microsoft Excel. | 18
Keywords: Wrapper, Machine Learning, Accuracy, Detection Rate, False alarm rate
Table of Contents
Declaration
Certification
Abstract
List of Figures
CHAPTER ONE
1. INTRODUCTION
1.1. Background
1.2. Problem Statement
1.3. Objective
1.3.1. General Objective
1.3.2. Specific Objectives
1.4. Research Questions
1.5. Motivation
1.6. Hypothesis of the Study
1.7. Scope and Limitation of the Study
1.8. Limitations
1.9. Significance of the Study
1.10. Organization of the Thesis
CHAPTER TWO
2. Review of Related Literature
2.1. Theoretical Literature Review
2.2. Empirical Literature Review
2.3. Intrusion Detection System Technologies
2.4. Classification of IDS based on Analysis type
2.4.1. Signature-Based Detection
2.4.2. Anomaly-Based Detection
2.5. Data mining and Intrusion Detection
2.5.1. Clustering
2.6. Classification
2.7. Conceptual Framework
CHAPTER THREE
3. Research methodology (materials and methods)
3.1. Methodology
3.2. Materials
3.2.1. Software used
3.2.2. Hardware used
3.3. Data Collection
3.3.1. Data Preparation
3.3.2. NSL-KDD Data Set
3.4. Methods of Training and Testing
3.5. Methods of Analysis and Evaluation of System Performance
3.5.1. Confusion Matrix
3.6. Work Flow
3.7. Work break down (WBD)
3.7.1. Gantt chart for proposed thesis
3.8. Budget estimating for proposed thesis
Ethical Considerations
References
List of Figures
Figure 2: Feature selection method
Figure 3: 10-fold cross validation
Figure 4: Work Flow of Model Selection
Figure 5: Gantt chart
List of Tables
Table 1: Work break down
Table 2: Budget estimating for proposed thesis
CHAPTER ONE
1. INTRODUCTION
1.1. Background
Network and system security is of paramount importance in the present data
communication environment and has become an essential component of Information and
Communication Technology (ICT) infrastructure, protecting intangible as well as
tangible assets (i.e. software and hardware) in the public and private domains. With the
widespread use of information technology applications, organizations are
becoming more prone to security threats against their resources, no matter how
strict their security policies and mechanisms are. As networks play a more
important role in modern society, crimes committed over computer networks are also
increasing and accelerating. By gaining unauthorized access, hackers and intruders can make
many successful attempts to crash networks and web services. Establishing a safe and
strong network system that guarantees the security of information therefore
requires a research focus [1], and new threats and the associated solutions to prevent
them are emerging together with the evolution of secured systems [2].
Nowadays, network security devices are equipped with one or more security services,
including Firewall, Intrusion Prevention/Detection Systems (IPS/IDS), and Data Leak
Prevention (DLP), and also provide content security filtering functions such as anti-spam,
anti-virus or URL filtering. These services and functions have increasingly been
integrated into Unified Threat Management (UTM) systems or Next Generation (NG)
firewalls. Security devices need continuous testing to guarantee that they are
efficient, precise and useful, while simultaneously maintaining acceptable performance.
Network security is a critical concern for enterprises, government agencies and
organizations of all sizes [3].
WoredaNet is a Wide Area Network intended to link all woreda administrative units in
Ethiopia. It aims to build a terrestrial and satellite-based network connecting the lowest
levels of government.
The WoredaNet ICT network, shown in figure 1, connects all public sectors in the
country so that they can obtain e-services from the National Data Center. Based on
this initiative, there is one National Data Center and nine Regional Data Centers in
the country. The National Data Center is connected to the Regional Data Centers and
to all woredas that have the WoredaNet ICT network. Over 700 woredas
(districts) are interconnected [5].
According to [4,5], several government applications for e-services have already been put
in place (or are planned) through the Woreda Information Systems (WIS).
Thirteen ministries and nineteen agencies/offices are to be connected to the
WoredaNet. They will have independent servers locally and at the national data center,
which can be used for video conferencing, hosting, applications, directory service, email,
voice over IP, internet service, electricity bill information, exchange rate
information, national exam student result information, etc.
With the rapid growth of the Internet and the ever-increasing security problems
associated with its popularity, the need for protection against unwanted intruders has
become very important [7]. Intrusion detection systems are therefore widely applied in
different organizations.
By gaining unauthorized access, hackers and intruders can make many successful
attempts to crash networks and web services.
Misuse and denial-of-service problems persist. Some of the reasons are
the absence of proper, continual, and consistent security mechanisms, which allows
attacks and threats to occur.
1.2. Objective
To assess the existing security trends and close security holes inside the data centers
on the WoredaNet network.
To review literature on the concept of intrusion detection systems in the area of data
mining.
To improve the security model used to analyze illegitimate activities in the WoredaNet network.
To define system and user requirements.
To recognize the need for security systems that protect computer networks and
reduce the risk of compromising their information.
What is a possible intrusion detection model for the WoredaNet network regional
state data centers?
Which detection technique is the best to use as a model to enhance data center
security?
How can a machine learning algorithm be customized to improve the IDS?
1.4. Motivation
Security issues have been addressed with standard software and hardware
implementations. Despite all these efforts, security remains an issue in
the WoredaNet network data centers, including the national data center, which is the
gateway of the WoredaNet network. These scenarios motivated me to work on and
contribute to the WoredaNet network security issue.
This work does not look at other components of an Intrusion Detection System (IDS), such as
data collection and response, since that is a very wide area of research.
The NSL-KDD dataset was collected for research purposes and used in a simulated
environment, so it might not be a perfect representation of real threats. It nevertheless shows
the effectiveness of anomaly-based intrusion detection systems in overcoming the
weaknesses of signature-based intrusion detection systems.
1.7. Limitations
Limitations are boundaries that restrict what the study can achieve.
One limitation of this research is that the tools used for implementation are not fully free.
Several major local languages are widely used in Ethiopia, and they vary across the
regions.
This work uses structured questionnaires, interviews, and observation at the WoredaNet
network data centers to find and suggest security requirements as well as to suggest
an intrusion detection system (IDS) security model solution. More time is also needed to
complete the thesis and to achieve the expected objectives.
Network security is a high-priority concern for companies and home users. Because they
keep sensitive information on their computers, there is a great need to protect that
information from unauthorized access. Even though the conventional approach to securing
computer systems against cyber threats (which uses firewalls, authentication tools, and
virtual private networks) creates a protective shield, it almost always has vulnerabilities.
Therefore, something must be done to improve information security efficiency.
Researcher
The research work has an explicit significance in developing the knowledge of the
researcher and serves as a benchmark for interested researchers who explore the issues in
the area.
Regional office
Network security is a high-priority concern for the Ethiopian WoredaNet network because
sensitive information is kept in its data centers, and there is a great need to protect that
information from unauthorized access. Even though the conventional approach to securing
data center systems against cyber threats creates a protective shield, it almost always
has vulnerabilities. Therefore, something must be done to improve information security
efficiency, and this work is intended to improve the intrusion detection system for the
Ethiopian WoredaNet network.
Government
If WoredaNet network security is poor, users will not have confidence in the
agencies that rely on it. By improving the security of WoredaNet, this study can help the
federal and regional governments retain the confidence of WoredaNet users, so the
government is one of the beneficiaries of this study.
CHAPTER TWO
Intrusion detection systems (IDS) are network security appliances that monitor network
and/or system activities for malicious activity. An IDS can be any device or software that
exercises access control to protect computers from illegal exploitation. "Intrusion
prevention" technology is considered by some to be an extension of intrusion detection
(ID) technology, but it is actually another form of access control, like an application
layer firewall [7].
multi-layer neural network. In this work, a simplified mutual information-based feature
selection algorithm selects the next features by considering only the most recently selected
feature. The multi-layer neural network is trained on the selected features, and it
achieved 99.34% accuracy when the selected feature count reached 19. The author
claimed that the accuracy of the proposed model is higher than that of the existing proposals
considered. The author also stated that the proposed work reduced the computational
resources and time needed to train the model.
The IDS models in this study are developed on the full training Network Security
Laboratory - Knowledge Discovery in Databases (NSL-KDD) dataset using WEKA, a powerful
machine learning and data mining tool. The data mining model used in this
study is the KDD process. The KDD process refers to the whole process of turning
low-level data into high-level knowledge through the automated discovery of patterns and
relationships in large databases, and data mining (DM) is one of the core steps in the KDD
process. The goal of KDD and DM is to find interesting patterns and/or models that
exist in databases but are hidden among the volumes of data (Fayyad et al., 1996). The
KDD process as described by Fayyad et al. (1996) consists of five major phases. Data
were collected using appropriate algorithms then mined patterns were modeled.
The proposed model would offer the advantage of considering those unlabeled records.
In this case, only the few most confident data points were labeled, leaving the class
of the remaining records empty. Semi-supervised learning is more suitable for intrusion
detection because it requires a small quantity of labeled data while still taking
advantage of the large quantities of unlabeled data. Both the J48 decision tree algorithm
and the Naïve Bayes simple algorithm have been tested as classification approaches for
building a predictive model for intrusion detection.
The result of the study has shown that the J48 decision tree algorithm with cross-
validation test mode and other default values is appropriate in the area of intrusion
detection.
with firewall-rule-like settings, allowing them to identify network traffic that violates
the organization’s security or acceptable use policies. Also, some IDSs can monitor file
transfers and identify ones that might be suspicious, such as copying a large database
onto a user’s laptop.
Many IDSs can also identify reconnaissance activity, which may indicate that an attack is
forthcoming [16]. For example, some attack tools and forms of malware, particularly
worms, perform reconnaissance activities such as host and port scans to identify targets
for subsequent attacks. An IDS might be able to block reconnaissance and notify
security administrators, who can take action if needed to alter other security controls to
prevent related incidents. Because reconnaissance activity is so frequent on the Internet,
reconnaissance detection is often performed primarily on protected internal networks. In
addition to identifying incidents and supporting incident response efforts, organizations
have found other uses for IDSs, including identifying security policy
problems, documenting the existing threat to an organization, and deterring individuals
from violating security policies. Because of the increasing dependence on information
systems and the prevalence and potential impact of intrusions against those systems,
IDSs have become a necessary addition to the security infrastructure of nearly every
organization [15].
traffic and then proceed to analyze it [17]
recognition detect unusual behavior. Thus they have the ability to detect attacks for
which they have no specific knowledge. Anomaly detectors also produce information
that is very useful for defining new patterns for signature detection. The disadvantages of
anomaly-based detection are that it produces a high number of
false alarms due to the unpredictable behavior of users and networks, and that it requires
extensive training to characterize patterns of normal behavior [18].
2.5.1. Clustering
Because the number of available network audit data instances is large and human labeling is
time-consuming and expensive, a clustering technique is needed to help process,
label, and assign data to groups. Clustering algorithms can group new data
instances into similar groups. These groups can be used to increase the performance of
existing classifiers. High-quality clusters can also assist human experts with labeling.
Clustering discovers complex intrusions that occur over extended periods of time and in
different spaces by correlating independent network events [19]. The sets of data belonging
to a cluster are modeled according to pre-defined metrics and their common
features [20]. It is used to detect hybrids of attacks in the cluster.
The steps involved in identifying intrusions using clustering techniques are as follows:
1. Find the largest cluster, i.e., the one with the largest number of instances, and
label it normal.
2. Sort the remaining clusters in ascending order of their distances to the largest
cluster.
3. Select the first K1 clusters so that the number of data instances in these clusters
sums to approximately $\eta N$, and label them as normal, where $\eta$ is the percentage
of normal instances and $N$ is the total number of instances.
4. Label all the other clusters as attacks.
Unlike traditional anomaly detection methods, this approach clusters data instances that contain
both normal behaviors and attacks, using a modified incremental k-means algorithm.
After clustering, heuristics are used to automatically label each cluster as either normal
or attack. The self-labeled clusters are then used to detect attacks in a separate test
dataset [19].
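The four labeling steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm of [19]: the function name `label_clusters`, the use of Euclidean distance between centroids, the choice to count the largest cluster toward the normal quota, and the rule of stopping before the quota is exceeded are all assumptions made for the example. The clustering itself (for instance k-means) is assumed to have been done already.

```python
import math

def label_clusters(centroids, sizes, eta):
    """Label each cluster 'normal' or 'attack' using the four-step heuristic.

    centroids: list of coordinate tuples, one per cluster (assumed input)
    sizes:     number of data instances in each cluster
    eta:       assumed fraction of normal instances, between 0 and 1
    """
    total = sum(sizes)
    # Step 1: the largest cluster is labeled normal.
    largest = max(range(len(sizes)), key=lambda i: sizes[i])
    labels = ["attack"] * len(sizes)  # Step 4: everything else defaults to attack.
    labels[largest] = "normal"
    # Step 2: sort the remaining clusters by distance to the largest cluster.
    others = sorted(
        (i for i in range(len(sizes)) if i != largest),
        key=lambda i: math.dist(centroids[i], centroids[largest]),
    )
    # Step 3: label the nearest clusters normal until about eta * N instances
    # are covered (assumption: stop before the quota would be exceeded).
    covered = sizes[largest]
    quota = eta * total
    for i in others:
        if covered + sizes[i] > quota:
            break
        labels[i] = "normal"
        covered += sizes[i]
    return labels

# Hypothetical example: four clusters, 80% of traffic assumed normal.
demo_labels = label_clusters(
    centroids=[(0, 0), (1, 0), (5, 5), (9, 9)],
    sizes=[60, 20, 15, 5],
    eta=0.8,
)
print(demo_labels)
```

With these made-up inputs, the two clusters nearest the origin cover 80 of the 100 instances and are labeled normal, while the two distant clusters are labeled attack.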
2.6. Classification
An intrusion detection system that classifies audit data as normal or anomalous based
on a set of rules, patterns or other affiliated techniques can be broadly defined as a
classification-based intrusion detection system.
Data classification for intrusion detection can be achieved by the following basic steps
[27]:
2. The next step is to scan each of the intrusion traces. For each sequence of system
calls, first look it up in the normal list. If an exact match can be found, the
sequence is labeled as normal; otherwise it is labeled as abnormal.
3. Then ensure that the normal traces include nearly all possible normal short
sequences of system calls. The intrusion trace contains many normal
sequences in addition to the abnormal sequences, since the illegal activities
only occur in some places within a trace.
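The lookup described in step 2 can be illustrated with a short Python sketch. The way the normal list is built here (a sliding window of fixed length over the normal traces) and the window length itself are assumptions made for the example, because the step that constructs the normal list is not shown above; the function names and the system-call names are hypothetical.

```python
def sliding_windows(trace, n):
    """Return every length-n run of consecutive system calls in a trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]

def label_intrusion_trace(normal_traces, intrusion_trace, n):
    """Label each short sequence of an intrusion trace as normal or abnormal."""
    # Assumed construction of the normal list: all windows seen in normal traces.
    normal_list = set()
    for trace in normal_traces:
        normal_list.update(sliding_windows(trace, n))
    # Step 2: an exact match in the normal list means normal, otherwise abnormal.
    return [
        (seq, "normal" if seq in normal_list else "abnormal")
        for seq in sliding_windows(intrusion_trace, n)
    ]

# Hypothetical traces of system-call names.
normal = [["open", "read", "mmap", "close"]]
suspect = ["open", "read", "mmap", "exec", "close"]
for seq, label in label_intrusion_trace(normal, suspect, n=3):
    print(seq, label)
```

As step 3 notes, most windows of an intrusion trace can still match the normal list; only the windows that touch the illegal activity (here the ones containing `exec`) come out as abnormal.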
Figure 2: Feature selection method
CHAPTER THREE
3.1. Methodology
Both qualitative and quantitative methods will be used to conduct the study. The National
Data Center (NDC) and the Regional Data Centers will be covered by the study.
Application software and system software that run on the WoredaNet VPN will also be
examined. The National Data Center, some regional bureaus, and sample woredas that
are connected to the WoredaNet VPN are also part of the study.
The methodology of this research work is divided into two phases in order to achieve
the general and specific objectives.
1. An Issue-Based Analysis Technique is used for surveying the existing security issues in
the WoredaNet network by using structured questionnaires, structured open
interviews and observation.
selected features.
3.2. Materials
WEKA tool. For the feature selection analysis, Ten-Fold Cross Validation Evaluation
technique is used.
The NSL-KDD data set has four categories of attack classes. These are:
DoS: the attacker denies legitimate users access to computing resources
or overloads them so that requests cannot be processed in real time. The result of
this attack is the unavailability of resources, i.e. resources are too busy or too full
to serve legitimate networking requests, and users are therefore denied access to a
machine.
type, by which an attacker uses a normal account to log in to a victim system
and tries to gain root/administrator privileges by exploiting some vulnerability
in the victim, e.g. buffer overflow attacks. Relevant features: “number of file
creations” and “number of shell prompts invoked”.
3. R2L: Unauthorized access from a remote machine. The attacker intrudes into a
remote machine and gains local access to the victim machine, e.g. by password
guessing. Relevant features: network-level features (“duration of connection”
and “service requested”) and host-level features (“number of failed login
attempts”) [21] [22].
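When preparing the data, individual attack labels are usually collapsed into these categories before training. The Python sketch below shows one way to do that. The mapping covers only a subset of the attack names found in the publicly documented KDD Cup 99 training labels, the Probe category is included on that basis even though it is not described in the text above, and the names `ATTACK_CATEGORY` and `categorize` are hypothetical.

```python
# Partial, illustrative mapping from attack label to category.
# Assumption: label spellings follow the public KDD Cup 99 / NSL-KDD files.
ATTACK_CATEGORY = {
    # Denial of service
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "teardrop": "DoS", "pod": "DoS", "land": "DoS",
    # Probing / scanning
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # User to root
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R", "perl": "U2R",
    # Remote to local
    "guess_passwd": "R2L", "ftp_write": "R2L", "imap": "R2L", "phf": "R2L",
    "multihop": "R2L", "warezmaster": "R2L", "warezclient": "R2L", "spy": "R2L",
}

def categorize(label):
    """Map a record's class label to normal, one of the four categories, or unknown."""
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")

for name in ["normal", "neptune", "guess_passwd", "buffer_overflow", "not_a_label"]:
    print(name, "->", categorize(name))
```

Returning "unknown" for unmapped labels makes gaps in the mapping visible instead of silently assigning a category.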
Figure 3: 10-fold cross validation
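The splitting scheme behind ten-fold cross validation can be sketched as follows. This is a minimal illustration, not the procedure WEKA runs internally: the function name `k_fold_indices` is hypothetical, and the sketch neither shuffles nor stratifies the records, which a real evaluation would normally want.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Each record is used for testing exactly once and for training k - 1 times.
folds = list(k_fold_indices(25, k=10))
for number, (train, test) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} training records, {len(test)} test records")
```

The performance reported for a classifier is then the average of the metric over the ten test folds.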
along the diagonal of the confusion matrix, with the rest of the entries being close to
zero [24]. From the confusion matrix, classifier evaluation metrics such as accuracy,
detection rate, false alarm rate, precision, recall, and F-measure are computed. Table 7
shows a simple two-class confusion matrix that contains both the predicted and
actual classes of normal and anomalous traffic in the dataset.
Key: TN= True Negative, TP =True Positive, FN =False Negative, and FP =False
Positive
Supervised Machine Learning (ML) has several ways of evaluating the performance of
classifiers [25]. The performance measures computed from the confusion matrix
in this study are described below. Accuracy is the first one, and it is widely
used to check the performance of a model. It is the percentage of test set tuples that
are correctly classified [42].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
Detection Rate (DR): the detection rate is also referred to as the true positive rate, i.e., the
proportion of positive tuples that are correctly identified.
$$DR = \frac{TP}{TP + FN} \tag{2}$$
False Alarm Rate (FAR): the false alarm rate is the false positive rate, i.e., the
proportion of negative tuples that are incorrectly identified as positive.
$$FAR = \frac{FP}{FP + TN} \tag{3}$$
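As a worked illustration of equations (1) to (3), the Python sketch below computes the measures from the four confusion matrix counts. The counts in the example are invented for illustration and are not results of this study. Precision and F-measure use their standard textbook definitions, which are an addition here since the text above names them without giving formulas; recall is the same quantity as the detection rate.

```python
def ids_metrics(tp, tn, fp, fn):
    """Compute evaluation measures from two-class confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # equation (1)
    detection_rate = tp / (tp + fn)              # equation (2), also called recall
    false_alarm_rate = fp / (fp + tn)            # equation (3)
    precision = tp / (tp + fp)                   # standard definition (assumed)
    f_measure = (2 * precision * detection_rate
                 / (precision + detection_rate))  # harmonic mean (assumed)
    return {
        "accuracy": accuracy,
        "detection_rate": detection_rate,
        "false_alarm_rate": false_alarm_rate,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts: 100 attack records and 100 normal records.
metrics = ids_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these made-up counts the accuracy is 170/200 = 0.85, the detection rate is 90/100 = 0.90, and the false alarm rate is 20/100 = 0.20, which shows that a model can detect most attacks and still raise false alarms on a fifth of normal traffic.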
3.6. Work Flow
This subsection describes the overall work flow and the work flow of model
selection. In the process, the existing WoredaNet network security system is first
surveyed using structured questionnaires, and security problems are then identified in
the existing WoredaNet data centers. Finally, a technical solution is selected using
machine learning.
The model consists of four phases: a data collection phase, a data preprocessing phase,
a feature selection phase and an analysis phase. In the first phase, data is collected; it is
preprocessed and becomes usable in the second phase. In the third phase, the
relevant features are selected by the Wrapper-Best First technique with the Decision Tree
and Support Vector Machine algorithms and evaluated by 10-fold cross validation. In the final
phase, the four algorithms, including Decision Tree and Support Vector Machine, are
trained on the training dataset and then evaluated on the test dataset using the selected features.
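The idea of wrapper feature selection in the third phase can be sketched as follows. This is a simplified illustration and not WEKA's implementation: it uses a greedy forward search as a simpler stand-in for best-first search, and the function name `forward_wrapper_selection` is hypothetical. In a real wrapper the `evaluate` callable would return the cross-validated accuracy of the chosen classifier (for example a Decision Tree or SVM) trained on the candidate subset; here a toy scoring function with made-up gains stands in for it.

```python
def forward_wrapper_selection(features, evaluate):
    """Greedily add the feature that most improves the evaluation score."""
    selected = []
    best_score = evaluate(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score every subset obtained by adding one more feature.
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # No candidate improves the score, so stop searching.
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score

# Toy stand-in for classifier accuracy: hypothetical per-feature gains
# minus a fixed penalty per feature used.
GAIN = {"duration": 0.30, "service": 0.25, "src_bytes": 0.10, "flag": 0.0}

def toy_score(subset):
    return 0.5 + sum(GAIN[f] for f in subset) - 0.12 * len(subset)

chosen, score = forward_wrapper_selection(list(GAIN), toy_score)
print(chosen, round(score, 2))
```

Unlike this greedy version, a best-first search keeps a list of previously evaluated subsets and can backtrack to them when the current path stops improving, which makes it less likely to stop at a poor local optimum.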
Figure 4: Work Flow of Model Selection (steps shown in the figure: 1. input dataset D; discretization; 3. selecting features with optimum information using a selected method; subset generation; subset evaluation)
2.3 | Key Points Analysis | Feb 06/2022 | Feb 20/2022 | 15 days | 2.2
3.7.1. Gantt chart for proposed thesis
Figure 5: Gantt chart
3.8. Budget estimating for proposed thesis
Financial (budget) Requirements of the proposed Research
No | Item | Quantity | Estimated unit price in Birr | Estimated total price in Birr | Remark
1 | Local travel | | 5,000 | 5,000 |
  | Data collecting | | 3,000 | 3,000 |
2 | Transportation | | 3,200 | 3,200 |
3 | For internet access | | | 2,000 |
4 | Field subsistence | | | 4,500 |
5 | External HDD 1TB | 1 | 4,000 | 4,000 | To store file & for backup
6 | MATLAB software training | 1 | 120 | 1,800 |
7 | USB flash 64 GB | 1 | 400 | 400 | For transferring file
8 | Consumable | Different type | | 2,000 |
9 | Printing | 2 | 400 | 800 |
11 | Total | | | 26,700 |
12 | Contingency (5%) | | | 1,335 |
13 | Grand Total | | | 28,035 |
Ethical Considerations
This research work guarantees quality and integrity as well as privacy. The security issues
are also addressed with standard software and hardware implementations.
The security audit of organizations is conducted based on the security solutions, policies,
standards, processes and procedures that exist in the different security management
domains as defined by the International Organization for Standardization (ISO).
References
[1] C. M. and W. C., “A Study of Intrusion Detection System Based on Data Mining,”
Institute of Electrical and Electronics Engineers (IEEE), 978-1-4244-6943-7/10,
2010.
[3] D. Schneider, “The state of network security,” Netw. Secur., vol. 2012, no. 2, pp.
14–20, 2012, doi: 10.1016/S1353-4858(12)70016-8.
[11] P. Diwan and R. C. Jain, “A Combined Approach for Intrusion Detection System
Based on the Data Mining Techniques,” Int. J. Comput. Eng. Res., vol.
04, no. 6, pp. 2250–3005, 2014, [Online]. Available: www.ijceronline.com.
[15] J. Kizza and F. Migga Kizza, “Intrusion Detection and Prevention Systems,”
Secur. Inf. Infrastruct., pp. 239–258, 2011, doi: 10.4018/978-1-59904-379-1.ch012.
[19] Wanchai Wattanasap, “Participation and organizational development: lecture at the
seminar for state enterprise employers and employees on the bipartite system and
solving labor problems in state enterprises, State Enterprise Relations Division,
Department of Labour Protection and Welfare, Ministry of Labour,” 2546 B.E. (2003).
[21] E. M. Knox and R. T. Ng, “Die Arbeiterfürsorge und die Aerzte,” Dtsch.
Medizinische Wochenschrift, vol. 17, no. 49, pp. 1341–1342, 1891, doi:
10.1055/s-0029-1206900.
[26] J. Han, M. Kamber, F. Berzal, and N. Marín, “Data Mining: Concepts and
Techniques,” SIGMOD Rec., vol. 31, no. 2, pp. 66–68, 2002, doi:
10.1145/565117.565130.