Network Intrusion Detection System Using Machine Learning With Data Preprocessing and Feature Extraction
https://doi.org/10.22214/ijraset.2022.44326
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VI June 2022- Available at www.ijraset.com
Abstract: Network Intrusion Detection (NID) discovers unauthorized access to a computer network by scanning the network
traffic for evidence of malicious activity. In this study, we concentrate on the technology, development, and strategic
importance that make up the large field of NID. Many new strategies have been created in the last few years to help computer
security specialists protect a single host or an entire network against unauthorized access, theft, and denial-of-service
attacks, which are the primary causes of computer crime. Intrusion detection is critical for both the military and commercial
sectors, since it is one of the most significant study areas for the information security of future networks. In this paper, a
model is proposed in which the data is preprocessed before the algorithms are trained. A comparison with other models shows
that the proposed model, built with Random Forest, can outperform existing models built with ANN when the data is
preprocessed. After data preprocessing and feature extraction, the model achieves 98.71% accuracy on the NSL-KDD dataset.
Keywords: Network intrusion detection (NID), random forest, multi-layer perceptron.
I. INTRODUCTION
Network security has become a major problem in today's world: unlawful activity on networks continues to rise at an alarming
rate, and the problem is neither hopeless nor fully addressed. There was a time when only firewalls were in place to protect
networks from cyberattacks. The Morris Worm launched the first internet-wide attack and penetration on November 2, 1988, and
new technologies have been created since then to protect networks from cyberattacks. Vendors used to ship the user name
and password along with the equipment in order to avoid the danger of a hacker gaining access. In the mid-1980s, Denning's work
on Intrusion Detection (ID) was published for the first time in print. Because intrusive activity differs from regular behavior,
Denning argues that the primary role of an IDS is to build models of normal behavior against which intrusive activity can be
identified. In computer networking, intrusion detection is a recent and essential research subject. A number of prototypes
have been created that use a variety of methodologies for commercial and military uses alike, and ID is one of the most popular
topics of discussion.
©IJRASET: All Rights are Reserved | SJ Impact Factor 7.538 | ISRA Journal Impact Factor 7.894 | 2360
ID relies on a mix of brute force and incursion tactics. An IDS (Intrusion Detection System) is now an essential aspect of network
security, since it guards the network against intrusions whether or not they succeed. Detecting and reporting any unexpected
system behavior is the primary goal of the IDS. An IDS can provide a real-time response to intrusion events after a thorough
investigation of the behavior and signature of the intrusion. In addition, an IDS is capable of:
• Monitoring system and user activity and verifying system errors.
• Assessing the integrity of system and data files.
• Keeping a running record of any unusual behavior.
• Mapping known attacks to a model so that they are recognized and raise alerts.
Some IDSs respond to an illegal intrusion by terminating the network connection, rather than just preventing the intrusion
from happening in the first place [2].
For an IDS, the following are the primary criteria for measuring performance.
• TRUE POSITIVE: an attack has taken place (IDS gives alarm).
• FALSE POSITIVE: an attack has not taken place (IDS gives alarm).
• FALSE NEGATIVE: an attack has taken place (IDS gives no alarm).
• TRUE NEGATIVE: an attack has not taken place (IDS gives no alarm).
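The four outcomes above can be tallied from labeled traffic and turned into summary rates. The following is a minimal sketch using toy data; the function names and the example lists are illustrative assumptions, not part of the paper's implementation.

```python
from collections import Counter

def confusion_counts(actual, alarmed):
    """Count TP/FP/FN/TN from parallel lists of booleans.

    actual[i]  -> True if record i really is an attack
    alarmed[i] -> True if the IDS raised an alarm for record i
    """
    counts = Counter()
    for is_attack, gave_alarm in zip(actual, alarmed):
        if is_attack and gave_alarm:
            counts["TP"] += 1
        elif not is_attack and gave_alarm:
            counts["FP"] += 1
        elif is_attack and not gave_alarm:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

def summary_rates(counts):
    """Derive accuracy, detection rate, and false alarm rate from the counts."""
    tp, fp, fn, tn = (counts[k] for k in ("TP", "FP", "FN", "TN"))
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy example (made-up labels, for illustration only).
actual = [True, True, True, False, False, False, False, True]
alarmed = [True, True, False, True, False, False, False, True]
counts = confusion_counts(actual, alarmed)
rates = summary_rates(counts)
print(dict(counts), rates)
```

Accuracy alone can hide a poor detector when attacks are rare, which is why the detection rate and false alarm rate are reported alongside it in this sketch.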
A. NSL-KDD Dataset
The NSL-KDD dataset covers attacks in 4 different categories, with 43 features for each record, of which 41 refer
directly to the network traffic input. This dataset is used as a standard dataset for network intrusion detection. The dataset
contains 39 different subclass types of attacks.
C. Use Of Data Mining And String Metrics For String Anomaly Detection
Cyber-attacks on computers and networks are becoming more frequent and more severe. In the modern world, intrusion detection is
a vital technology and an active research topic. This research presents an adaptive approach to anomaly-based intrusion detection
systems using data mining techniques and string metrics. Based on the findings of the simulation studies, the proposed method is
proven to generate trustworthy results while simultaneously monitoring and alerting the protected system.
B. Modules
1) Pre-processing of the data: normalize the data using the standard scaler, and use One-Hot-Encoding to convert
categorical data to numeric data.
2) Extraction of characteristics (features).
3) Find the attributes whose Pearson correlation with the encoded attack label attribute is higher than 0.5.
4) Combine the one-hot encoded and original attack labels.
5) Train the model (training data: 75%, testing data: 25%).
6) Train with MLP (Multilayer Perceptron algorithm).
7) Train with the Random Forest algorithm on the same data set.
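The preprocessing and feature-selection modules listed above can be sketched in plain Python as follows. This is a minimal illustration on made-up records, not the paper's code: the helper names, the toy values, the random seed, and the use of the absolute value of the correlation are all assumptions. The two training steps (MLP and Random Forest) would normally rely on a machine learning library and are omitted here.

```python
import math
import random

def standardize(column):
    """Z-score a numeric column (what a standard scaler does per feature)."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(var) or 1.0  # constant column: avoid division by zero
    return [(v - mean) / std for v in column]

def one_hot(column):
    """Turn a categorical column into one 0/1 indicator column per category."""
    return {cat: [1.0 if v == cat else 0.0 for v in column]
            for cat in sorted(set(column))}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(features, label, threshold=0.5):
    """Keep features whose |correlation| with the label exceeds the threshold.

    Assumption: the absolute value is used, so strongly negative
    correlations are kept as well.
    """
    return [name for name, col in features.items()
            if abs(pearson(col, label)) > threshold]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * (1 - test_fraction)))
    return rows[:cut], rows[cut:]

# Toy records (made-up values; real NSL-KDD records have 41 traffic features).
records = {
    "duration": [0, 2, 1, 30, 45, 38, 1, 50],
    "protocol_type": ["tcp", "udp", "tcp", "icmp", "icmp", "tcp", "udp", "icmp"],
    "attack": [0, 0, 0, 1, 1, 1, 0, 1],  # encoded attack label
}

features = {"duration": standardize(records["duration"])}
for cat, col in one_hot(records["protocol_type"]).items():
    features[f"protocol_type_{cat}"] = col

label = [float(v) for v in records["attack"]]
selected = select_features(features, label)

rows = list(zip(*(features[name] for name in selected), label))
train, test = train_test_split(rows, test_fraction=0.25)
print(selected, len(train), len(test))
```

Note that the correlation filter runs after one-hot encoding, so each category is kept or dropped independently of the others.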
An IDS based on signatures scans the network for malicious packets and compares them to a database of known signatures. By
checking each intrusion event against known attacks or signatures, it works similarly to a virus scanner [5]. While a
signature-based IDS is extremely effective at detecting known attacks, it does, like anti-virus software, require regular updates
to keep up with changes in hacker techniques. This strategy makes it simple to spot the most prevalent and well-known types of
attacks. However, one of the primary issues is how to construct a signature that encompasses all the modifications and variations
of an attack. If the attack pattern is fully unknown, this strategy will not work. Changing the attack signature is not an option.
An open source signature-based IDS known as SNORT has been used by numerous academics as a benchmark against which to compare
their own IDS. As the proposed system is built on Random Forest, which is a supervised learning algorithm, it does not need heavy
GPUs to run. Hence, the cost of the model is reduced.
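The signature-matching idea described above, and its weakness against variations of an attack, can be illustrated with a minimal sketch. The signature names and byte patterns below are hypothetical examples chosen for illustration; they are not taken from SNORT or any real rule set.

```python
# Hypothetical signature database: name -> byte pattern to look for.
SIGNATURES = {
    "sql-injection-probe": b"' OR '1'='1",
    "path-traversal": b"../../",
}

def match_signatures(payload: bytes, signatures=SIGNATURES):
    """Return the sorted names of all signatures found in the payload."""
    return sorted(name for name, pattern in signatures.items()
                  if pattern in payload)

# A known pattern is detected...
print(match_signatures(b"GET /../../etc/passwd HTTP/1.1"))
# ...but a URL-encoded variation of the same attack slips through.
print(match_signatures(b"GET /..%2f..%2fetc/passwd HTTP/1.1"))
```

The second call returns no match even though the request attempts the same traversal, which is the signature-coverage problem described in the paragraph above.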
IV. ALGORITHM
A. MLP
A multilayer perceptron (MLP) is a feedforward artificial neural network that produces a set of outputs from a set of inputs.
Its layers of nodes are connected as a directed graph between the input and output layers. To train the network, the MLP employs
backpropagation. Multiple layers of neurons are placed together to form an input layer, an output layer, and one or more hidden
layers. Neurons in a multilayer perceptron can employ any activation function, such as ReLU or sigmoid, unlike neurons in a
perceptron, which must use a threshold (step) activation function.
The layers other than the single input layer and the single output layer are the hidden layers. A multilayer perceptron, in
contrast to a single-layer perceptron, can learn non-linear functions as well as linear ones.
The perceptron is only capable of solving simple problems. It places a hyperplane in pattern space and moves it until the
error is minimized. This method works only if the problem is linearly separable.
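The difference between a single perceptron and an MLP can be seen on the XOR function, which is not linearly separable. The sketch below is illustrative only: the weights are picked by hand rather than learned by backpropagation, and all names are assumptions introduced for the example.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def step(x):
    """Threshold activation used by the classic perceptron."""
    return 1.0 if x >= 0 else 0.0

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b), row by row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def perceptron_and(x1, x2):
    """A single threshold unit: one hyperplane is enough to compute AND."""
    return dense([x1, x2], [[1.0, 1.0]], [-1.5], step)[0]

# Hand-picked weights (assumption): two ReLU hidden units, one linear output.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUT_W = [[1.0, -2.0]]
OUT_B = [0.0]

def mlp_xor(x1, x2):
    """A 2-2-1 MLP that computes XOR, which no single perceptron can."""
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUT_W, OUT_B, lambda v: v)[0]

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, perceptron_and(a, b), mlp_xor(a, b))
```

The hidden layer is what makes the difference: the second hidden unit only activates when both inputs are on, and the output layer subtracts it to cancel the (1, 1) case.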
B. Random Forest
Random Forest is a supervised machine learning algorithm that is commonly used in classification and regression applications.
It builds decision trees on a variety of samples and uses their majority vote for classification and their average for regression.
Random forests, also called random decision forests, are an ensemble learning method that works by training a large number of
decision trees at training time. For classification tasks, the random forest's output is the class chosen by the majority of
trees.
For regression tasks, the individual tree predictions are averaged. Random decision forests correct for the tendency of decision
trees to overfit their training set. Random forests generally outperform single decision trees, but they are less accurate than
gradient-boosted trees. The characteristics of the data, however, can influence their performance.
Random forest can be used for both regression and classification tasks. Its predictions are accurate and understandable, and
it is able to process large amounts of data quickly. The random forest method outperforms the decision tree algorithm when it
comes to predicting outcomes.
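The aggregation step of a random forest (majority vote for classification, averaging for regression) can be sketched as below. This shows only how tree outputs are combined; the three "trees" are hypothetical hand-written threshold rules standing in for trained decision trees, and the thresholds are made up for illustration.

```python
from collections import Counter
from statistics import mean

def forest_classify(trees, record):
    """Classification: each tree votes, and the majority class wins."""
    votes = [tree(record) for tree in trees]
    # On a tie, the class that was voted for first is returned.
    return Counter(votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Regression: the forest's prediction is the mean of the tree outputs."""
    return mean(tree_outputs)

# Hypothetical stand-ins for trained trees (thresholds are assumptions).
trees = [
    lambda r: "attack" if r["duration"] > 10 else "normal",
    lambda r: "attack" if r["src_bytes"] > 1000 else "normal",
    lambda r: "attack" if r["count"] > 100 else "normal",
]

record = {"duration": 30, "src_bytes": 200, "count": 250}
print(forest_classify(trees, record))    # two of three trees vote "attack"
print(forest_regress([0.0, 1.0, 0.5]))   # average of the tree outputs
```

In a real random forest each tree is trained on a different bootstrap sample of the data, so individual trees disagree on some records and the vote smooths out their errors.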
Data confidentiality ensures that no unauthorized individual can access the data. An important part of data and communications
integrity is ensuring that information is transmitted accurately, without damage to individual pieces of data, so that it can be
trusted. Together, these services must ensure that the hardware and software of the system work correctly and that the data is
protected from tampering. Denial-of-service, a common type of attack on websites, is used to take down an entire website at
once; it occurs when a (remote) entity is made inaccessible. Although these attacks cannot be fully prevented, we can at least
lessen their likelihood. To keep a computer or network safe, most people use security settings to create a protective shield
around it. To get access to the system, intruders must pass a series of security checks that identify and validate their
identity.
VII. CONCLUSION
This paper has given an introduction to Intrusion Detection technology and discussed the current technology, its challenges,
and its future. Intrusion detection is still a relatively new technology, which means that there are many open research,
engineering, and scientific opportunities. Intrusion Detection Systems are widely used around the world, and more money is
being invested in the research and development of new systems. Intrusion Detection technologies will continue to improve,
since the threat of a larger attack will never go away.
Hence, we can conclude that the model built with Random Forest outperformed the model built using ANN techniques. The cost of
the model is also greatly reduced, as Random Forest does not require GPUs to run.
REFERENCES
[1] D. E. Denning, “An intrusion-detection model,” Software Engineering, IEEE Transactions on, no. 2, pp. 222–232, 1987.
[2] Y. Bai and H. Kobayashi, “Intrusion detection systems: technology and development,” in Advanced Information Networking and Applications, 2003. AINA
2003. 17th International Conference on. IEEE, 2003, pp. 710–715.
[3] B. Mukherjee, L. T. Heberlein, and K. N. Levitt, “Network intrusion detection,” Network, IEEE, vol. 8, no. 3, pp. 26–41, 1994.
[4] M. F. Marhusin, D. Cornforth, and H. Larkin, “An overview of recent advances in intrusion detection,” in Computer and Information Technology, 2008. CIT
2008. 8th IEEE International Conference on. IEEE, 2008, pp. 432–437.
[5] E. Nikolova and V. Jecheva, “Anomaly based intrusion detection using data mining and string metrics,” in Communications and Mobile Computing, 2009.
CMC’09. WRI International Conference on, vol. 3. IEEE, 2009, pp. 440–444.
[6] J. McHugh, A. Christie, and J. Allen, “Defending yourself: The role of intrusion detection systems,” Software, IEEE, vol. 17, no. 5, pp. 42–51, 2000.
[7] D. Dasgupta, “Advances in artificial immune systems,” Computational Intelligence Magazine, IEEE, vol. 1, no. 4, pp. 40–49, 2006.
[8] S. Naiping and Z. Genyuan, “A study on intrusion detection based on data mining,” in Information Science and Management Engineering (ISME), 2010
International Conference of, vol. 1. IEEE, 2010, pp. 135–138.
[9] M. Shankar, N. Rao, and S. Batsell, “Fusing intrusion data for detection and containment,” in Military Communications Conference, 2003. MILCOM’03. 2003
IEEE, vol. 2. IEEE, 2003, pp. 741–746.