License: CC BY 4.0
arXiv:2403.11180v1 [cs.CR] 17 Mar 2024

usfAD Based Effective Unknown Attack Detection Focused IDS Framework

Md. Ashraf Uddin ashraf.uddin@deakin.edu.au Sunil Aryal sunil.aryal@deakin.edu.au Mohamed Reda Bouadjenek Muna Al-Hawawreh Md. Alamin Talukder alamin.cse@iubat.edu School of Information Technology, Deakin University, Geelong, VIC 3125, Australia Department of Computer Science and Engineering, International University of Business Agriculture and Technology, Dhaka, Bangladesh
Abstract

The rapid expansion of varied network systems, including the Internet of Things (IoT) and Industrial Internet of Things (IIoT), has led to an increasing range of cyber threats. Ensuring robust protection against these threats necessitates the implementation of an effective Intrusion Detection System (IDS). For more than a decade, researchers have delved into supervised machine learning techniques to develop IDS that classify normal and attack traffic. However, building effective IDS models using supervised learning requires a substantial number of benign and attack samples. Collecting a sufficient number of attack samples from real-life scenarios is difficult because cyber attacks occur only occasionally. Further, IDS trained and tested on known datasets fail to detect zero-day or unknown attacks due to the swift evolution of attack patterns. To address this challenge, we put forth two strategies for semi-supervised learning based IDS where training samples of attacks are not required: 1) training a supervised machine learning model using randomly and uniformly dispersed synthetic attack samples; 2) building a One Class Classification (OCC) model that is trained exclusively on benign network traffic. We have implemented both approaches and compared their performance using 10 recent benchmark IDS datasets. Our findings demonstrate that the OCC model based on the state-of-the-art anomaly detection technique called usfAD significantly outperforms conventional supervised classification and other OCC based techniques when trained and tested under real-life conditions, particularly in detecting previously unseen attacks.

Keywords: IoT, Network Traffic, Intrusion Detection System, Anomaly Detection, One Class Classification, Zero Day Attacks.

1 Introduction

Intrusion Detection Systems (IDS) play a crucial role in safeguarding computer networks against cyber attacks Talukder et al. (2024b). An IDS examines network traffic and issues alerts whenever suspicious network and/or system activity is detected Talukder et al. (2023). With the increasing reliance on information network technology, cyber attacks targeting IoT have recently risen significantly. These attacks pose a significant threat not just to IoT but also aggressively target areas crucial to our society, such as national security, corporate data integrity, and public safety. Therefore, it is imperative to develop and deploy IDS for effective detection and prevention of these threats Mahmood et al. (2024); Agate et al. (2024); Belenguer et al. (2023).

Many influential studies Injadat et al. (2020), Gu and Lu (2021), Kilincer et al. (2021), Roy et al. (2022), Kilincer et al. (2022), Naseri and Gharehchopogh (2022) adopted supervised machine learning to build IDS, which requires large numbers of both normal and attack instances. These models depend heavily on historical data for training, which might not always include the latest types of cyber attacks. Further, the accuracy of these IDS varies, and they can often yield false negatives (failing to detect actual threats).

We can classify cyberattacks as known and unknown attacks (also called zero-day attacks) according to whether the IDS has seen the attack during the training phase. Known attacks have specific, identifiable signatures that an IDS can recognize based on its training. These attacks are easier to handle because the IDS has already learned their characteristics and patterns while training the model. In contrast, unknown attacks present a greater challenge. These are attacks that the IDS has not encountered before, and thus they lack recognizable patterns. Traditional IDS often struggle to identify these unknown attacks because they don’t match any expected behavior, profiles, or known attack signatures Fahad et al. (2017); Aghaei and Serpen (2019); Sánchez et al. (2021); Anand and Saifulla (2023).

IDS that use supervised Machine Learning (ML) algorithms typically learn patterns of normal and attack categories from training data to classify particular kinds of network traffic instances. A trained IDS can mostly identify testing instances based on its learned patterns Talukder et al. (2024c). However, ML struggles to correctly identify the class of instances that were not encountered during the training phase. In domains such as network intrusion and credit card fraud detection, obtaining enough attack instances for training an ML model is challenging due to their scarcity compared to normal traffic/data Talukder et al. (2024a). In addition, the characteristics of such attacks change swiftly. As a result, in practical scenarios, such models tend to produce a higher number of false negatives (identifying attacks as benign), which is not acceptable in real-world applications. The primary concern regarding false negatives is that they allow real threats to remain undiscovered and unaddressed. Such occurrences can result in successful attacks, leading to possible damage to systems, data breaches, monetary loss, and harm to an organization’s reputation.

To demonstrate the above-mentioned limitation of supervised learning-based IDS, we evaluate the capability of Random Forest (RF), a supervised learning technique that has been shown to outperform its counterparts in attack classification Negandhi et al. (2019); Liu et al. (2021); Wu et al. (2022), in detecting attacks that are unseen during the training phase. We train the RF model to classify attack and normal traffic (i.e., binary classification) by purposefully excluding some attack types from the training set while ensuring all attack types are present in the test set. Note that the task here is to differentiate attack from normal traffic and not to identify attack types correctly. All the experiments are conducted using 10-fold stratified cross-validation, and the results on two widely used IDS benchmark datasets, NSL-KDD and UNSW-NB15, are illustrated in Figure 1. On the x-axis, labels C0, C1, …, Cn denote the number of attack types omitted during the training step. C0 signifies that all attack types are incorporated into the training set. C1 means that we sequentially omit each attack category from the training set (first excluding attack category 1, then retaining it while excluding attack category 2, and so on). C2 represents the removal of two attack categories at once, and this pattern continues progressively. The y-axis reflects the average F1-score over the corresponding combinations of omissions. As there are many possible combinations for removing $n$ attack types, we tried all possible combinations and report the average F1-score and standard deviation. Figure 1 indicates that the RF classifier struggles to detect unknown attacks. The observed trend shows that the F1-score for the attack class decreases significantly as samples of more attack classes are excluded while training the RF model.

Figure 1: Impact of removing attack types from training datasets in the RF classifier. (a) F1-score for attack class on NSL-KDD. (b) F1-score for attack class on UNSW-NB15.

This observation underscores the ineffectiveness of supervised learning in detecting zero-day or previously unseen attacks. Most of the unseen attacks are classified as normal, which can be catastrophic in real-world applications. To tackle this challenge, we explored an alternative approach: training a supervised model using artificially generated data that is uniformly distributed in the feature space and labeled as the “attack” class (simulated attack instances). This strategy is intended to let the model recognize unseen/unknown attacks without direct training on them. However, our experiment reveals that while this approach improves the performance of supervised methods like RF in detecting some unknown attacks, the improvement is not large enough on high-dimensional real-world datasets to be useful in practical applications.

Considering the ineffectiveness of supervised models in real-life situations, we investigate semi-supervised techniques that are trained using only the available normal data to detect attack data. These semi-supervised learners are closely associated with the subfield of machine learning known as one-class classification (OCC). OCC algorithms aim to model a “normal” class in order to label unknown data as either normal or attack. These techniques are particularly useful for cyberattack detection, where training data from the normal class is easily available (because most network traffic is benign) while the availability and acquisition of training data from attack classes are limited and challenging Khan and Madden (2014).

In the field of IDS, several researchers Bezerra et al. (2019); Fahad et al. (2017); Anand and Saifulla (2023); Dini et al. (2022) have investigated OCC methods, including Local Outlier Factor (LOF) Breunig et al. (2000), One-Class SVM (OCSVM) Schölkopf et al. (1999), Isolation Forest (IF) Liu et al. (2008), and Elliptic Envelope (EE) Rousseeuw (1985). However, these studies often involve several other steps such as feature engineering, and their experiments are conducted on only a couple of datasets, which cannot fully represent the effectiveness of one-class classification in this domain. The results presented in these studies could reflect the impact of feature engineering, which may overshadow the true effectiveness of the OCC techniques. The performance of these models therefore needs to be examined on a wide range of contemporary IDS benchmark datasets. Moreover, recent studies have not yet examined the efficacy of newly developed advanced OCC methods like usfAD (Unsupervised Stochastic Forest based Anomaly Detector) Aryal (2018), and various forms of their ensembles, in the context of network intrusion detection. In Aryal et al. (2021), usfAD has been shown to work well, particularly on two cybersecurity datasets. Aryal (2018) developed and applied usfAD to generate scores for detecting outliers. However, usfAD as proposed in Aryal (2018) cannot be adopted directly to detect network attack samples. Here, we modify usfAD into an OCC method by introducing a threshold formula to detect attack categories in IDS.

In this article, we investigate a new OCC technique called usfAD and compare its performance with other popular models. In addition, we construct several ensemble approaches by combining usfAD and other OCC methods to detect network attacks with higher accuracy. We present ensemble approaches, namely Any One, Two, Three, Four, and Five, incorporating our usfAD model along with other state-of-the-art OCC models. The primary objective of these ensemble approaches is to minimize the false-negative rate. The choice of adopting Any One, Two, Three, or Four ensemble approaches depends on the system’s resilience against attacks. By exploring these approaches, we aim to provide a more comprehensive understanding of the effectiveness of one-class classification in the context of IDS.

Our contribution can be summarized as follows.

  • We conduct a new experiment to assess the effectiveness of supervised binary classification in detecting unknown or zero-day attacks and investigate the performance of supervised learning to detect unseen attacks by training it using artificially generated attack instances.

  • We develop a semi-supervised IDS for detecting zero-day attacks with higher accuracy. The model includes usfAD and other popular OCC methods and their ensembles. We obtain decision scores from OCC models for each training and testing instance and formulate a customized threshold to classify network instances as benign or attacks. Our results demonstrate that the recently proposed robust outlier detection technique called usfAD achieves higher accuracy with our outlier threshold across the majority of benchmark datasets employed in this study.

  • We implement and test the model using 10 widely used benchmark IDS datasets to demonstrate its effectiveness in detecting attack instances. We employ 10 runs of stratified 80/20 splits to evaluate the model’s performance in terms of average accuracy, precision, recall, and F1-score for each modern benchmark IDS dataset. We compare our results with several state-of-the-art works, and our findings show that our approach outperforms the existing works.

The structure of this paper is as follows: Section 2 presents a review of related literature. Section 3 details OCC classification architecture and materials used in this study. In Section 4, we present the results of our experiments. Finally, Section 5 summarizes the paper and outlines potential future research directions.

2 Related works

Most of the existing literature Bezerra et al. (2019); Anand and Saifulla (2023); da Silva et al. (2016); Dini et al. (2022) has primarily focused on distinguishing attack from normal network instances with methods that require both normal and attack samples for training. However, in real-life situations, obtaining attack samples is challenging, while normal samples are more readily available. In addition, most existing IDSs built on supervised learning also necessitate correctly labelled data. This makes them unsuitable for real-time use, as they are only able to identify known attacks and are unable to identify novel attack patterns that are not present in their training dataset.

However, some IDS models have been built with one-class classifier algorithms, which do not require labelled data. These models perform well on some datasets, such as NSL-KDD and UNSW-NB15, but they become less effective when used on more recent datasets, like ToN-IoT-Network, CIC-DDoS2019, XIIOTID, and others. The existing models display high false-negative rates, which is harmful for security-related applications where an attack could affect the system as a whole. Consequently, such an approach might not be suitable for real-life scenarios. In this section, we begin by analyzing the most recent OCC methods that are relevant to our study. One-class classifiers can train a model without relying on labeled samples of malicious activities. Unlike traditional classifiers that model multiple predefined patterns to evaluate the conformity of new instances, one-class classifiers focus on modeling a single pattern and use it to determine whether new instances belong to that pattern. This approach proves advantageous in the context of IoT devices, which typically exhibit specific behavior characterized by the execution of straightforward tasks while efficiently utilizing computational resources Bezerra et al. (2019).

Extensive research has been presented in the IDS field on detecting attacks using OCC models. For example, Umer et al. Fahad et al. (2017) assessed the performance of one-class classification techniques for early detection of malicious flows in a multi-stage flow-based intrusion detection system. The initial stage involved the utilization of minimal flow to classify IP flows as normal or malicious. One-class classification was employed, focusing solely on the malicious class. The performance of the classifier was evaluated using a test dataset comprising both normal and malicious IP flow records. Performance measures such as the Area under the Receiver Operating Characteristic (ROC) curve (AUC) and the F1 score were utilized for result comparison. The findings highlighted the superior accuracy of SVM-based one-class classifiers in detecting malicious IP flows. The v-SVM achieved an AUC of 0.9297 and an F1 score of 0.9114. Based on their experimental results, SVM-based one-class classification techniques were deemed suitable for identifying malicious IP flows.

Anand et al. Anand and Saifulla (2023) introduced a Machine Learning-based IDS to detect slow rate HTTP/2.0 Denial of Service (DoS) attacks. They extracted 15 essential features from the datasets. The datasets with the reduced number of features were fed into three one-class classifier algorithms: OCSVM, IF, and Minimum Covariance Determinant (MCD). The proposed classifier algorithm outperformed other algorithms in terms of various evaluation measures, including accuracy (0.99), sensitivity (0.99), and specificity (0.99). This highlights the superior performance of their approach in detecting slow rate HTTP/2.0 DoS attacks. An inherent limitation of this study is that the one-class classifiers were trained using datasets specific to particular attack types, whereas real-world attack scenarios vary significantly.

The researchers Dini et al. (2022) employed the one-class classifier approach to tackle the challenge of anomaly detection in communication networks. They introduced a novel anomaly detection algorithm that incorporated polynomial interpolation and statistical analysis in its design. This method was applied to well-known datasets widely used in the scientific community, including KDD99, UNSW-NB15, and CSE-CIC-IDS-2018. Additionally, the algorithm was evaluated using a newly available dataset called EDGE-IIOTSET 2022. The study findings showed that their methodology outperformed traditional one-class classifiers (such as Extreme Learning Machine and Support Vector Machine models) as well as rule-based intrusion detection systems like SNORT.

Wan et al. Wan et al. (2017) introduced a one-class classification based anomaly IDS method specifically designed for networked control systems, focusing on dual behavior characteristics. Their approach aimed to provide a clear and understandable solution. By leveraging the unique features of industrial communication, the study aimed to identify and diagnose two specific industrial communication behaviors using a dual one-class classifier approach. The primary objective was to accurately summarize industrial communication behaviors. To accomplish this, the authors proposed the utilization of one-class classifiers, namely OCSVM and RE-KPCA (Reconstruction Error based on Kernel Principal Component Analysis), to detect and classify misbehaviors in industrial communication. They incorporated a weighted mixed kernel function and employed PSO (Particle Swarm Optimization) parameter optimization to enhance the classification performance. The average accuracy across all three attack types was approximately 83.45%. Furthermore, the duration of each attack type was similar, with an average duration of approximately 26.11 seconds for all three attack types.

Khraisat et al. Khraisat et al. (2020) developed a Hybrid Intrusion Detection System (HIDS) by combining the C5 decision tree classifier with One Class Support Vector Machine (OC-SVM). The HIDS merges the capabilities of a Signature-based Intrusion Detection System (SIDS) and an Anomaly-based Intrusion Detection System (AIDS). The SIDS algorithm was derived from the C5.0 decision tree classifier, while the AIDS algorithm was derived from the one-class Support Vector Machine (SVM). The primary objective of this framework was to accurately detect both well-known intrusions and zero-day attacks while minimizing false alarms. To evaluate the effectiveness of the HIDS, the researchers utilized two benchmark datasets: the Network Security Laboratory-Knowledge Discovery in Databases (NSL-KDD) dataset and the Australian Defence Force Academy (ADFA) dataset. The proposed technique successfully integrated the two stages, resulting in an accuracy rate of 83.24%.

We summarize the prior work related to this paper in Table 1.

Table 1: Overview of the related literature.
Ref | Datasets | Models | Remarks
Fahad et al. (2017) | CTU-13 dataset | Density Estimation: Simple Gaussian, Mixture of Gaussian, Parzen density estimation; Reconstruction Methods: AE, SOM, PCA; Boundary Methods: v-SVM, SVDD | v-SVM achieved the highest performance.
da Silva et al. (2016) | SCADA (Supervisory Control and Data Acquisition) systems | SVM, SVDD (Support Vector Data Description) | OCSVM was found to perform better.
Anand and Saifulla (2023) | Slow rate DoS data | SVM, IF, LOF, Elliptic Envelope (EE) | EE achieved the higher accuracy for slow rate HTTP DoS attack.
Dini et al. (2022) | KDD99, UNSW-NB15, CSE-CIC-IDS-2018, EDGE-IIOTSET 2022 | SVM and Extreme Learning Machine (ELM) with PCA and Polynomial Interpolation | PCA based feature selection was applied. Outcome does not reflect the effectiveness of the SVM and ELM.
Wan et al. (2017) | SCADA System | SVM-Mixed Kernel, Gaussian Kernel, Polynomial kernel | Mixed kernel based SVM produced higher accuracy.
Khraisat et al. (2020) | NSL-KDD, ADFA | C5+OCSVM | Stacking ensemble of C5 and One Class SVM was applied.
Al-Qudah et al. (2023) | Malmem2022 | PCC+OCSVM | PCC was used to select the most important features.
Min et al. (2021) | NSL-KDD, UNSW-NB15, CICIDS 2017 | OCSVM, AE, MemAE, SparseMemAE | MemAE achieved the higher accuracy.
Mhamdi et al. (2020) | NSL-KDD | SAE-SVM | This approach merged stack auto encoder and one class SVM.
Nguyen et al. (2018) | KDD +99 | Nested OCSVM | Multiple OCSVM was applied.
Arregoces et al. (2022) | UNSW-NB | SVM, IF, LOF, EE | SVM achieved the higher accuracy.
Xu et al. (2021) | NSL-KDD | AE | The work investigated the performance of AE by varying its parameter.
Alazzam et al. (2022) | KDDCUP-99, UNSW-NB15, NSL-KDD | OCSVM with Pigeon inspired optimizer | Across all datasets, the author showcased elevated accuracy levels. However, it’s noteworthy that achieving such elevated accuracy is not common among researchers, particularly when dealing with the intricate nature of UNSW-NB15.
Bezerra et al. (2019) | BoT-Net | EE, IF, LOF, and One-class SVM | Local Outlier and One Class SVM achieved the higher accuracy.
Proposed OCC | NSL-KDD, UNSW-NB15, ISCXURL2016, Darknet2020, Malmem2022, ToN-IoT-Network, CIC-DDoS2019, CIC-DoS2017, XIIOTID, ToN-IoT-Linux | LOF, One Class SVM, IF, usfAD, AE and VAE | usfAD is found to be performing well across all the datasets.

In this study, we first aim to investigate methods for employing supervised learning techniques to identify zero-day or previously unknown attacks. However, training supervised models in the context of IDS poses multiple challenges. Foremost, obtaining an extensive dataset with accurately labeled normal and attack instances is often expensive or infeasible in real-world scenarios. Even when accurately labeled attack instances can be acquired, they are infrequent, leading to a skewed class distribution while training a supervised learner. Most existing literature addresses this imbalance issue by adopting techniques such as under-sampling of normal instances and over-sampling of attack instances. However, this approach tends to generate samples that mirror the distribution of familiar attack patterns. Given that future attacks can manifest in any region of the feature space, diverging from known distributions, supervised models struggle to generalize over all potential attack patterns when trained on datasets created using conventional balancing techniques. To overcome these challenges, we suggest two primary strategies in this article: i) we can train a supervised algorithm incorporating simulated attack data spread across the entirety of the feature space, a methodology employed by Aryal et al. Aryal and Wells (2021) for anomaly detection within the data; ii) we can employ techniques that require only the normal or benign data samples, bypassing the need for attack instances entirely. The second approach is more suitable in the IDS domain, where normal data samples are more readily accessible.

3 The Proposed IDS Framework

In this paper, we discuss two approaches to detecting unknown attacks (here, unknown attacks are those that are not seen by the model during the training phase but appear in the testing datasets). The first approach is to train a supervised learner by incorporating randomly generated dummy attack instances into the original training datasets. The second approach is to adopt a new OCC algorithm called usfAD, along with different ensembles of usfAD and other state-of-the-art OCC algorithms, to detect unknown attacks.

In this section, we first train a supervised model using both known normal and attack instances, incorporating uniformly distributed noise data (simulated data) throughout the feature space and labeling it as “attack”. This approach is adopted from Aryal et al. Aryal and Wells (2021), who introduced the notion of identifying outliers across the local regions of the entire feature space. However, their research did not delve into its application in IDS, particularly for classifying novel attack types. Given this, there is an opportunity to explore the idea of integrating noise labeled as “attack” to identify previously unknown attacks in IDS. In theory, a model learned with such simulated data should be able to identify a broad range of attack instances. As a result, it is anticipated that unknown attacks might be detected, as they might align with the distributions learned by the model from the simulated data.

Secondly, one-class classification (OCC) emerges as a more suitable choice, wherein the model is trained using only normal or benign data. In this paper, we utilize a new OCC method called usfAD. Below, we first describe the methodology for assessing a supervised model’s effectiveness in detecting unseen attack instances before discussing our OCC based framework.

3.1 Methodology for supervised model’s Effectiveness in Detecting Unknown Attack

Prior to explaining the intrusion detection system centered on OCC models, we outline the experimental procedure undertaken to evaluate the potential of supervised learning in identifying unknown attacks without including any noise data or simulated attack data. As a representative of supervised learning, we employ the widely recognized and effective classifier: Random Forest (RF), training and testing it with contemporary benchmark IDS datasets. To assess RF’s capability in detecting unknown attacks, we train it using datasets where instances of a specific attack type are removed from the training data while retaining that attack type within the testing data.

Below, we describe the methodological steps of this experiment, aimed at assessing the effectiveness of RF in detecting unknown attacks. We consider a dataset consisting of benign (b) instances and different types of attacks (a1, a2, a3, a4). The process’s methodology and algorithm are depicted in Figure 2 and Algorithm 1, respectively.

Figure 2: Experiment of Random Forest to detect unknown attacks
  • In this initial step, we create combinations of attacks for omitting from the datasets, such as single attacks a1, a2, a3, and a4, two-attack combinations {a1, a2}, {a1, a3}, {a1, a4}, {a2, a3}, {a2, a4}, {a3, a4}, three-attack combinations {a1, a2, a3}, {a1, a3, a4}, {a2, a3, a4}, {a1, a2, a4}, four-attack combination {a1, a2, a3, a4} and so on.

  • For each omitted combination, we conduct stratified 10-runs and calculate the average accuracy over these 10 folds. Then, we compute the average accuracy over all combinations of the same length (one, two, three, or four attacks). For instance, we calculate the average accuracy, precision, recall, and F1-score for one-attack combinations by summing the 10-fold performance metrics of a1, a2, a3, a4 and dividing by 4, since there are four distinct combinations with a single attack type. A similar methodology is applied for combinations of two attack types and so forth.

  • Finally, we represent the accuracy and F1-score values from each length of combination (one, two, three, and four attacks) in a graph (as illustrated in Figure 1) to visualize the results. This graph provides insights into the performance of the model based on different combinations of attacks.

Input: Dataset with instances of benign ($b$) and attacks ($a1, a2, a3, \dots, am$)
Output: Accuracy values for different attack combinations
for $k = 1$ to $m$ do
      Generate all $k$-attack combinations $\{C_1, C_2, \dots, C_n\}$;
      for $i = 1$ to $n$ do
            Compute the accuracy of combination $C_i$ using stratified 10-runs;
            Calculate the average accuracy over 10 folds and store it as $Acc_i$;
      end for
      Compute the average accuracy for all $k$-attack combinations: $Avg(Acc_k) = \frac{1}{n}\sum_{i=1}^{n} Acc_i$;
end for
Plot a graph with $x$-axis representing the number of attacks in a combination ($k = 1, 2, 3, \dots, m$), and $y$-axis representing the corresponding average accuracy $Avg(Acc_k)$;
Algorithm 1 Attack Combination for 10 runs
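
The loop structure of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative outline, not the implementation used in the experiments: the function name `average_score_by_omission_size`, the callback `evaluate(omitted, run)`, and the placeholder `toy_evaluate` are hypothetical names introduced here. In practice the callback would train RF on the data without the omitted attack types and return the score of one run.

```python
from itertools import combinations
from statistics import mean


def average_score_by_omission_size(attack_types, evaluate, runs=10):
    """Return {k: mean score over all k-attack omissions}.

    `evaluate(omitted, run)` is a hypothetical callback: it should train on
    data without the attack types in `omitted`, test on all attack types,
    and return one score (e.g. accuracy or F1) for the given run index.
    """
    averages = {}
    for k in range(1, len(attack_types) + 1):
        per_combination = []
        for omitted in combinations(attack_types, k):
            # Average the repeated runs for this combination (Acc_i).
            run_scores = [evaluate(omitted, run) for run in range(runs)]
            per_combination.append(mean(run_scores))
        # Average over all n combinations of size k (Avg(Acc_k)).
        averages[k] = mean(per_combination)
    return averages


def toy_evaluate(omitted, run):
    # Placeholder scorer (assumption): the score drops by 0.1 per omitted type.
    return 1.0 - 0.1 * len(omitted)


if __name__ == "__main__":
    result = average_score_by_omission_size(["a1", "a2", "a3", "a4"], toy_evaluate)
    for k, score in result.items():
        print(k, round(score, 2))
```

With four attack types the loop visits 4, 6, 4, and 1 combinations for k = 1 to 4, which matches the enumeration given in the first step above.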

3.2 Methodology for Supervised Model’s Effectiveness in Detecting Unknown Attack with Simulated Attack Data

In this case, we incorporate a certain number of simulated attack instances (also called noise data) into the original training datasets. We adopt a similar methodology to validate the efficacy of the RF model in identifying unknown attacks, both without noise data (described in the previous section) and with it. In this scenario, we again first form a training dataset by excluding specific combinations of original attack types, following the previously discussed approach. Next, we add a predetermined quantity of noise instances (dummy or simulated attack data) to the training dataset. In other words, we deliberately remove instances of certain attack types from the training dataset and then add simulated attack data so that the model may still detect the removed attack types. In real-life scenarios, we might not have enough samples of the various types of attack data. We therefore expect that a supervised model trained with diverse simulated attack instances might be capable of detecting unknown attacks. The count of dummy attack instances in every combination, where we remove certain attack categories from the training dataset, is defined as:

Noisen=Nn𝑁𝑜𝑖𝑠subscript𝑒𝑛subscript𝑁𝑛Noise_{n}=N_{n}italic_N italic_o italic_i italic_s italic_e start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, where Nnsubscript𝑁𝑛N_{n}italic_N start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT represents the count of regular instances in the original dataset.

This strategy ensures an even distribution between normal and attack data. Given $A_n$ as the count of attack instances, after excluding certain combinations denoted as $C_n$ and subsequently adding the noise or simulated instances categorized as attacks, the total count of instances in the training datasets becomes:

$T_n = N_n + (A_n - C_n) + Noise_n$

For every $C_n$ combination, we incorporate the same set of uniformly distributed, randomly generated noise data.
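The construction above can be sketched as follows. This is a minimal illustration, assuming the features are already scaled to [0, 1]; the function name `build_noisy_training_set`, the toy arrays, and the attack-type labels are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def build_noisy_training_set(X_normal, X_attack, attack_types, excluded_types, seed=0):
    """Drop the excluded attack types, then append uniform noise labeled as attack.

    Assumption: all features lie in [0, 1]; label 0 = normal, 1 = attack.
    """
    rng = np.random.default_rng(seed)  # fixed seed: same noise for every combination
    keep = ~np.isin(attack_types, list(excluded_types))
    X_attack_kept = X_attack[keep]                 # A_n - C_n instances
    n_noise = X_normal.shape[0]                    # Noise_n = N_n
    X_noise = rng.uniform(0.0, 1.0, size=(n_noise, X_normal.shape[1]))
    X_train = np.vstack([X_normal, X_attack_kept, X_noise])
    y_train = np.concatenate([
        np.zeros(len(X_normal), dtype=int),
        np.ones(len(X_attack_kept), dtype=int),
        np.ones(n_noise, dtype=int),               # noise is labeled as attack
    ])
    return X_train, y_train

# Toy usage with hypothetical attack-type names.
data_rng = np.random.default_rng(1)
X_normal = data_rng.uniform(size=(100, 4))
X_attack = data_rng.uniform(size=(60, 4))
attack_types = np.array(["dos"] * 30 + ["probe"] * 20 + ["r2l"] * 10)
X_train, y_train = build_noisy_training_set(X_normal, X_attack, attack_types, {"probe"})
print(X_train.shape, int(y_train.sum()))
```

In this toy case $N_n = 100$, $A_n = 60$, $C_n = 20$ and $Noise_n = 100$, so $T_n = 100 + (60 - 20) + 100 = 240$.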

3.3 Methodology of usfAD Based IDS

In this study, our objective is to develop an OCC based system for the detection of attack instances. The architecture of the model is illustrated in Figure 3. The benefits of the OCC model are twofold. First, it does not require attack samples for training, which overcomes the limited availability of attack samples in supervised learning. Second, it does not depend on specific attack types, whereas the dynamic nature of attack characteristics often leads to poor performance of supervised models trained on specific attack types. In addition, supervised learners demand datasets with accurate class labels, which may not be feasible in real-world scenarios. An OCC based intrusion detection system addresses these challenges by eliminating the need for explicit dataset labeling and attack samples.

To train the new OCC algorithm called usfAD, we need datasets that contain only normal or regular network traffic instances. To form a new training dataset devoid of attack instances, we remove all attack instances from the original training dataset. This modified training dataset is used to train usfAD and other standard OCC algorithms including LOF, One-Class SVM, IF, VAE and AE. We also form different ensemble models using the trained OCC models. During the testing phase of these models, any data instance that is not classified as normal is considered an attack instance. The OCC algorithm that we implement in this article is presented in Algorithm 2. Our OCC paradigm is described in detail below.

Figure 3: Architecture of One class classification for IDS
Input: IDS dataset with labeled normal and attack data points; $n$ = number of models; $m$ = number of testing instances; $CL$ = consensus level (Any One, Two, Three ensemble and so on)
Output: Predicted outcomes for individual models and ensemble models
for $f = 1$ to $10$ do
    Split the dataset into 80% ($X_{train}[f]$, $Y_{train}[f]$) and 20% ($X_{test}[f]$, $Y_{test}[f]$);
    Extract normal points ($X'_{train}$ and $Y'_{train}$) from $X_{train}[f]$ and $Y_{train}[f]$;
    for $i = 1$ to $n$ do
        $MT[i]$: model trained using $X'_{train}$ and $Y'_{train}$;
        Generate training scores: $S_{train}[i] = \text{score\_sample}(MT[i], X'_{train})$;
        Compute threshold: $TH[i] = \mu(S_{train}[i]) - 3 \times \sigma(S_{train}[i])$;
        Generate testing scores: $S_{test}[i] = \text{score\_sample}(MT[i], X_{test}[f])$;
        for $j = 1$ to $m$ do
            if $S_{test}[i][j] \leq TH[i]$ then
                $predict[i][j] = 1$
            else
                $predict[i][j] = 0$
            end if
        end for
    end for
    // Outer loop defines different consensus levels (Any One, Two and so on)
    for $k = 1$ to $CL$ do
        // Go through each instance to make a prediction
        for $j = 1$ to $m$ do
            Initialize attack count $A = 0$;
            // Loop over the predictions from each model
            for $i = 1$ to $n$ do
                if $predict[i][j] == 1$ then
                    Increment $A$;
                end if
            end for
            // The number of positive predictions is at least $k$ (consensus level)
            if $A \geq k$ then
                $enspredict[k][j] = 1$;
            else
                $enspredict[k][j] = 0$;
            end if
        end for
    end for
    // Classification report for individual models
    $p_f$ = classification_report($predict_f$, $Y_{test}[f]$);
    // Classification report for ensemble models
    $ensp_f$ = classification_report($enspredict_f$, $Y_{test}[f]$);
end for
Calculate average performance over the 10 folds;
Algorithm 2 Proposed One Class Classifier model
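The thresholding and consensus steps of Algorithm 2 can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumption that each model has already produced one score array for the normal training points and one for the test points (lower score = more anomalous); the function names and the toy vote matrix are hypothetical.

```python
import numpy as np

def three_sigma_threshold(train_scores):
    """TH = mean - 3 * std of the scores on the normal training points."""
    return train_scores.mean() - 3.0 * train_scores.std()

def individual_predictions(train_scores, test_scores):
    """Return an (n_models, m) array with 1 = attack, 0 = normal.

    train_scores / test_scores: lists holding one 1-D score array per model.
    """
    preds = [
        (s_test <= three_sigma_threshold(s_train)).astype(int)
        for s_train, s_test in zip(train_scores, test_scores)
    ]
    return np.vstack(preds)

def consensus_predictions(predict, max_level):
    """Row k-1 flags an instance as attack if at least k models flag it."""
    votes = predict.sum(axis=0)  # attack count A for every test instance
    return np.vstack([(votes >= k).astype(int) for k in range(1, max_level + 1)])

# Toy example: 3 models, 4 test instances.
predict = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
])
enspredict = consensus_predictions(predict, max_level=3)
print(enspredict)
```

Vectorizing the vote count replaces the two inner loops of the pseudocode; the consensus level $k$ is the only quantity that changes between ensemble rows.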
  • Prepare the IDS dataset: We make sure that the dataset is labeled with normal and attack data points, including different attack types.

  • Train OCC Models: We choose usfAD and multiple popular OCC techniques, such as LOF, IF, OCSVM, VAE and AE ($M_1, M_2, M_3, \ldots, M_n$). We train each model using only the normal instances from the IDS dataset. We obtain a training score for every instance in the training set and derive a threshold score to identify attack instances. The threshold (TH) is computed as mean(training score) - 3 × standard deviation(training score), that is, $TH = \mu - 3 \times \sigma$, based on the standard three-sigma rule of statistics. Calculating the threshold from the mean and standard deviation of the training scores establishes a boundary: instances with scores above this threshold are considered normal, while instances with scores below the threshold are flagged as potential attacks. For illustrative purposes, Figure 4 displays the decision scores derived from the NSL-KDD training datasets, along with their mean and threshold values. As evident from Figure 4, the green line demarcates the boundary that determines whether the score of a testing instance is categorized as attack or normal.

    Figure 4: Threshold on the score of training datasets in usfAD model
  • Ensemble Model Formation: Subsequently, we construct five different ensemble models by utilizing the prediction results from each individual OCC model. Our ensemble models are: one-model, two-model, three-model, four-model, and five-model ensemble for 10 different datasets. In the case of the one-model ensemble, an attack outcome is determined if any one of the five individual models predicts an attack. Correspondingly, for the two-model ensemble, an attack outcome is derived if at least two of the five individual models predict an attack, and this pattern continues for the other ensemble configurations.

  • Performance Determination: The performance of each trained OCC model and of the ensemble models is assessed using 10 stratified runs. In this stage, every model labels a traffic instance as an attack if the score of the instance is less than the threshold (TH), which is computed from the scores of the training datasets. We also consider the default prediction outcome of LOF, IF, OCSVM, VAE and AE, where the outcome -1 is interpreted as 1 (attack) and the outcome 1 is interpreted as 0 (normal).
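The two decision rules described above (the three-sigma threshold on training scores and the library's default prediction) can be contrasted with scikit-learn detectors. The sketch below assumes synthetic two-feature data and arbitrary hyperparameters, none of which come from the paper; usfAD, VAE and AE are omitted because they are not part of scikit-learn, and LOF would additionally require `novelty=True`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Assumed toy data: one benign cluster for training, benign plus attack for testing.
X_normal_train = rng.normal(0.5, 0.05, size=(500, 2))
X_test = np.vstack([
    rng.normal(0.5, 0.05, size=(100, 2)),   # benign
    rng.normal(0.95, 0.01, size=(100, 2)),  # attack
])

models = {
    "IF": IsolationForest(random_state=0),
    "OCSVM": OneClassSVM(gamma="scale", nu=0.05),
}

results = {}
for name, model in models.items():
    model.fit(X_normal_train)                        # trained on benign traffic only
    s_train = model.score_samples(X_normal_train)    # lower score = more anomalous
    th = s_train.mean() - 3.0 * s_train.std()        # TH = mu - 3 * sigma
    pred_threshold = (model.score_samples(X_test) <= th).astype(int)
    # Default library output: -1 (outlier) -> 1 (attack), 1 (inlier) -> 0 (normal).
    pred_default = (model.predict(X_test) == -1).astype(int)
    results[name] = (pred_threshold, pred_default)
    print(name, int(pred_threshold.sum()), int(pred_default.sum()))
```

The two rules can disagree because the default prediction relies on each detector's built-in offset, whereas the threshold rule depends only on the spread of the training scores.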

3.4 Datasets and Pre-Processing

We convert categorical values into numerical values, handle missing values by mean imputation, and scale the feature values using min-max normalisation. We did not perform other pre-processing steps such as handling outliers or addressing multicollinearity. Our objective is to evaluate the efficacy of the OCC system with minimal data pre-processing, ensuring that the results primarily reflect the intrinsic nature of this approach. We briefly describe the datasets that are used to train our usfAD based model.

  • NSL-KDD: The NSL-KDD dataset Su et al. (2020) was designed to overcome the issues with the KDD’99 dataset. This updated version of the KDD dataset is still regarded as an effective benchmark for researchers to compare different intrusion detection approaches. The NSL-KDD training and testing sets have a balanced quantity of records for benign and attack samples. The shape of the dataset is (148517, 44). Three categorical features (protocol_type, service, flag) are converted into numerical features using one-hot encoding.

  • UNSW-NB15 dataset: The Network Security Research Lab at the University of New South Wales, Australia, built the UNSW-NB15 dataset by capturing network traffic in a realistic setting using a high-speed network sniffer and various tools and techniques such as packet flooding, port scanning, and SQL injection. The original dataset contains 257,673 records and 45 fields. The three categorical features (proto, service, state) are converted into numerical features using one hot encoding method.

  • CIC-IDS2017: The Canadian Institute for Cybersecurity released the CIC-IDS2017 dataset Jazi et al. (2017), which is a benchmark dataset for intrusion detection systems. The dataset includes protocol-agnostic user behaviour models covering HTTP, HTTPS, FTP, SSH, and email. The dataset consists of 222914 records and 78 features, with the following classes in the output label: benign samples, DoS SlowLoris samples, DoS Slow Httptest samples, DoS Hulk samples, DoS GoldenEye samples, and Heartbleed samples.

  • CIC-DDoS2019: The Canadian Institute for Cybersecurity at the University of New Brunswick created a dataset of DDoS attacks called CIC-DDoS2019. This dataset contains both normal traffic patterns and a wide variety of distributed denial of service (DDoS) attacks, such as UDP flood, HTTP flood, and TCP SYN. The shape of the dataset is (431371, 79), with 333540 attack instances and 97831 benign instances.

  • Malmem2022: Obfuscated malware hides itself to avoid detection and removal by conventional anti-malware software. Malmem2022 Carrier et al. (2022) is a simulated obfuscated-malware dataset designed to be as realistic as possible for training and testing machine learning algorithms that detect obfuscated malware. The dataset is balanced, with the following level-2 categories: Spyware, Ransomware, and Trojan Horse.

  • ToN-IoT-Network and ToN-IoT-Linux: ToN-IoT was extracted from a realistic large-scale simulated IoT environment at the Cyber Range Lab led by ACCS in 2019. The dataset contains heterogeneous telemetry from IoT services, traffic flows, and operating system logs. Later, a Bro-IDS (now known as Zeek) dataset with 44 features was formed from the original dataset by considering the network traffic flows. Label encoding is used to convert its categorical features into numerical features, following Moustafa (2021) and Guo et al. (2023). These datasets contain IP addresses. Each unique IP address could be treated as a category and one-hot encoded. Although this is theoretically possible, it is usually not practical for real-world IDS systems because the vast number of unique IP addresses leads to extremely high-dimensional data.

  • ISCXURL2016: On the World Wide Web, URLs serve as the primary mode of transport, and attackers insert malware into users’ computer systems through URLs. Researchers therefore focus on developing methods for blacklisting malicious URLs. Mamun et al. (2016) formed a modern URL dataset that contains the following categories of URLs: benign URLs, spam URLs, phishing URLs, malware URLs and defacement URLs. The shape of the original dataset is (36707, 80).

  • CIC-Darknet2020: The CIC-Darknet2020 dataset has 141530 records with 85 feature columns and is labelled in two ways. We apply label encoding to convert its categorical features into numerical values.

  • XIIoTID: The XIIoTID dataset Al-Hawawreh et al. (2022) has an initial shape of (596017, 64). The dataset has features from network traffic, system logs, application logs, device resources (CPU, input/output, memory, and others), and the logs of commercial intrusion detection systems (OSSEC and Zeek/Bro). After one-hot encoding of these features, the total number of features increases to 81.
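The pre-processing steps stated at the start of this subsection (categorical encoding, mean imputation, min-max scaling) can be sketched with pandas and scikit-learn. The column names and values below are made up for illustration, and fitting the scaler on the whole frame is a simplification of this sketch rather than a statement about the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature traffic table with one categorical column and missing values.
df = pd.DataFrame({
    "duration": [0.0, 12.0, np.nan, 3.0],
    "src_bytes": [100.0, 2500.0, 40.0, np.nan],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})

# 1) One-hot encode the categorical feature.
encoded = pd.get_dummies(df, columns=["protocol_type"], dtype=float)

# 2) Mean imputation: replace each missing value with its column mean.
imputed = encoded.fillna(encoded.mean())

# 3) Min-max normalisation to the [0, 1] range.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(imputed), columns=imputed.columns
)
print(scaled.round(3))
```

Label encoding, which is used for some of the datasets above, would replace step 1 with an integer code per category and keep the feature count unchanged.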

3.5 Dataset Partitioning and Training

We perform 10 runs of random stratified 80/20 splits to preserve the proportion of each class in every run. This approach provides a more accurate estimate of model performance, particularly when working with imbalanced datasets in which one class has more samples than the other. In this study, we trained and evaluated both the RF and OCC models using the same 10 runs of random stratified 80/20 splits.
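One way to obtain such splits is scikit-learn's `StratifiedShuffleSplit`, shown below as a sketch. The toy data, the class ratio, and the `random_state` value are assumptions; fixing the seed is what lets two different models be evaluated on identical splits.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Assumed toy data: 800 benign (0) and 200 attack (1) instances.
X = np.random.default_rng(0).uniform(size=(1000, 5))
y = np.array([0] * 800 + [1] * 200)

# 10 random stratified 80/20 splits; a fixed seed makes them reproducible.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42)
splits = list(splitter.split(X, y))

for run, (train_idx, test_idx) in enumerate(splits, start=1):
    attack_share = y[test_idx].mean()
    print(run, len(train_idx), len(test_idx), round(float(attack_share), 2))
```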

3.6 Semi-supervised Outlier Detection Algorithms

In this study, we explore the efficacy of usfAD and other OCC models, including LOF, OCSVM, IF, VAE (Variational Autoencoder) and AE (Autoencoder), in distinguishing benign from unseen attack instances within IDS datasets. LOF, OCSVM, IF, VAE and AE are well-established outlier detection techniques, and their implementations are sourced from the scikit-learn and PyOD Zhao et al. (2019b) machine learning libraries. The implementation of usfAD is obtained from Aryal et al. (2021). Below, we provide a concise overview of each OCC model.

  • The LOF algorithm Breunig et al. (2000) is a density-based anomaly detection method. LOF measures the local deviation of a data point with respect to its neighbors by comparing the density of the point to the densities of its neighbors. If the density of a data point is significantly lower than the densities of its neighbors, the point is likely to be an anomaly.

  • One-Class Support Vector Machine (One-Class SVM) Schölkopf et al. (1999) is an unsupervised machine learning algorithm that is primarily used for novelty detection. OCSVM finds a hyperplane in the feature space that separates the majority of data points from the origin (or a set margin away from the origin) with the largest possible margin. This essentially encloses the majority of data points in a region, and anything that falls outside of this region is regarded as an outlier or anomaly.

  • We adopt usfAD from Aryal et al. (2021), who designed the algorithm based on the "Unsupervised Stochastic Forest" (USF) Fernando and Webb (2017) and Isolation Forest Liu et al. (2008). Unsupervised Stochastic Forest is a variation of the unsupervised random forest. Isolation Forest, a variant based on the random forest model, offers swift anomaly detection without depending on density or distance measures, making it considerably faster than many conventional methods. usfAD is a robust anomaly scoring technique that does not depend on the scales and units of the data Aryal et al. (2021).

  • Autoencoders (AE) Zhou and Paffenroth (2017) trained with normal points can reconstruct these points with minimal error, whereas anomalies or outliers result in higher error. If the reconstruction error surpasses a predefined threshold, the data instance is considered an anomaly or outlier. The AE learns the data distribution of the "normal" class, and deviations from this distribution (outliers) are more difficult to reconstruct precisely.

  • Variational Autoencoder (VAE) An and Cho (2015) is a variation of AE and a generative model that learns to encode and decode data in a way that can be used to detect anomalies or outliers. If the model is trained primarily on "normal" data, it can reconstruct these normal samples accurately. In contrast, data points that deviate from the normal training data are reconstructed with higher error, which identifies them as outliers.
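The reconstruction-error principle shared by AE and VAE can be illustrated without a neural network. The sketch below swaps in a linear encoder/decoder obtained from an SVD (equivalent to PCA with a one-dimensional bottleneck) as a stand-in for a trained autoencoder; it is not the AE or VAE used in this study, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed "normal" data: points close to the line y = 2x.
t = rng.normal(size=(500, 1))
X_normal = np.hstack([t, 2.0 * t]) + rng.normal(scale=0.05, size=(500, 2))

# Linear stand-in for an autoencoder: project onto the top principal direction.
mean = X_normal.mean(axis=0)
_, _, vt = np.linalg.svd(X_normal - mean, full_matrices=False)
W = vt[:1]  # 1-D bottleneck

def score(X):
    """Negative squared reconstruction error, so lower = more anomalous."""
    Z = (X - mean) @ W.T      # encode
    X_hat = Z @ W + mean      # decode
    return -np.sum((X - X_hat) ** 2, axis=1)

train_scores = score(X_normal)
threshold = train_scores.mean() - 3.0 * train_scores.std()  # TH = mu - 3 * sigma

x_anomaly = np.array([[1.0, -2.0]])  # far from the line the model has learned
flagged = score(x_anomaly) <= threshold
train_flag_rate = float((train_scores <= threshold).mean())
print(bool(flagged[0]), train_flag_rate)
```

Negating the error keeps the same "score below threshold means attack" convention that is applied to the other OCC models.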

3.7 Experimental Setup and Implementation

Our experimental study was performed on an Intel Xeon E5-2670 CPU (8 cores, 16 threads) with 128GB DDR3 RAM and 2x Nvidia GTX 1080 Ti. Python 3.9 was used to execute our code. The study utilized ten different machine learning models and primarily relied on the Pandas and NumPy libraries for data pre-processing. Since the framework was developed in Python, the widely used Scikit-learn toolkit was utilized to implement the popular outlier methods LOF, OCSVM and IF, as well as the RF classifier. We obtained the Python code for usfAD from Aryal et al. (2021).

3.8 Performance Metrics

In this study, we use accuracy, precision, recall and F1-score, which are essential for assessing the performance of an IDS model. However, their significance can vary depending on the system’s specific objectives and requirements. Accuracy quantifies the proportion of correct classifications made by the IDS. Relying solely on accuracy is not suitable for IDS, as it might not reflect the system’s capability to identify attacks, which are often a minority class within the dataset. Precision is the proportion of true positive detections among all positive detections. High precision is essential in IDS in order to minimise false positives, which result in false alarms. Recall measures the system’s ability to identify all instances of a particular class of attack. Low recall suggests that the system is missing some attacks, which can pose a significant security risk.

The F1-score is the harmonic mean of precision and recall. It is a valuable metric for IDS because it considers both false positives and false negatives and provides a balanced score between precision and recall. The accuracy, precision, recall and F1-score are calculated as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$$
$$\text{Precision} = \frac{TP}{TP + FP} \times 100$$
$$\text{Recall} = \frac{TP}{TP + FN} \times 100$$
$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100$$

where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
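As a worked example of these formulas, the sketch below computes the four metrics from confusion-matrix counts. The function name `ids_metrics` and the counts are hypothetical; precision and recall are kept as fractions inside the function so that the factor of 100 is applied exactly once per metric.

```python
def ids_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1-score as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }

# Hypothetical counts: 80 attacks caught, 20 missed, 10 false alarms, 90 true negatives.
metrics = ids_metrics(tp=80, tn=90, fp=10, fn=20)
print({name: round(value, 2) for name, value in metrics.items()})
# {'accuracy': 85.0, 'precision': 88.89, 'recall': 80.0, 'f1': 84.21}
```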

4 Results and Discussion

In this section, we evaluate the effectiveness of the OCC model and RF classifier while detecting unknown attacks. We present the average accuracy, precision, recall, and F1-score for OCC and RF models.

4.1 Performance of Supervised Learning to Detect Unknown Attacks

In our experiment, we aimed to assess the effectiveness of the RF model in detecting unknown attacks in the context of IDS. In this section, we present the performance of an RF trained with and without uniformly distributed synthetic noise data labeled as attacks. First, we examine the effect of adding noise data on identifying attacks in a synthetic dataset with two features, so that the effect can be inspected visually. Second, we investigate the effectiveness of this approach on real IDS datasets restricted to their two most important features. We restrict the real IDS datasets to two features because real datasets are intricate and follow varied distributions, and when a random function is used to produce noise data with a substantial number of features, the noise does not span the entire 0 to 1 range. Consequently, such noise data does not significantly influence the detection of unseen attacks. Our findings show that when the model is trained on datasets with the complete set of features, the performance remains the same for RF with and without noise.

4.1.1 RF Model Trained on Synthetic Data with Noise

Our first solution to detect unknown attacks is to train a supervised model using random uniformly distributed data (labeled as an attack) along with the original dataset. To evaluate the impact of adding external noise to the results, we crafted synthetic datasets with a Gaussian distribution, visualized in Figure 5. Here, blue, crimson, and red clusters represent benign data, attack type 1, and attack type 2, respectively. We consider the synthetic datasets with two features, allowing for visualization of the model’s decision boundary and its predictive outcomes, illustrating its adaptability when encountering noise and its ability to identify unknown attacks.

Figure 5: Simulated datasets for RF’s efficiency
Figure 6: Training RF including noise instances and only normal instances
Figure 7: Predicted outcome of RF with noises and normal data only

In this experiment, we present the synthetic dataset with one normal cluster and two types of attacks in Figure 5. To show the capability of the RF model to detect unknown attacks, we simulate the unknown attack case by removing the middle attack type (crimson in Figure 5) during the training stage, as shown in Figure 6 (a). Figure 6 (b) presents the decision boundary of the model trained without attack type 1 and without noise. Next, we add noise during the training stage to make up for the unknown attacks. Figure 6 (c) displays the training datasets containing normal, attack type 2 and noise data (depicted as blue, red and orange points), and Figure 6 (d) shows the decision boundary of the RF model trained with noise.

In both cases, we make attack type 1 unknown to the model. Notably, the decision boundary shown in Figure 6 (b) is more constricted than the one in Figure 6 (d), attributable to the introduction of noise data. In Figure 7 (a), we observe that all instances of attack type 2 and most of attack type 1 are correctly identified because of the noise data, even though the model was not trained on attack type 1. Within this figure, a red square signifies a correctly classified attack, while an orange circle indicates a misclassification. Conversely, Figure 7 (b) shows that a model trained with only normal data and attack type 2 struggles to correctly identify attack type 1, as this attack type was not present in the training datasets: most instances of attack type 1 are identified as normal (orange circles). Figure 8 (a) and (b) display the average macro accuracy and F1-score on the synthetic datasets for two scenarios: RF (noise) and standard RF. Meanwhile, the recall and F1-score for the attack class are depicted in Figure 8 (c) and (d). As observed from the plots in Figure 8, the RF trained with noise exhibits a superior ability to identify unseen attacks compared to the standard RF.

Figure 8: Performance on synthetic datasets
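A small experiment in the spirit of the synthetic study above can be reproduced as follows. The cluster centres, spreads, sample sizes and RF settings are assumptions chosen for this sketch and do not correspond to the data behind Figures 5 to 8.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n = 300
normal = rng.normal(0.15, 0.03, size=(n, 2))   # benign cluster
attack1 = rng.normal(0.40, 0.02, size=(n, 2))  # middle attack type, held out of training
attack2 = rng.normal(0.90, 0.03, size=(n, 2))  # attack type seen during training
noise = rng.uniform(0.0, 1.0, size=(n, 2))     # uniform noise, labeled as attack

# Standard RF: trained on normal + attack type 2 only.
X_plain = np.vstack([normal, attack2])
y_plain = np.array([0] * n + [1] * n)
rf_plain = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_plain, y_plain)

# RF (noise): same data plus as many noise points as normal points.
X_noise = np.vstack([normal, attack2, noise])
y_noise = np.array([0] * n + [1] * n + [1] * n)
rf_noise = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_noise, y_noise)

# Recall on the unseen attack type 1 (every held-out instance is an attack).
y_true = np.ones(n, dtype=int)
recall_plain = recall_score(y_true, rf_plain.predict(attack1))
recall_noise = recall_score(y_true, rf_noise.predict(attack1))
print(f"recall without noise: {recall_plain:.2f}, with noise: {recall_noise:.2f}")
```

In this layout the unseen cluster lies on the benign side of the split that the standard RF learns between the two training clusters, whereas the noise points force the forest to enclose the benign cluster tightly.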

4.1.2 RF Model Trained with Benchmark IDS Data with Noise

The results showcased here are based on benchmark IDS datasets restricted to the two most important features, selected using Random Forest Li et al. (2020); Disha and Waheed (2022). We focus on just two features because using Random Forest (RF) on high-dimensional feature sets alongside noise leads to unstable results. Furthermore, introducing noise across a large number of dimensions incurs a substantial computational expense, which makes incorporating noise into the training of a supervised model for the detection of unseen attacks impractical. Below, we present the outcome of training an RF model on benchmark IDS datasets with added noise data labelled as attack. The experimental results are depicted in Figure 9, which illustrates a gradual drop in accuracy as different types of attacks are removed from the training dataset, while the testing data contains all attack types. The methodology section explains the process of removing various attack categories from the training data and computing the accuracy plotted in the graph. We conduct this experiment on 10 different benchmark IDS datasets: NSL-KDD, UNSW-NB15, ISCXURL2016, CIC-DoS2017, CIC-DDoS2019, CIC-Darknet2020, CIC-Malmem2022, ToN-IoT-Network, ToN-IoT-Linux, and XIIOTID.
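Selecting the two most important features with a Random Forest can be sketched through the impurity-based `feature_importances_` attribute of scikit-learn. The toy data below, in which only features 1 and 4 determine the label, are an assumption for illustration; the exact selection procedure used for the benchmark datasets may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Assumed toy data: six features, but only features 1 and 4 determine the label.
X = rng.uniform(size=(2000, 6))
y = ((X[:, 1] + X[:, 4]) > 1.0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by importance and keep the two highest-ranked ones.
top2 = np.argsort(rf.feature_importances_)[::-1][:2]
X_top2 = X[:, top2]
print(sorted(top2.tolist()), X_top2.shape)
```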

In Figure 9, we show the outcome of RF models in identifying previously unknown attack samples after the inclusion of noise. In order to enable the RF model to identify unknown attacks, we adopt a strategy that involves training the model with the original datasets combined with uniformly distributed random datasets. These additional datasets share the same number of features as the originals. In this approach, the added data instances are labeled as attacks, aligning with the goal of enhancing the model’s capability to recognize zero-day attacks. The count of noise records matches the number of normal instances. To evaluate the impact of introducing these random datasets on the RF model’s ability to detect unknown attack samples, we conducted training and testing using various training datasets. Each training dataset involves the removal of specific quantities of attack instances, but importantly, the introduced noise data (present in 80% of the training data) is retained in every case. The testing dataset, on the other hand, exclusively comprises original attack instances and does not include noise samples.

The graph depicted in Figure 9 showcases the outcomes of a series of experiments involving the removal of varying numbers of attack class instances during the model training process. On the x-axis, different scenarios are presented, each representing the removal of a specific count of attack type instances from the training dataset. The corresponding recall and F1-score for the attack class in each scenario are shown on the y-axis.

Figure 9: RF’s performance in detecting unknown attacks

Figure 9 presents the recall and F1-score of the RF model trained with and without noise data across scenarios in which varying numbers of attack types are excluded. With the two most important features, the RF model trained with noise and the one trained without noise exhibit identical performance when no attack type is excluded: a recall and F1-score of 83.03% and 72.69% for NSL-KDD, and 100% and 100% for ToN-IoT-Network. This suggests that, when all attack types are present in the training data, introducing noise does not have a discernible positive or negative effect on the model’s recall and F1-score. For C3, the RF model with noise data maintains better performance (58.93% and 45.16% for NSL-KDD, and 99.27% and 99.61% for UNSW-NB15) than the RF model without noise (31.96% and 25.26%, and 88.42% and 92.87%, respectively). As we move from C3 to C4 and C6, the gap in performance widens significantly. The trend suggests that as more attack types are excluded, the RF model without noise struggles more, highlighting the advantage of using noise in training for better generalization. The RF model with noise data achieves a recall and F1-score of 38.36% and 24.85% for NSL-KDD at C4, and 99.36% and 99.65% for ToN-IoT-Network at C6. In contrast, the RF model without noise fails entirely in these configurations, with 0% recall and F1-score on both datasets. This means that the standard RF model could not correctly classify any attack instances under these configurations, emphasizing the importance of the noise data when dealing with unseen attack types.

The RF model trained with noise data consistently shows superior performance compared to the one without noise, particularly when multiple attack types are removed. Introducing noise during training appears to bolster the model’s ability to recognize unfamiliar or less common attack variants. Such adaptability is vital in real-world contexts, where unpredictable attack types may arise.

Nonetheless, our experiments indicate that while introducing noise can enhance the effectiveness of supervised methods such as Random Forest (RF) in identifying certain novel attacks, the degree of improvement on high-dimensional real-world datasets is not substantial enough to warrant practical application. As depicted in Figure 10, there is essentially no discernible difference between the performance of the RF model with noise and without it. This outcome may stem from the fact that the randomly generated noise does not cover the full range of the feature space. Moreover, generating noise in a high-dimensional space requires considerable computational resources.

Figure 10: RF’s accuracy in detecting unseen attacks with full featured datasets

As a result, we are now focusing on exploring one-class classifiers or outlier detection methods as an alternative approach. Our goal is to detect network intrusions and compare the performance of this approach with the conventional binary Random Forest classifier in a real-life scenario.

4.2 Performance of Semi Supervised Learning or OCC

In this section, we discuss the results of the semi-supervised or OCC methods and their ensemble approaches. Table 2 reports the accuracy of several semi-supervised learning algorithms, including LOF, IF, OCSVM, VAE, AE, and usfAD, along with different ensemble approaches, on 10 distinct datasets: NSL-KDD, UNSW-NB15, ISCXURL2016, Malmem2022, CIC-DDoS2019, ToN-IoT-Network, CIC-Darknet2020, CIC-DoS2017, XIIOTID, and ToN-IoT-Linux. For the first three datasets (NSL-KDD, UNSW-NB15, and ISCXURL2016), VAE and AE exhibited almost the same performance. For this reason, and because VAE demands substantial memory on large IDS datasets, we excluded VAE from the evaluations on the other, larger datasets.

Table 2: Accuracy of outlier methods and their ensemble approaches

| Models | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 |
|---|---|---|---|---|---|---|---|---|---|---|
| LOF | 87.50 | 80.87 | 80.95 | 88.00 | 86.60 | 98.19 | 93.10 | 83.90 | 78.49 | 96.70 |
| VAE | 89.03 | 54.65 | 73.26 | | | | | | | |
| AE | 91.53 | 56.26 | 74.03 | 94.98 | 79.18 | 60.81 | 76.91 | 86.48 | 82.42 | 67.34 |
| OCSVM | 73.54 | 73.91 | 81.27 | 75.03 | 84.81 | 67.54 | 50.08 | 52.30 | 68.42 | 45.74 |
| IF | 90.28 | 56.95 | 65.09 | 90.36 | 70.00 | 63.21 | 78.30 | 88.94 | 71.05 | 66.58 |
| usfAD | 95.92 | 82.15 | 92.38 | 94.65 | 98.69 | 99.43 | 91.65 | 97.04 | 93.52 | 97.94 |
| Ensemble-Any One, Two, Three, Four, and Five | | | | | | | | | | |
| Ensemble-1 | 71.16 | 75.01 | 85.74 | 70.00 | 85.50 | 66.24 | 55.43 | 44.97 | 69.87 | 64.94 |
| Ensemble-2 | 89.97 | 79.81 | 88.92 | 86.60 | 94.46 | 91.38 | 84.03 | 83.50 | 91.04 | 90.81 |
| Ensemble-3 | 94.27 | 78.71 | 85.37 | 93.47 | 93.72 | 96.34 | 83.43 | 91.50 | 91.74 | 76.90 |
| Ensemble-4 | 94.54 | 60.82 | 76.82 | 97.98 | 82.09 | 68.92 | 83.56 | 94.15 | 80.13 | 72.90 |
| Ensemble-5 | 88.83 | 55.79 | 72.23 | 94.96 | 63.50 | 66.30 | 83.59 | 94.51 | 61.12 | 68.76 |

IDS benchmark datasets: D1 = NSL-KDD, D2 = UNSW-NB15, D3 = ISCXURL2016, D4 = Malmem2022, D5 = CIC-DDoS2019,
D6 = ToN-IoT-Network, D7 = Darknet2020, D8 = CIC-DoS2017, D9 = XIIOTID, D10 = ToN-IoT-Linux

usfAD is a standout performer, achieving high accuracy and F1-scores on these datasets, both individually and in ensemble settings. For example, usfAD achieved the highest accuracy of 95.92%, 82.15%, 92.38%, and 98.69% for NSL-KDD, UNSW-NB15, ISCXURL2016, and CIC-DDoS2019, respectively (Table 2). Particularly commendable performances by usfAD are evident on the ToN-IoT-Network and ToN-IoT-Linux datasets, where it achieved accuracies of 99.43% and 97.94%, respectively. However, among the individual models, AE shows slightly better accuracy on the Malmem2022 dataset, and the Ensemble-Any Four strategy outperforms the other outlier methods on this dataset with an accuracy of 97.98%. This underlines the efficacy of the ensemble strategy introduced in this study.

For the UNSW-NB15 dataset, LOF achieves an accuracy of 80.87% and a macro average F1-score of 80.61% (Tables 2 and 3), closely rivaling usfAD. Among the ensemble configurations, the Ensemble-Any Four approach parallels usfAD’s performance on the NSL-KDD dataset. On the UNSW-NB15 and ISCXURL2016 datasets, Ensemble-Any Two outperforms other outlier detection methods, including LOF, IF, OCSVM, VAE, and AE. It is noteworthy that VAE and AE yield nearly identical results. Due to this similarity, and given VAE’s intensive memory requirements, we opted to exclude VAE for the other datasets, including Malmem2022 and CIC-DDoS2019.

Table 3 presents the macro average F1-score results of the semi-supervised learning for 10 different IDS datasets. usfAD consistently delivers strong results across all datasets, underscoring its suitability for handling IDS datasets in terms of macro average F1-score.

Particularly commendable performances by usfAD are evident in the ToN-IoT-Network and ToN-IoT-Linux datasets, where it achieved F1-scores of 99.37% and 97.65% respectively. LOF also demonstrates a strong performance on the ToN-IoT-Linux dataset but has varying results on the other two datasets in terms of F1-score. Such results are on par with supervised algorithms like RF.

Table 3: Macro average F1-score of outlier methods and their ensemble approaches

| Models | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 |
|---|---|---|---|---|---|---|---|---|---|---|
| LOF | 87.48 | 80.61 | 76.58 | 87.99 | 88.29 | 98.03 | 88.37 | 64.03 | 75.88 | 96.35 |
| VAE | 89.01 | 53.75 | 69.46 | | | | | | | |
| AE | 91.52 | 55.93 | 70.18 | 94.97 | 75.58 | 42.55 | 51.73 | 58.49 | 81.75 | 54.86 |
| OCSVM | 72.24 | 69.56 | 70.82 | 73.37 | 75.38 | 67.53 | 44.07 | 42.56 | 67.97 | 43.31 |
| IF | 90.22 | 56.37 | 62.53 | 90.26 | 54.03 | 46.21 | 51.94 | 58.72 | 67.13 | 45.74 |
| usfAD | 95.91 | 81.84 | 87.61 | 94.64 | 98.04 | 99.37 | 84.06 | 88.49 | 93.32 | 97.65 |
| Ensemble-1 | 69.22 | 70.04 | 72.86 | 67.04 | 72.53 | 66.20 | 53.32 | 38.44 | 69.23 | 64.94 |
| Ensemble-2 | 89.95 | 79.19 | 83.03 | 86.36 | 91.31 | 90.96 | 76.03 | 65.79 | 90.98 | 90.08 |
| Ensemble-3 | 94.27 | 78.46 | 80.83 | 93.44 | 91.16 | 96.06 | 66.67 | 73.05 | 91.55 | 68.62 |
| Ensemble-4 | 94.50 | 60.39 | 72.93 | 97.98 | 82.10 | 51.28 | 55.75 | 69.97 | 77.92 | 57.86 |
| Ensemble-5 | 88.58 | 54.59 | 69.10 | 94.94 | 50.73 | 43.38 | 51.03 | 62.93 | 46.94 | 46.60 |

D1 = NSL-KDD, D2 = UNSW-NB15, D3 = ISCXURL2016, D4 = Malmem2022, D5 = CIC-DDoS2019,
D6 = ToN-IoT-Network, D7 = Darknet2020, D8 = CIC-DoS2017, D9 = XIIOTID, D10 = ToN-IoT-Linux

Tables 4 and 5 present the performance of the outlier detection models and their ensemble approaches in terms of the average precision and recall, over 10-fold stratified cross validation, for the attack class only, across 10 distinct datasets. Recall, also known as sensitivity or true positive rate, is a crucial metric in the context of outlier detection for security-related tasks. High recall is desirable in scenarios where missing any attack instance is considered highly detrimental, as is often the case in cybersecurity. Our goal in this work is to accurately identify the attack class while taking into account the importance of such classification in cybersecurity applications. In cybersecurity, misclassifying an attack as normal poses a greater threat to the system than incorrectly labeling a normal instance as an attack. Because of this, we focus on the average recall and precision of the attack class for the outlier/OCC techniques.
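For reference, attack-class precision and recall can be computed from binary predictions as in the following minimal sketch. The label encoding (1 = attack, 0 = normal), the function names, and the toy folds are illustrative assumptions rather than the evaluation code used for the tables.

```python
def attack_precision_recall(y_true, y_pred, attack_label=1):
    """Return (precision, recall) for the attack class from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == attack_label and p == attack_label)
    fp = sum(1 for t, p in pairs if t != attack_label and p == attack_label)
    fn = sum(1 for t, p in pairs if t == attack_label and p != attack_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def mean_over_folds(fold_scores):
    """Average (precision, recall) pairs obtained on separate folds."""
    n = len(fold_scores)
    return (
        sum(p for p, _ in fold_scores) / n,
        sum(r for _, r in fold_scores) / n,
    )

# Two toy folds; 1 = attack, 0 = normal (hypothetical encoding).
fold_a = attack_precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
fold_b = attack_precision_recall([1, 0, 0, 1], [1, 0, 0, 1])
print(mean_over_folds([fold_a, fold_b]))
```

A missed attack lowers recall (a false negative), while a false alarm lowers precision (a false positive), which is why the two metrics are reported separately.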

Table 4: Precision of outlier methods and their ensemble approaches for attack class

| Models | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 |
|---|---|---|---|---|---|---|---|---|---|---|
| LOF | 87.13 | 96.86 | 96.39 | 86.21 | 95.08 | 95.08 | 76.95 | 24.75 | 96.32 | 92.66 |
| VAE | 89.11 | 85.73 | 96.12 | | | | | | | |
| AE | 89.65 | 86.80 | 96.23 | 90.96 | 25.48 | 25.48 | 22.10 | 18.89 | 84.82 | 52.47 |
| OCSVM | 64.74 | 75.59 | 87.07 | 66.69 | 51.84 | 51.84 | 17.29 | 10.60 | 58.74 | 27.27 |
| IF | 93.82 | 92.28 | 96.40 | 83.86 | 39.40 | 39.40 | 24.45 | 20.64 | 82.99 | 49.57 |
| usfAD | 96.47 | 97.04 | 92.78 | 90.34 | 98.39 | 98.39 | 77.69 | 72.50 | 96.90 | 99.29 |
| Ensemble-1 | 62.53 | 75.33 | 85.83 | 62.50 | 50.85 | 50.85 | 27.76 | 10.14 | 59.45 | 48.77 |
| Ensemble-2 | 84.03 | 90.97 | 92.27 | 78.88 | 80.20 | 80.20 | 52.01 | 26.59 | 85.83 | 80.62 |
| Ensemble-3 | 93.74 | 95.41 | 95.91 | 88.45 | 90.52 | 90.52 | 52.36 | 40.10 | 92.18 | 83.69 |
| Ensemble-4 | 98.39 | 98.00 | 97.04 | 96.15 | 89.14 | 89.14 | 60.38 | 56.18 | 97.17 | 95.49 |
| Ensemble-5 | 99.64 | 99.56 | 98.14 | 99.51 | 91.16 | 91.16 | 79.67 | 77.26 | 97.07 | 98.41 |

D1 = NSL-KDD, D2 = UNSW-NB15, D3 = ISCXURL2016, D4 = Malmem2022, D5 = CIC-DDoS2019,
D6 = ToN-IoT-Network, D7 = Darknet2020, D8 = CIC-DoS2017, D9 = XIIOTID, D10 = ToN-IoT-Linux

Across the CIC-DDoS2019 and ToN-IoT-Network datasets, LOF displays relatively high recall values, indicating its effectiveness in identifying attacks. Its precision is generally good, suggesting a balanced performance. The precision and recall of the Autoencoder (AE) vary greatly across datasets. While it demonstrates high precision on Malmem2022, its recall is generally low, implying that it struggles to capture all positive (attack) instances, especially on CIC-DDoS2019 and ToN-IoT-Network.

Table 5 shows that, among the individual models, usfAD achieves high recall across all datasets except Darknet2020 and ToN-IoT-Linux. For example, on the XIIOTID dataset, usfAD achieves a precision of 96.90% and a recall of 87.92%. Similarly, on the ToN-IoT-Linux dataset, it reaches an impressive 99.29% precision and 94.49% recall. For Darknet2020 and ToN-IoT-Linux, the LOF model shows relatively higher recall among the individual models.

Table 5: Recall of outlier methods and their ensemble approaches for attack class

| Models | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 |
|---|---|---|---|---|---|---|---|---|---|---|
| LOF | 86.86 | 72.42 | 78.78 | 90.46 | 100.00 | 100.00 | 85.39 | 75.64 | 52.57 | 97.89 |
| VAE | 87.95 | 34.14 | 68.84 | | | | | | | |
| AE | 93.15 | 37.22 | 69.78 | 99.88 | 6.35 | 6.35 | 13.64 | 34.49 | 72.60 | 22.13 |
| OCSVM | 98.86 | 87.39 | 89.53 | 100.00 | 100.00 | 100.00 | 50.37 | 87.92 | 92.19 | 37.51 |
| IF | 85.45 | 35.62 | 57.87 | 99.98 | 10.03 | 10.03 | 12.36 | 27.16 | 42.12 | 6.91 |
| usfAD | 94.99 | 74.33 | 97.95 | 100.00 | 100.00 | 100.00 | 70.40 | 85.79 | 87.92 | 94.49 |
| Ensemble-1 | 100.00 | 90.53 | 98.09 | 100.00 | 100.00 | 100.00 | 99.50 | 97.83 | 96.83 | 99.03 |
| Ensemble-2 | 97.72 | 75.95 | 93.80 | 100.00 | 100.00 | 100.00 | 77.61 | 91.15 | 95.12 | 95.40 |
| Ensemble-3 | 94.40 | 70.05 | 85.07 | 99.99 | 100.00 | 100.00 | 36.62 | 69.20 | 88.52 | 38.25 |
| Ensemble-4 | 90.12 | 39.50 | 72.81 | 99.96 | 12.54 | 12.54 | 12.48 | 34.89 | 55.97 | 19.74 |
| Ensemble-5 | 77.06 | 30.96 | 66.01 | 90.36 | 3.84 | 3.84 | 5.97 | 17.92 | 10.97 | 6.51 |

D1 = NSL-KDD, D2 = UNSW-NB15, D3 = ISCXURL2016, D4 = Malmem2022, D5 = CIC-DDoS2019,
D6 = ToN-IoT-Network, D7 = Darknet2020, D8 = CIC-DoS2017, D9 = XIIOTID, D10 = ToN-IoT-Linux

Our examination of the recall metrics reveals that although usfAD exhibits commendable performance across various datasets, its ability to detect the majority of attacks is limited in specific cases, such as the UNSW-NB15, Darknet2020, and XIIOTID datasets. In practical scenarios, achieving a high rate of attack detection is crucial. To address this shortfall, one could consider employing various ensemble strategies using unsupervised models. As shown in Table 5, the Ensemble-1 approach, which flags an attack if any individual model reports one, achieves superior recall rates at the cost of precision on datasets such as NSL-KDD, UNSW-NB15, and Darknet2020. Depending on the operational requirements, system administrators can decide which aspect, recall or precision, is more critical for their situation. By ensembling multiple models, one can enhance the recall rate of attacks, albeit at the cost of an increased number of false positives (where normal instances are incorrectly identified as attacks).
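The voting rule behind these ensembles can be sketched as follows. Ensemble-1 is described above as flagging an attack if any individual model reports one; generalizing this to "at least k of the models" for Ensemble-k is an assumption made here for illustration, as are the function name, the model order, and the toy predictions.

```python
def ensemble_any_k(model_flags, k):
    """Flag an instance as attack when at least k models flag it.

    model_flags: one list per model, each holding 0 (normal) or 1 (attack)
    for every instance, in the same instance order.
    """
    if not 1 <= k <= len(model_flags):
        raise ValueError("k must be between 1 and the number of models")
    return [1 if sum(votes) >= k else 0 for votes in zip(*model_flags)]

# Hypothetical outputs of five detectors on four instances.
flags = [
    [1, 0, 0, 1],  # e.g. LOF
    [1, 0, 0, 0],  # e.g. AE
    [1, 1, 0, 1],  # e.g. OCSVM
    [1, 0, 0, 0],  # e.g. IF
    [1, 0, 0, 1],  # e.g. usfAD
]
for k in range(1, 6):
    print(k, ensemble_any_k(flags, k))
```

Raising k can only remove attack flags, never add them, so recall is non-increasing in k, in line with the pattern in Table 5; precision typically rises, as in Table 4.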

4.2.1 Performance of Ensemble Approaches

We present the performance of different ensemble approaches in Figure 11. As observed in Figure 11, some ensemble approaches have high recall but lower precision, while others exhibit high precision but lower recall. The choice between these models depends on the specific application’s priorities. For instance, in security applications where missing any attack instance is critical, models with higher recall might be preferred.

A trade-off between precision and recall needs to be considered. In situations where avoiding false negatives (missed attacks) is critical, models or ensembles with higher recall, such as usfAD and certain ensemble approaches, might be preferred. However, in scenarios where precision is more important, trade-offs must be carefully evaluated.
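One simple way to act on this trade-off is to fix a minimum acceptable recall and then pick the configuration with the highest precision among those that meet it. The sketch below applies that rule to the NSL-KDD (D1) attack-class values from Tables 4 and 5; the selection rule, the function name, and the recall floors are illustrative assumptions, not a procedure proposed in the paper.

```python
# Attack-class (precision, recall) on NSL-KDD (D1), copied from Tables 4 and 5.
nsl_kdd = {
    "Ensemble-1": (62.53, 100.00),
    "Ensemble-2": (84.03, 97.72),
    "Ensemble-3": (93.74, 94.40),
    "Ensemble-4": (98.39, 90.12),
    "Ensemble-5": (99.64, 77.06),
}

def pick_by_recall_floor(scores, min_recall):
    """Return the name with the highest precision among those meeting min_recall."""
    eligible = {name: pr for name, pr in scores.items() if pr[1] >= min_recall}
    if not eligible:
        return None  # no configuration satisfies the recall requirement
    return max(eligible, key=lambda name: eligible[name][0])

print(pick_by_recall_floor(nsl_kdd, 95.0))  # Ensemble-2
print(pick_by_recall_floor(nsl_kdd, 90.0))  # Ensemble-4
```

A stricter recall floor pushes the choice toward ensembles with a smaller k, which accept more false positives in exchange for fewer missed attacks.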

Figure 11: Performance of different ensemble approaches for attack class

4.3 RF vs. usfAD: Detecting Unknown Attacks

Figures 12 and 13 (a) to (d) illustrate the performance of two models, the RF model and usfAD (an outlier detection method), in terms of their accuracy and F1-score in detecting unknown attack classes for the NSL-KDD, UNSW-NB15, CIC-DDoS2019, and ToN-IoT-Network datasets. The scenarios in the graphs are defined as follows. C0: the RF model is trained with all types of attack instances that are present in the testing dataset. C1: the RF model is trained excluding one type of attack from the training dataset, so this particular attack type becomes “unknown” during testing. C2: the RF model is trained excluding two types of attacks from the training dataset, leading to two “unknown” attack types during testing. C3: similarly, the RF model is trained excluding three types of attacks. C4: the RF model is trained excluding four types of attacks, and so on. In all these scenarios, however, the usfAD model is trained only on normal instances, meaning it does not need knowledge of any attack types during the training phase.
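The scenario construction described above can be sketched as a filter on the training set. The row layout, the function name, and in particular the order in which attack types are excluded are hypothetical; this section does not state which types are removed at each step.

```python
def make_scenario(train_rows, excluded_types):
    """Drop every training row whose attack type is in excluded_types.

    train_rows: list of (features, attack_type) pairs, where attack_type is
    "normal" for benign traffic. The test set is left untouched, so the
    excluded types appear there as unknown attacks.
    """
    excluded = set(excluded_types)
    return [(x, t) for x, t in train_rows if t not in excluded]

# Hypothetical toy training set and exclusion order.
train = [
    ([0.1], "normal"), ([0.9], "dos"), ([0.8], "probe"),
    ([0.7], "r2l"), ([0.2], "normal"), ([0.6], "u2r"),
]
exclusion_order = ["dos", "probe", "r2l", "u2r"]

# C0 keeps everything; Ck removes the first k attack types from training.
scenarios = {
    f"C{k}": make_scenario(train, exclusion_order[:k])
    for k in range(len(exclusion_order) + 1)
}
for name, rows in scenarios.items():
    print(name, sorted({t for _, t in rows}))
```

In this toy setup the last scenario leaves only normal rows for training, so a binary supervised classifier has no attack examples left to learn from.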

Figure 12: Comparison of RF and usfAD in terms of accuracy and F1-score
  • RF model: In the C0 scenario, the RF model has an extremely high accuracy of 99.49%, 95.12%, 99.94%, and 99.99% (as shown in Figure 4 (a), (c) and 9 (a), (c)) and F1-score of 99.47%, 96.18%, 99.92%, and 99.99% (as shown in Figure 4 (b), (d) and 9 (b), (d)) on the NSL-KDD, UNSW-NB15, CIC-DDoS2019, and ToN-IoT-Network datasets. This is expected since it is trained with all types of attacks present in the test data, and it suggests that its recall and precision are both high when all attack types are known during training. As we move from C1 to C4 for NSL-KDD and from C1 to C7 for UNSW-NB15, we notice a decline in RF’s accuracy and F1-score. This decline corresponds to the increasing number of “unknown” attack types (those that RF was not trained on). For NSL-KDD, by C4, where RF is not trained on four types of attacks, its accuracy and F1-score drop to 51.88% and 0% respectively, implying that the RF model fails completely in terms of both precision and recall for detecting the “unknown” attacks. For the other datasets, UNSW-NB15, CIC-DDoS2019, and ToN-IoT-Network, we notice a consistent decline in both accuracy and F1-score as we exclude increasing numbers of attack types from the training dataset (C7: 54.36% and 43.87% for UNSW-NB15, C5: 22.68% and 0% for CIC-DDoS2019, C6: 65.07% and 0% for ToN-IoT-Network).

  • usfAD: The performance of usfAD remains consistent across all scenarios, with an accuracy of 95.92%, 82.15%, 98.69%, and 99.43% and an F1-score of 94.65%, 84.18%, 99.19%, and 99.37% for NSL-KDD, UNSW-NB15, CIC-DDoS2019, and ToN-IoT-Network. This implies that regardless of the “unknown” attack types in the test data, usfAD can detect anomalies with the same efficiency. usfAD consistently maintains a balanced precision and recall. This consistent performance can be attributed to the fact that usfAD is trained only on normal instances and focuses on spotting deviations from this norm.

Figure 13: Comparison of RF and usfAD in terms of accuracy and F1-score

While the RF model showed slightly higher accuracy and precision than usfAD, it necessitates training with a large number of attack samples and struggles to detect unknown attacks. In contrast, usfAD does not need any attack samples, making it better suited for detecting unknown attacks in real-world situations. The results indicate the robustness of the usfAD model in terms of both accuracy and F1-score. Despite the changing landscape of “unknown” attacks in the test data, it consistently performs well. On the other hand, the RF model, while performing admirably when trained with all attack types, sees a dramatic decline in both accuracy and F1-score as more attack types are omitted during training. By the time we reach C4, C7, C5, and C6 for the NSL-KDD, UNSW-NB15, CIC-DDoS2019, and ToN-IoT-Network datasets, respectively, RF’s ability to detect “unknown” attacks deteriorates considerably, emphasizing the challenges of using supervised models when the nature of threats evolves or is not entirely known during training.
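The reason a one-class model is unaffected by which attack types are withheld can be seen in a deliberately simple stand-in. The detector below is not usfAD; it is a toy range check, with hypothetical names and data, that only illustrates the training interface: it is fitted on benign rows alone, so its decisions cannot depend on the attack types present in or absent from a training set.

```python
class RangeDetector:
    """Toy one-class detector: flags points outside the benign feature ranges.

    This is a simple illustrative stand-in, not usfAD.
    """

    def fit(self, benign_rows):
        columns = list(zip(*benign_rows))
        self.lows = [min(col) for col in columns]
        self.highs = [max(col) for col in columns]
        return self

    def predict(self, rows):
        # 1 = attack (outside the benign range on any feature), 0 = normal.
        return [
            int(any(v < lo or v > hi
                    for v, lo, hi in zip(row, self.lows, self.highs)))
            for row in rows
        ]

benign_train = [[0.10, 5.0], [0.40, 7.5], [0.30, 6.0]]
detector = RangeDetector().fit(benign_train)  # no attack samples involved

test_rows = [[0.20, 6.5], [0.95, 6.0], [0.30, 42.0]]
print(detector.predict(test_rows))  # [0, 1, 1]
```

Any instance that deviates from the learned benign profile is flagged, whether or not an attack of that kind has ever been observed.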

4.4 Comparison of the Proposed Framework with the State-of-the-art Works

Hairab et al. (2022) explored three testing scenarios for the BoT-IoT dataset: Scenario A involving normal data and DDoS attacks, Scenario B with normal data and OS Fingerprint attacks, and Scenario C with normal data and Service Scan attacks. In our research, we included all attack types in the testing set and observed that the recall rate for attacks was 100%. This finding indicates that there is no need to separate the attack types into distinct scenarios, since we achieve the same outcome for all three scenarios. Table 6 shows that the usfAD algorithm (which we first utilized to detect zero-day attacks in IDS) is superior, effectively handling not just specific but all categories of attacks. In addition, our experiments show that LOF (Local Outlier Factor) also yields comparatively better results than other existing methods on the BoT-IoT dataset, especially in detecting attacks in a zero-day attack scenario.

Table 6: Comparison of our model with Hairab et al.’s model

| Model | Accuracy | Precision (Normal) | Precision (Attack) | Recall (Normal) | Recall (Attack) | F1-score (Normal) | F1-score (Attack) |
|---|---|---|---|---|---|---|---|
| CNN L2-A, Hairab et al. (2022) | 99.98 | 99.96 | 99.99 | 99.98 | 99.98 | 99.97 | 99.99 |
| CNN L2-B, Hairab et al. (2022) | 98.49 | 95.01 | 99.99 | 99.98 | 97.90 | 97.43 | 98.93 |
| CNN L2-C, Hairab et al. (2022) | 90.75 | 75.55 | 99.99 | 99.98 | 87.05 | 86.07 | 93.08 |
| Proposed method-1 (LOF) | 98.26 | 100 | 96.69 | 96.58 | 100 | 98.26 | 98.32 |
| Proposed method-2 (usfAD) | 99.66 | 100 | 99.22 | 99.21 | 100 | 99.6 | 99.61 |
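As a consistency check on Table 6, the per-class F1-scores of the two proposed methods can be recomputed from the tabulated precision and recall using the standard harmonic-mean definition. The function name below is illustrative; the input values are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) in percent, copied from Table 6.
table6 = {
    "LOF, normal": (100.0, 96.58),
    "LOF, attack": (96.69, 100.0),
    "usfAD, normal": (100.0, 99.21),
    "usfAD, attack": (99.22, 100.0),
}
for name, (p, r) in table6.items():
    print(name, round(f1_score(p, r), 2))
```

The recomputed values agree with the tabulated F1-scores (98.26, 98.32, 99.6, and 99.61) to within rounding.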

Mbona and Eloff (2022) applied four methods, namely the Gaussian mixture model, OCSVM, label spreading, and label propagation, to detect zero-day attacks in IDS. They utilized four different datasets: UNSW-NB15, CIC-DDoS2019, IoT Intrusion 2020, and CIC-DoHBoW2020. In Figure 14, we can observe that, except on the UNSW-NB15 dataset, the proposed method-2 (usfAD) outperforms the existing state-of-the-art methods Moustafa and Slay (2015); Sharafaldin et al. (2019); MontazeriShatoori et al. (2020) in terms of precision, recall, and F1-score. Although state-of-the-art methods show enhanced performance on the UNSW-NB15 dataset, our experiments revealed relatively lower effectiveness for most techniques, including LOF, OCSVM, and IF, on this dataset.

Figure 14: Comparison of the proposed approaches with the state-of-the-art methods

Sameera and Shashi (2020) and other researchers Zhao et al. (2019a), Taghiyarrenani et al. (2018), Zhao et al. (2017) applied deep learning approaches, mainly focused on transfer learning, to detect zero-day attacks on the NSL-KDD dataset. In Figure 15, we compare the accuracy of our method (the accuracy for one of the 10 folds) with the state-of-the-art methods Sameera and Shashi (2020), Zhao et al. (2019a), Taghiyarrenani et al. (2018), Zhao et al. (2017). We observed that the proposed method, usfAD, outperforms the existing transfer learning approaches on NSL-KDD in terms of accuracy.

Figure 15: Comparison of the proposed method with the state-of-the-art methods on NSL-KDD

5 Conclusion

In a practical setting, conventional supervised classification-based IDS for enforcing network security are inappropriate and ineffective because their efficacy depends on the availability of a large number of attack samples. However, it is difficult to acquire attack samples from security-related applications, as the nature of attacks changes frequently. To resolve this issue, we explored two strategies: 1) training supervised learning with datasets of uniformly generated random noise to detect potential future attacks; and 2) examining the efficacy of outlier methods and their varied ensemble approaches. Our experiments demonstrated that artificially simulated attack samples with supervised learning are ineffective for detecting unknown attacks. However, our findings indicate that the outlier methods can produce higher accuracy and F1-scores for the majority of benchmark datasets. usfAD is more effective than other widely used outlier techniques for detecting normal and attack classes in an intrusion detection system (IDS). In addition, we demonstrated that a combination of models and ensemble methods could be used to maximize the efficacy of outlier detection in security applications. By selecting and combining models with high recall and precision, it is possible to build a robust system capable of accurately identifying attacks while minimizing false negatives and false positives. Lastly, models such as usfAD and the ensemble approaches that consistently demonstrate high recall values stand out as effective options for identifying attack instances across a variety of datasets. In our future research, we aim to explore the potential of usfAD for hierarchical multi-classification to identify various cyberattack types, such as DoS, Ransomware, Spyware, and Trojan Horse. In addition, we need to devise an appropriate strategy for generating noise labeled as attacks across the entirety of the feature space, which will aid supervised learning in detecting previously unobserved attacks.

Declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Acknowledgments

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA2386-23-1-4003.

Author statements

Md. Ashraf Uddin: Conceptualization; Data curation; Implementation; Roles/Writing-original draft; and Writing; Visualization; Formal analysis. Sunil Aryal: Funding acquisition; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Roles/Writing-original draft; and Writing - review & editing. Mohamed Reda Bouadjenek: Conceptualization; Project administration. Muna Al-Hawawreh: Review & editing. Md. Alamin Talukder: Data curation; Implementation; Visualization.

References

  • Agate et al. (2024) Agate, V., Ferraro, P., Re, G. L., and Das, S. K. (2024). Blind: A privacy preserving truth discovery system for mobile crowdsensing. Journal of Network and Computer Applications, 223:103811.
  • Aghaei and Serpen (2019) Aghaei, E. and Serpen, G. (2019). Host-based anomaly detection using eigentraces feature extraction and one-class classification on system call trace data. arXiv preprint arXiv:1911.11284.
  • Al-Hawawreh et al. (2022) Al-Hawawreh, M., Sitnikova, E., and Aboutorab, N. (2022). X-iiotid: A connectivity-agnostic and device-agnostic intrusion data set for industrial internet of things. IEEE Internet of Things Journal, 9(5):3962–3977.
  • Al-Qudah et al. (2023) Al-Qudah, M., Ashi, Z., Alnabhan, M., and Abu Al-Haija, Q. (2023). Effective one-class classifier model for memory dump malware detection. Journal of Sensor and Actuator Networks, 12(1):5.
  • Alazzam et al. (2022) Alazzam, H., Sharieh, A., and Sabri, K. E. (2022). A lightweight intelligent network intrusion detection system using ocsvm and pigeon inspired optimizer. Applied Intelligence, 52(4):3527–3544.
  • An and Cho (2015) An, J. and Cho, S. (2015). Variational autoencoder based anomaly detection using reconstruction probability. Special lecture on IE, 2(1):1–18.
  • Anand and Saifulla (2023) Anand, N. and Saifulla, M. (2023). An efficient ids for slow rate http/2.0 dos attacks using one class classification. In 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), pages 1–9. IEEE.
  • Arregoces et al. (2022) Arregoces, P., Vergara, J., Gutiérrez, S. A., and Botero, J. F. (2022). Network-based intrusion detection: A one-class classification approach. In NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, pages 1–6. IEEE.
  • Aryal (2018) Aryal, S. (2018). Anomaly detection technique robust to units and scales of measurement. In Proceedings of the 2018 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018), pages 589–601, Cham. Springer International Publishing.
  • Aryal et al. (2021) Aryal, S., Santosh, K., and Dazeley, R. (2021). usfad: a robust anomaly detector based on unsupervised stochastic forest. International Journal of Machine Learning and Cybernetics, 12:1137–1150.
  • Aryal and Wells (2021) Aryal, S. and Wells, J. R. (2021). Ensemble of local decision trees for anomaly detection in mixed data. In Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part I 21, pages 687–702. Springer.
  • Belenguer et al. (2023) Belenguer, A., Pascual, J. A., and Navaridas, J. (2023). Göwfed: A novel federated network intrusion detection system. Journal of Network and Computer Applications, 217:103653.
  • Bezerra et al. (2019) Bezerra, V. H., da Costa, V. G. T., Barbon Junior, S., Miani, R. S., and Zarpelão, B. B. (2019). Iotds: A one-class classification approach to detect botnets in internet of things devices. Sensors, 19(14):3188.
  • Breunig et al. (2000) Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 93–104.
  • Carrier et al. (2022) Carrier, T., Victor, P., Tekeoglu, A., and Lashkari, A. H. (2022). Detecting obfuscated malware using memory feature engineering. In ICISSP, pages 177–188.
  • da Silva et al. (2016) da Silva, E. G., da Silva, A. S., Wickboldt, J. A., Smith, P., Granville, L. Z., and Schaeffer-Filho, A. (2016). A one-class nids for sdn-based scada systems. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), volume 1, pages 303–312. IEEE.
  • Dini et al. (2022) Dini, P., Begni, A., Ciavarella, S., De Paoli, E., Fiorelli, G., Silvestro, C., and Saponara, S. (2022). Design and testing novel one-class classifier based on polynomial interpolation with application to networking security. IEEE Access, 10:67910–67924.
  • Disha and Waheed (2022) Disha, R. A. and Waheed, S. (2022). Performance analysis of machine learning models for intrusion detection system using gini impurity-based weighted random forest (giwrf) feature selection technique. Cybersecurity, 5(1):1.
  • Fahad et al. (2017) Fahad, U. M., Muhammad, S., and Bi, Y. (2017). Applying one-class classification techniques to ip flow records for intrusion detection. Baltic Journal of Modern Computing, 5(1):70–86.
  • Fernando and Webb (2017) Fernando, T. L. and Webb, G. I. (2017). Simusf: an efficient and effective similarity measure that is invariant to violations of the interval scale assumption. Data mining and knowledge discovery, 31:264–286.
  • Gu and Lu (2021) Gu, J. and Lu, S. (2021). An effective intrusion detection approach using svm with naïve bayes feature embedding. Computers & Security, 103:102158.
  • Guo et al. (2023) Guo, G., Pan, X., Liu, H., Li, F., Pei, L., and Hu, K. (2023). An iot intrusion detection system based on ton iot network dataset. In 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), pages 0333–0338. IEEE.
  • Hairab et al. (2022) Hairab, B. I., Elsayed, M. S., Jurcut, A. D., and Azer, M. A. (2022). Anomaly detection based on cnn and regularization techniques against zero-day attacks in iot networks. IEEE Access, 10:98427–98440.
  • Injadat et al. (2020) Injadat, M., Moubayed, A., Nassif, A. B., and Shami, A. (2020). Multi-stage optimized machine learning framework for network intrusion detection. IEEE Transactions on Network and Service Management, 18(2):1803–1816.
  • Jazi et al. (2017) Jazi, H. H., Gonzalez, H., Stakhanova, N., and Ghorbani, A. A. (2017). Detecting http-based application layer dos attacks on web servers in the presence of sampling. Computer Networks, 121:25–36.
  • Khan and Madden (2014) Khan, S. S. and Madden, M. G. (2014). One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29(3):345–374.
  • Khraisat et al. (2020) Khraisat, A., Gondal, I., Vamplew, P., Kamruzzaman, J., and Alazab, A. (2020). Hybrid intrusion detection system based on the stacking ensemble of c5 decision tree classifier and one class support vector machine. Electronics, 9(1):173.
  • Kilincer et al. (2021) Kilincer, I. F., Ertam, F., and Sengur, A. (2021). Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Computer Networks, 188:107840.
  • Kilincer et al. (2022) Kilincer, I. F., Ertam, F., and Sengur, A. (2022). A comprehensive intrusion detection framework using boosting algorithms. Computers and Electrical Engineering, 100:107869.
  • Li et al. (2020) Li, X., Chen, W., Zhang, Q., and Wu, L. (2020). Building auto-encoder intrusion detection system based on random forest feature selection. Computers & Security, 95:101851.
  • Liu et al. (2021) Liu, C., Gu, Z., and Wang, J. (2021). A hybrid intrusion detection system based on scalable k-means+ random forest and deep learning. Ieee Access, 9:75729–75740.
  • Liu et al. (2008) Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). Isolation forest. In 2008 eighth ieee international conference on data mining, pages 413–422. IEEE.
  • Mahmood et al. (2024) Mahmood, T., Li, J., Saba, T., Rehman, A., and Ali, S. (2024). Energy optimized data fusion approach for scalable wireless sensor network using deep learning-based scheme. Journal of Network and Computer Applications, page 103841.
  • Mamun et al. (2016) Mamun, M. S. I., Rathore, M. A., Lashkari, A. H., Stakhanova, N., and Ghorbani, A. A. (2016). Detecting malicious urls using lexical analysis. In Network and System Security: 10th International Conference, NSS 2016, Taipei, Taiwan, September 28-30, 2016, Proceedings 10, pages 467–482. Springer.
  • Mbona and Eloff (2022) Mbona, I. and Eloff, J. H. (2022). Detecting zero-day intrusion attacks using semi-supervised machine learning approaches. IEEE Access, 10:69822–69838.
  • Mhamdi et al. (2020) Mhamdi, L., McLernon, D., El-Moussa, F., Zaidi, S. A. R., Ghogho, M., and Tang, T. (2020). A deep learning approach combining autoencoder with one-class svm for ddos attack detection in sdns. In 2020 IEEE Eighth International Conference on Communications and Networking (ComNet), pages 1–6. IEEE.
  • Min et al. (2021) Min, B., Yoo, J., Kim, S., Shin, D., and Shin, D. (2021). Network anomaly detection using memory-augmented deep autoencoder. IEEE Access, 9:104695–104706.
  • MontazeriShatoori et al. (2020) MontazeriShatoori, M., Davidson, L., Kaur, G., and Lashkari, A. H. (2020). Detection of doh tunnels using time-series classification of encrypted traffic. In 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), pages 63–70. IEEE.
  • Moustafa (2021) Moustafa, N. (2021). A new distributed architecture for evaluating ai-based security systems at the edge: Network ton_iot datasets. Sustainable Cities and Society, 72:102994.
  • Moustafa and Slay (2015) Moustafa, N. and Slay, J. (2015). The significant features of the unsw-nb15 and the kdd99 data sets for network intrusion detection systems. In 2015 4th international workshop on building analysis datasets and gathering experience returns for security (BADGERS), pages 25–31. IEEE.
  • Naseri and Gharehchopogh (2022) Naseri, T. S. and Gharehchopogh, F. S. (2022). A feature selection based on the farmland fertility algorithm for improved intrusion detection systems. Journal of Network and Systems Management, 30(3):40.
  • Negandhi et al. (2019) Negandhi, P., Trivedi, Y., and Mangrulkar, R. (2019). Intrusion detection system using random forest on the nsl-kdd dataset. In Emerging Research in Computing, Information, Communication and Applications: ERCICA 2018, Volume 2, pages 519–531. Springer.
  • Nguyen et al. (2018) Nguyen, Q. T., Tran, K. P., Castagliola, P., Huong, T. T., Nguyen, M. K., and Lardjane, S. (2018). Nested one-class support vector machines for network intrusion detection. In 2018 IEEE Seventh International Conference on Communications and Electronics (ICCE), pages 7–12. IEEE.
  • Rousseeuw (1985) Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. Mathematical statistics and applications, 8(283-297):37.
  • Roy et al. (2022) Roy, S., Li, J., Choi, B.-J., and Bai, Y. (2022). A lightweight supervised intrusion detection mechanism for iot networks. Future Generation Computer Systems, 127:276–285.
  • Sameera and Shashi (2020) Sameera, N. and Shashi, M. (2020). Deep transductive transfer learning framework for zero-day attack detection. ICT Express, 6(4):361–367.
  • Sánchez et al. (2021) Sánchez, P. M. S., Valero, J. M. J., Celdrán, A. H., Bovet, G., Pérez, M. G., and Pérez, G. M. (2021). A survey on device behavior fingerprinting: Data sources, techniques, application scenarios, and datasets. IEEE Communications Surveys & Tutorials, 23(2):1048–1077.
  • Schölkopf et al. (1999) Schölkopf, B., Williamson, R. C., Smola, A., Shawe-Taylor, J., and Platt, J. (1999). Support vector method for novelty detection. Advances in Neural Information Processing Systems, 12.
  • Sharafaldin et al. (2019) Sharafaldin, I., Lashkari, A. H., Hakak, S., and Ghorbani, A. A. (2019). Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy. In 2019 International Carnahan Conference on Security Technology (ICCST), pages 1–8. IEEE.
  • Su et al. (2020) Su, T., Sun, H., Zhu, J., Wang, S., and Li, Y. (2020). BAT: Deep learning methods on network intrusion detection using NSL-KDD dataset. IEEE Access, 8:29575–29585.
  • Taghiyarrenani et al. (2018) Taghiyarrenani, Z., Fanian, A., Mahdavi, E., Mirzaei, A., and Farsi, H. (2018). Transfer learning based intrusion detection. In 2018 8th International Conference on Computer and Knowledge Engineering (ICCKE), pages 92–97. IEEE.
  • Talukder et al. (2023) Talukder, M. A., Hasan, K. F., Islam, M. M., Uddin, M. A., Akhter, A., Yousuf, M. A., Alharbi, F., and Moni, M. A. (2023). A dependable hybrid machine learning model for network intrusion detection. Journal of Information Security and Applications, 72:103405.
  • Talukder et al. (2024a) Talukder, M. A., Hossen, R., Uddin, M. A., Uddin, M. N., and Acharjee, U. K. (2024a). Securing transactions: A hybrid dependable ensemble machine learning model using IHT-LR and grid search. arXiv preprint arXiv:2402.14389.
  • Talukder et al. (2024b) Talukder, M. A., Islam, M. M., Uddin, M. A., Hasan, K. F., Sharmin, S., Alyami, S. A., and Moni, M. A. (2024b). Machine learning-based network intrusion detection for big and imbalanced data using oversampling, stacking feature embedding and feature extraction. Journal of Big Data, 11(1):1–44.
  • Talukder et al. (2024c) Talukder, M. A., Sharmin, S., Uddin, M. A., Islam, M. M., and Aryal, S. (2024c). MLSTL-WSN: Machine learning-based intrusion detection using SMOTETomek in WSNs. arXiv preprint arXiv:2402.13277.
  • Wan et al. (2017) Wan, M., Shang, W., and Zeng, P. (2017). Double behavior characteristics for one-class classification anomaly detection in networked control systems. IEEE Transactions on Information Forensics and Security, 12(12):3011–3023.
  • Wu et al. (2022) Wu, T., Fan, H., Zhu, H., You, C., Zhou, H., and Huang, X. (2022). Intrusion detection system combined enhanced random forest with SMOTE algorithm. EURASIP Journal on Advances in Signal Processing, 2022(1):1–20.
  • Xu et al. (2021) Xu, W., Jang-Jaccard, J., Singh, A., Wei, Y., and Sabrina, F. (2021). Improving performance of autoencoder-based network anomaly detection on NSL-KDD dataset. IEEE Access, 9:140136–140146.
  • Zhao et al. (2017) Zhao, J., Shetty, S., and Pan, J. W. (2017). Feature-based transfer learning for network security. In MILCOM 2017-2017 IEEE Military Communications Conference (MILCOM), pages 17–22. IEEE.
  • Zhao et al. (2019a) Zhao, J., Shetty, S., Pan, J. W., Kamhoua, C., and Kwiat, K. (2019a). Transfer learning for detecting unknown network attacks. EURASIP Journal on Information Security, 2019:1–13.
  • Zhao et al. (2019b) Zhao, Y., Nasrullah, Z., and Li, Z. (2019b). PyOD: A Python toolbox for scalable outlier detection. Journal of Machine Learning Research, 20(96):1–7.
  • Zhou and Paffenroth (2017) Zhou, C. and Paffenroth, R. C. (2017). Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674.