Expectations Versus Reality: Evaluating Intrusion Detection Systems in Practice

Jake Hesford1, Daniel Cheng1, Alan Wan1, Larry Huynh1, Seungho Kim2, Hyoungshick Kim2, and Jin B. Hong1 ✉

1University of Western Australia, Australia
Emails: 21683564@student.uwa.edu.au, 23126543@student.uwa.edu.au, 23072152@student.uwa.edu.au, larry.huynh@uwa.edu.au, jin.hong@uwa.edu.au ✉
2Sungkyunkwan University, Republic of Korea
Emails: kimsho98@naver.com, hyoung@skku.edu

Abstract

Our paper provides an objective empirical comparison of recent IDSs to help users choose the most appropriate solution for their requirements. Our results show that no single solution is the best; rather, performance depends on external variables such as the types of attacks, complexity, and network environment in the dataset. For example, the BoT_IoT and Stratosphere IoT datasets both capture IoT-related attacks, but the deep neural network performed best on the BoT_IoT dataset while HELAD performed best on the Stratosphere IoT dataset. So although the deep neural network solution had the highest average F1 score across the tested datasets, it is not always the best-performing one. We further discuss difficulties in using IDSs from the literature and project repositories, which complicated drawing definitive conclusions regarding IDS selection.

Index Terms:

Comparative Analysis, Intrusion Detection System, Machine Learning

I Introduction

Many intrusion detection systems (IDSs) have been proposed over the years, advancing the state of the art with high reported performance [1, 2]. However, comparing them and choosing the best one for a given need is challenging, because there is no standardisation due to the complexity of the environments these IDSs were designed for. To determine to what degree IDSs can be adapted to different environments, we compare their performance across common Network Intrusion Detection System (NIDS) datasets. This approach aims to provide a more standardised basis for comparison, taking into account variables such as attack types, networking technology, and network environments.

There are several key challenges when standardising IDSs for comparison. The first one is using different datasets for testing. IDS solutions are often tested on diverse datasets, each with its unique characteristics and labelling methodologies. This variability complicates direct comparisons between different IDSs, as each system may be optimised for specific types of data [3]. Furthermore, the prevalent use of non-standardised datasets in testing makes it difficult to assess the generalisability and robustness of these systems. Where IDSs were tested on datasets constructed by the IDS developers, the performance seen in these tests may not generalise well when testing against other datasets [4].

The next significant challenge arises from the number of configuration options available in IDSs. These options, while allowing for customisation, can also lead to a lack of clarity in the evaluation methodology. Without a standardised approach to configuring these systems, results can vary widely, making it hard to determine the true effectiveness of an IDS under different scenarios [5]. It also remains unclear whether such variation is primarily due to the dataset or to the IDS. Even for the most well-cited IDS datasets, there is literature that questions the correctness of the data. The intensive and complex process of dataset creation can often lead to error-prone output, which presents challenges for standardised testing unless it is accompanied by extensive data wrangling and modification [6].

Another challenge is that IDSs commonly take either packets or flows as input. Where a dataset does not provide both formats, and the format expected by a given IDS is not the one provided by the dataset authors, adapting the data is non-trivial. This discrepancy makes it difficult to obtain satisfactory results when an IDS and dataset are incompatible without significant processing [1]. Our evaluation process was further complicated by the necessity of converting these datasets into formats compatible with the various IDS solutions. This data wrangling could amplify the errors and inconsistencies inherent in the datasets, and such transformations can introduce additional noise and inaccuracies, thereby skewing the results of the IDS evaluations. For other important historical datasets that have been used to benchmark many IDSs (such as KDD-Cup 99), the pcap files may not be available at all. Where machine learning and autoencoder-based IDSs relied on particular features to make their assessment, utilising these systems became problematic when these files were unavailable or crucial features were not extracted [1].
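To make the packet-versus-flow mismatch concrete, the following minimal sketch aggregates packet records into flow records keyed by the usual 5-tuple. The `Packet` record and the `packets_to_flows` helper are hypothetical names introduced for illustration; they are not part of any IDS or dataset tooling evaluated here, and real converters additionally handle flow timeouts, direction, and protocol state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical minimal packet record; parsing a pcap file is out of scope.
    ts: float
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    size: int


def packets_to_flows(packets):
    """Group packets by 5-tuple and emit one summary record per flow."""
    flows = {}
    # Process packets chronologically so first/last timestamps are correct.
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        if key not in flows:
            flows[key] = {"first": p.ts, "last": p.ts, "packets": 0, "bytes": 0}
        flow = flows[key]
        flow["last"] = p.ts
        flow["packets"] += 1
        flow["bytes"] += p.size
    return [
        {
            "key": key,
            "duration": flow["last"] - flow["first"],
            "packets": flow["packets"],
            "bytes": flow["bytes"],
        }
        for key, flow in flows.items()
    ]
```

Aggregation of this kind discards per-packet detail, so a flow-only dataset cannot be turned back into the packets a packet-based IDS expects.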

Furthermore, the effectiveness of autoencoder-based IDSs, which rely on temporal and benign data to function optimally, is significantly hindered when pcap files are not categorised by scenario or when specific benign traffic data is absent. These systems require a baseline of ‘normal’ traffic against which to compare potentially malicious activity. Without this, training these models to accurately detect anomalies is challenging. The absence of scenario-specific or benign traffic data limits the ability of these systems to establish a normative profile of network activity, which is crucial for their anomaly detection mechanisms. In these cases, we attempted to train the models on the initial benign traffic in the dataset, but this often did not result in adequate performance and may not accurately represent a proper ‘baseline’ of traffic in the scenario [7].
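The fallback of training on the initial benign traffic can be sketched as follows. This is an illustrative assumption about how such a split might be made, not the exact procedure used in the study; the helper name `leading_benign_prefix` and the label convention (0 for benign, 1 for attack) are hypothetical.

```python
def leading_benign_prefix(records, labels):
    """Split a time-ordered stream at the first attack label.

    Returns (train, test): train holds the benign records seen before the
    first attack, and test holds everything from that point onward.
    Assumes labels use 0 for benign and 1 for attack (hypothetical convention).
    """
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    # Index of the first attack, or the stream length if there is none.
    cut = next((i for i, y in enumerate(labels) if y == 1), len(labels))
    return records[:cut], records[cut:]
```

If an attack appears early in a capture, the resulting prefix is short, which is one way such a baseline can fail to represent normal traffic for the scenario.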

Considering these issues, we propose a pipeline for the effective comparison of IDSs. Machine-learning based network IDSs (NIDS) have been increasing in popularity due to their ability to see more complex patterns, and thus display higher protection against unknown attacks than traditional IDSs [7]. These benefits also come with an increase in complexity. In order to determine how flexible these systems are, we focus on NIDSs for the purpose of this study. To evaluate how well these NIDSs generalise, we use five NIDS datasets.

II Related Work

When assessing NIDSs, there are two main components to consider: how an IDS performs with a given dataset, and how datasets perform when run through a given NIDS. Concerning the latter component, work has been done in critically analysing the features of popular IDS datasets. Binbusayyis and Vaiyapuri [8] sought to find the optimal feature sets of four popular datasets using an ensemble method of statistical filters. Layeghy et al. [9] conducted an in-depth analysis of the statistical properties of benign traffic in the CICIDS2017, UNSW and TON-IOT datasets, and compared them with two real-world datasets. They found significant differences in the statistical features between the synthetic and real-world datasets and concluded that the evaluation of NIDS algorithms on synthetic datasets does not guarantee performance in real-world scenarios. Ghurab et al. [10] took an extensive look at the different NIDS datasets used for benchmarking, concluding that recent datasets may be preferable due to their wider attack coverage, while noting that all datasets may be appropriate in differing circumstances. However, since the datasets are not run through any IDSs in that work, it provides no clear framework or pipeline for directly comparing the performance of different IDSs on these datasets, which is crucial information for users who wish to adopt those IDSs in practice. Some other works have also looked at benchmarking IDSs, but many start at the evaluation step with all common datasets and IDSs already set up [11, 12]. Furthermore, some works propose frameworks for this benchmarking, but do not practically implement them [13]. Starting from this idealised view does not address the practical setup issues we encountered. Maseer et al. [14] investigated the performance of various machine learning algorithms, both supervised and unsupervised, on the CICIDS2017 dataset. This study evaluated the performance of various classical ML algorithms on the web-based attacks within CICIDS2017. Antunes et al. [15] evaluate the CSE-CIC-IDS2018 dataset and benchmark common deep learning methods, such as LSTM and CNN. However, each of these studies considers a single dataset in its benchmarking process, rather than providing a comprehensive evaluation across multiple datasets. As mentioned above, the characteristics of datasets differ, and more recent datasets are suggested for obtaining results closer to practical settings.

Some work has focused on identifying the limitations of current IDS research. Ahmad et al. [4] identified a prevalent trend in IDS research where proposed solutions, developed and evaluated using a single model and dataset, often exhibit inherent bias. The shortage of reliable, real-world datasets has been cited as a contributor to this issue [4, 16]. This specialisation can thus result in significantly reduced performance when the model is applied to different network datasets. Our work extends these findings by critically evaluating actual research and open-source IDS implementations across multiple datasets, offering a more comprehensive assessment of their practical performance and accessibility.

III IDS Analysis Pipeline

For analysing and comparing Network IDSs (NIDS), we first select recent IDSs from the literature and public repositories. Due to the large volume of new IDSs, we limit the selection, as described in Section III-A. Next, we select datasets for testing the selected IDSs, as described in Section III-B.

III-A IDS Selection

The first step was to select the NIDSs to be evaluated. Table I provides an overview of the examined IDSs and whether each was selected or excluded based on our criteria. We dealt with two main IDS categories: academic and non-academic. This was done to determine whether there was a noticeable variation in usability or performance between peer-reviewed (academic) studies and practically applicable public (non-academic) systems. Academic systems might be expected to be more experimental, thus possibly showing a higher variance in results, or better results at the cost of setup difficulty and unknown stability. Publicly developed NIDSs, by contrast, could feasibly sacrifice some performance optimisations in favour of more stable releases, or be older and more highly trusted for use in industry. We determined these trade-offs could be worth comparing to academic systems. We used differing criteria for the two types according to their respective properties.

For academic NIDSs, the criteria were as follows:

1. Recency: Papers selected had to be published within the last 5 years to capture recent and relevant insights into a rapidly evolving field of research.
2. Code Availability: Each paper had to have the IDS code attached and available, usually through a GitHub repository. Systems with unavailable code could not be used, as we could not run the analysis process on them.
3. ML-Oriented: The IDSs had to use a machine learning-based detection scheme. ML NIDSs are a rapidly growing and highly promising section of the field, as discussed in the introduction.
4. Reliability: Publisher reliability was assessed by evaluating the ranking of each study’s associated journal or conference. Preference was given to papers from more reputable venues, as reliable conferences typically yield higher quality NIDS comparisons.
5. Usability: The NIDSs were also assessed by the level of difficulty required to run the NIDS out of the box. A study was prioritised if the code could simply be cloned and run directly, producing results similar to those cited in the academic paper. An easily usable study would be more likely to see wider use, and thus be more worthy of inclusion in this assessment of NIDS performance in broader contexts. Furthermore, a well-produced, coherent and simple NIDS interface tends to indicate a higher quality of system and design. We made minimal changes, such as updating deprecated library versions and changing absolute path locations, but an IDS was invalidated if it could not be run following these changes.

A large number of NIDSs satisfied the first four criteria, but many fell short on the fifth (Usability). Due to the absence of provided virtual environments or interpreter versions, many of the NIDSs could not be run and were invalidated. Where no environment was provided, we attempted to run the system and determine a compatible environment. However, this was often complicated by incompatibilities between the package versions that contained the necessary functionality, such as between Keras and TensorFlow.

Following this process, we selected the remaining academic NIDSs that satisfied all of the necessary criteria: Kitsune [7], HELAD [17] and Deep Neural Networks (DNN) [18].

Kitsune is an online, unsupervised, plug-and-play NIDS leveraging an ensemble of autoencoders. This paper was selected due to its popularity, with around one thousand citations, along with its adherence to the other criteria.

HELAD built on the works of the Kitsune IDS, and along with its adherence to the criteria, was selected to provide a comparison to the popular Kitsune IDS to determine its difference in performance.

The DNN study [18] compared the performance of various classical machine learning algorithms, and reported that a deep neural network of 3 layers was optimal for their study. Beyond meeting all the other criteria, the broad coverage of this study was the primary reason for its inclusion.

For non-academic NIDSs, the criteria were as follows:

1. Code Availability: The code had to be publicly available, primarily through GitHub repositories.
2. Popularity: The GitHub repositories had to have over 250 stars and 100 forks. More popular repositories would tend to be of a higher quality, or at least exhibit simpler interfaces and be adaptable for wider use cases.
3. Proper Documentation: Accurate and detailed information in relation to error states, machine learning mechanisms, and architecture must be available. As these public projects did not have directly connected papers, sufficient documentation surrounding setup and usage instructions had to be provided in order to verify expected results and run the system.
4. Ongoing Support: There must be evidence of continuous maintenance for the NIDS, determined by the presence of active contributions to its source code. Regularly updated repositories would tend to reflect newer advancements, or at least be operable on recent devices.
5. Usability: Similar to the academic NIDS selection criteria, the code is required to run without significant alteration.

We selected one IDS from this process: Stratosphere Linux IPS (Slips) [19]. Slips claims to be a behaviour-based intrusion detection and prevention system that uses machine learning algorithms to detect malicious network traffic. Version 1.0.7 of Slips was used for testing and evaluation, although new versions are constantly being released.

As previously mentioned, one of the key issues in IDS selection was being able to actually run the systems without significant issues. Because of their dependencies, one of the primary challenges was determining the optimal versions of the libraries and architectures used. Ideally, virtual environments are provided to simplify this process and inform users precisely what setup is optimal. However, these were frequently absent for academic IDSs, which left us unable to reproduce the authors’ systems. When trying to run them, we encountered various errors, as shown in Table I.

TABLE I: IDSs investigated - green systems were used in the study, with red systems excluded for the displayed reason.

NIDS	Year	Dataset	Source	Usability/Issues
Deep Neural Network (DNN) [18]	2018	KDDCup-‘99’	Conference: ICCCNT	Used in Paper
Kitsune [7]	2018	Custom IoT Dataset	Conference: NDSS	Used in Paper
HELAD [17]	2020	CICIDS2017	Journal: MDPI Informatics	Used in Paper
Multiclass Classification [20]	2020	ASNM Datasets	Conference: DSAA	Vague dependencies in provided repository, “ValueError on converting string to complex in ASNM-TUN.py”
ARTEMIS [21]	2021	Custom Dataset	Conference: LATINCOM	Code error
Dense-Attention-LSTM, DAL [22]	2021	UNSW-NB15	Conference: IWCMC	Dependency errors
I-SiamIDS [23]	2021	CICIDS, NSL-KDD	Journal: Applied Intelligence	Type error
SecureTea [24]	2021	N/A	GitHub	Dependency errors
AutoML [25]	2022	CICIDS2017, IoTID20	Journal: Engineering Applications of Artificial Intelligence	IDS code not provided
Deep Belief Networks NIDS [26]	2022	CICIDS2017	Conference: SciSec	Invalidated by dependency errors in provided repository: “Tensors found on two or more devices”
RIDS [27]	2022	Custom Dataset	Conference: GLOBECOM	Out of memory error
Stratosphere Linux IPS (Slips) [19]	2022	N/A	GitHub	Used in Paper
IDS-ML [28]	2022	CICIDS2017	Journal: Software Impacts	Runtime errors
xNIDS [29]	2023	Mirai, CICDoS2017, NSL-KDD	Conference: USENIX Security	Did not propose a directly usable NIDS, so was not appropriate.
Suricata [30]	2023	N/A	GitHub	Unable to verify any use of ML

III-B Dataset Selection

In our study, the selection and evaluation of datasets played a crucial role in assessing IDSs. We attempted to focus on datasets that varied in terms of attack types, traffic origin, protocols, and other key factors to ensure a comprehensive analysis.

III-B1 Selection Criteria

To provide a simple and structured initial evaluation of the datasets, we were guided by the selection methodologies outlined in previous work [31], ensuring a robust and systematic approach. Additionally, we included “Popularity” as a criterion, focusing on well-cited datasets to ensure that we are working with data that has been widely recognised by the research community. The criteria are as follows:

1. Representation of Modern Network Threats: We prioritised datasets that included current and emerging attack types, as well as those that represented modern network traffic patterns, to better reflect the current landscape of potential threats. As attacks evolve, IDSs can become outdated if they are not updated to reflect modern threats. Coverage of modern threats is crucial for developing IDS software that can effectively mitigate real-world security threats, though the risk of becoming outdated may be reduced by ML and AI systems.
2. Realism and Diversity: Datasets that closely mimic real-world network environments and offer a diverse range of traffic types and attack scenarios were favoured. A broader range of attack types representing modern traffic gives a better view of how well the systems adapt to a range of traffic types.
3. Availability and Quality of Data: The datasets needed to be either publicly accessible, or accessible through permission granted by the dataset authors, and of high quality, with minimal errors or inconsistencies. Datasets that were not available could not be analysed, and thus were not viable for the study.
4. Popularity: We prioritised datasets that are commonly used within the research community for evaluating methodologies. More popular datasets would not only indicate higher levels of use and thus quality, but also be more likely to contain well-formatted and consistent data that would be more easily transferable.

By applying these criteria, we were able to select datasets that not only provided a comprehensive and realistic environment for testing but also aligned well with the capabilities and requirements of the IDSs being evaluated. The selection of datasets evolved slightly over the experimentation period as certain selected datasets became difficult to process and use on the selected IDSs.

III-B2 Evaluated Datasets Used in Results

Datasets like CICIDS2017 and UNSW-NB15 provide a balanced and realistic representation of network traffic and modern attack types. Their labelled nature and comprehensive feature sets made them more suitable for our analysis. The IoT-specific datasets (Stratosphere IoT, Mirai, BoT-IoT, and ToN-IoT) were chosen for their relevance to current and emerging threats in the IoT domain, an area of growing importance in network security and one where many new ML IDSs are focused. Table II provides an overview of the datasets that were used in our evaluations.

TABLE II: Datasets Used for Evaluation

Dataset	Characteristics	Relevance and Reason for Selection
CICIDS2017 [32]	Includes traffic from various devices and operating systems. Labelled with 80 features over 5 days.	Comprehensive range of attacks; ideal for evaluating modern IDSs due to diversity and extensive feature set.
UNSW-NB15 [33]	Generated by ACCS with 49 features and 9 attack types over 2 days.	Represents a wide spectrum of contemporary attack types, providing a broad base for IDS effectiveness testing.
Stratosphere IoT CTU [34]	Focuses on IoT network traffic, with realistic threat and behaviour representation.	Essential for understanding IDS effectiveness in IoT environments due to its focus on realistic IoT-specific threats.
Mirai (Kitsune) [7]	Data specific to Mirai botnet attacks, used with the Kitsune IDS.	Demonstrates significant Mirai threat in IoT, allowing for practical assessment of IDS capabilities against IoT botnets.
BoT-IoT [35] & ToN-IoT [36]	Encompasses legitimate and emulated IoT network traffic.	Offers a balanced view of IDS performance in IoT settings, serving as a robust alternative to the Kitsune dataset.

III-B3 Examined Datasets Not Included in Experimentation

In addition to selecting the datasets most suitable for our study, we considered several datasets that we ultimately did not use. The decision to exclude a dataset was based on a set of criteria developed to ensure the most effective and comprehensive evaluation of the IDS solutions. Table III provides an overview of the excluded datasets and the reasons for their exclusion.

TABLE III: Datasets Considered but Not Used for Evaluation

Dataset	Characteristics	Relevance and Reason for Exclusion
KDD-Cup [37] & NSL-KDD [38]	Historically significant but outdated, lacking pcap files.	Not representative of current network behaviours; incompatible with selected IDSs due to lack of pcap files.
CAIDA [39]	Limited attack diversity and lacks full network data, unlabelled.	Unable to train auto-encoders on the dataset due to lack of labelled results.
CIDDS [40] [41]	Designed for anomaly-based network security.	Not widely used in literature, suggesting potential limitations for analysis.
ISCX2012 [42]	Older dataset without features	Due to lack of features, other datasets were determined to be more suitable
CICIDS2019 [43]	Modern DDoS Dataset containing a variety of DDoS attack types.	Strong modern DDoS dataset, but was not chosen due to the specific nature of attacks when compared to more general datasets used.
Kyoto [44]	Realistic, unsimulated dataset derived from diverse honeypots.	Offers a different perspective to generated datasets, but not highly cited.
LBNL [45]	Heavy anonymisation and absence of payload data.	Limits the depth of analysis for IDSs, making it less favourable for in-depth IDS evaluation.
CICIDS2018 [32]	Diverse traffic and heavy volume without specific pcaps.	Only available as a 250 GB file; data wrangling complexity and volume make processing unwieldy.
ASNM Datasets [46]	NIDS anomaly-based datasets developed for machine learning.	Attack diversity is limited and not as well-cited as many other options.
IoTID [47]	Newer IoT Dataset that aimed to target new IoT intrusion methods	Narrow dataset that is not as popular as the other chosen IoT datasets.
CICDOS2017 [48]	DoS Dataset generated by CIC based on the ISCX dataset	Narrow dataset without attack diversity of CIC dataset from the same year.

Once experimentation began, we also had to consider the availability and adaptability of the data to work with the IDSs under test. We faced challenges with datasets that were not standardised, lacked certain features, or were formatted in ways incompatible with our IDS solutions. For example, datasets like KDD’99 and NSL-KDD, while historically significant, posed difficulties due to their outdated nature and lack of pcap files. Similarly, datasets like CAIDA, while useful for DDoS attack analysis, offered limited diversity in attack types and were not labelled.

IV Testing and Evaluation

Our testing and evaluation methodology for IDSs was designed to provide a comprehensive assessment of each system’s effectiveness. This section outlines the approach taken to test each IDS against the selected datasets, the metrics and criteria used for performance evaluation, and the approach for handling and interpreting the results.

IV-A Methodology for Testing Each IDS

1. Data Preprocessing and Sampling: Significant preprocessing of the dataset files was necessary, including converting datasets into the required format and extracting relevant features for compatibility with each IDS. Random flow sampling was performed on these processed files when the size of dataset files inhibited complete testing.
2. Handling Temporal Statistics: After random sampling of the packets, the results were sorted by their timestamp. This ensured that the IDSs received data that preserved the temporal statistics of the input packets. This step was performed as the IDSs utilise the temporal statistics and trends of input packets.
3. IDS Configuration and Deployment: Each IDS was configured according to the standard setup instructions provided by the developers, without incorporating any customisations or optimisations that could enhance (or detract from) performance. The testing process involved processing the prepared datasets with these baseline configurations. The IDSs utilise machine learning algorithms to classify anomalies, and adjustments were not made to the model parameters shipped with the initial setup instructions, ensuring a uniform, out-of-the-box evaluation framework for all tested IDS solutions.
4. Anomaly Detection Threshold: The anomaly detection threshold for each IDS was determined through a standardised process to ensure fairness in evaluation. This process involved identifying the threshold value that maximised the detection rate of anomalous packets while maintaining a tolerable level of false positives for the given results. The specific threshold value might differ across IDSs due to their varying sensitivity and detection mechanisms, but the methodology for determining this value remained consistent.
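The sampling, temporal ordering, and threshold steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the evaluation: records are assumed to be dicts with a numeric `ts` field, a record is flagged when its anomaly score is at or above the threshold, and the "tolerable level of false positives" is modelled as a hypothetical `max_fpr` cap on the false positive rate.

```python
import random


def sample_and_sort(records, k, seed=0):
    """Randomly sample up to k records, then restore chronological order.

    Each record is assumed to be a dict with a numeric "ts" timestamp
    (hypothetical layout).
    """
    rng = random.Random(seed)
    sampled = rng.sample(records, min(k, len(records)))
    return sorted(sampled, key=lambda r: r["ts"])


def pick_threshold(scores, labels, max_fpr=0.01):
    """Pick the threshold with the highest detection rate whose false
    positive rate does not exceed max_fpr (an assumed tolerance).

    scores: anomaly scores, higher meaning more anomalous.
    labels: 1 for attack, 0 for benign.
    Returns None if no candidate threshold satisfies the cap.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None  # (detection rate, threshold)
    # Try each observed score as a candidate, from strictest to loosest.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        if fpr <= max_fpr and (best is None or tpr > best[0]):
            best = (tpr, t)
    return None if best is None else best[1]
```

Sorting after sampling matters because the sample is drawn in random order; without it, any IDS that tracks temporal statistics would see packets out of sequence.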

IV-B Evaluation Metrics and Criteria

The performance of the IDS solutions was evaluated using the standard evaluation metrics: accuracy, precision, recall, and F1 scores.
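For reference, the four metrics can be computed from confusion-matrix counts as in the sketch below. The function name `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention assumed here rather than one stated in the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives, with "positive" meaning attack.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

As a consistency check against Table IV, a precision of 0.2110 and a recall of 1.0000 give F1 = 2(0.2110)(1.0000)/(1.2110) ≈ 0.3485, which matches the DNN row for the Stratosphere dataset.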

V Results

The overall results of our empirical analysis are shown in Table IV. The source code used for our evaluation is available from hidden.

TABLE IV: Performance Results for Tested IDSs and Datasets

IDS: Kitsune
Dataset	Acc.	Prec.	Rec.	F1
UNSW-NB15	0.6954	0.0221	0.2136	0.0401
BoT_IoT	0.9923	0.8153	0.8609	0.8375
CICIDS2017	0.5540	0.0109	0.9753	0.0216
Stratosphere	0.9921	0.9981	0.9027	0.9480
Mirai	0.8902	0.9999	0.8788	0.9354
Average:	0.8248	0.5693	0.7663	0.5565
IDS: HELAD
UNSW-NB15	0.9717	0.0201	0.0107	0.0140
BoT_IoT	0.9793	0.6916	0.9011	0.7826
CICIDS2017	0.6437	0.9682	0.3706	0.5360
Stratosphere	0.9846	0.9805	1.0000	0.9902
Mirai	0.8898	0.9939	0.8786	0.9327
Average:	0.8938	0.7284	0.6322	0.6511
IDS: DNN
UNSW-NB15	0.9820	0.9820	1.0000	0.9910
BoT_IoT	0.9770	0.9770	1.0000	0.9884
CICIDS2017	0.9800	0.9800	1.0000	0.9899
Stratosphere	0.2110	0.2110	1.0000	0.3485
Mirai	0.9060	0.9060	1.0000	0.9507
Average:	0.8112	0.8112	1.0000	0.8537
IDS: Slips
UNSW-NB15	0.8735	0.0000	0.0000	0.0000
BoT_IoT	0.0018	0.0000	0.0000	0.0000
CICIDS2017	0.9370	0.0037	0.0447	0.0068
Stratosphere	0.6745	0.8809	0.4739	0.6163
Mirai	0.8040	0.1243	0.0159	0.0282
Average:	0.6582	0.2018	0.1069	0.1303

• *Bolded value: the highest value of all IDSs for the metric column.
• **Font colour blue: the highest F1 score of all IDSs for the dataset.
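The per-IDS averages and per-dataset best F1 scores in Table IV can be recomputed with a few lines. The sketch below copies the F1 column from the table; the helper names `average_f1` and `best_per_dataset` are hypothetical, and the unweighted mean over the five datasets is the averaging assumed here.

```python
from statistics import mean

# Dataset order used for every list of scores below.
DATASETS = ["UNSW-NB15", "BoT_IoT", "CICIDS2017", "Stratosphere", "Mirai"]

# F1 scores copied from Table IV.
F1_SCORES = {
    "Kitsune": [0.0401, 0.8375, 0.0216, 0.9480, 0.9354],
    "HELAD": [0.0140, 0.7826, 0.5360, 0.9902, 0.9327],
    "DNN": [0.9910, 0.9884, 0.9899, 0.3485, 0.9507],
    "Slips": [0.0000, 0.0000, 0.0068, 0.6163, 0.0282],
}


def average_f1(scores):
    """Unweighted mean F1 per IDS, rounded to four decimal places."""
    return {ids: round(mean(vals), 4) for ids, vals in scores.items()}


def best_per_dataset(scores, datasets):
    """Name of the IDS with the highest F1 score on each dataset."""
    return {
        name: max(scores, key=lambda ids: scores[ids][i])
        for i, name in enumerate(datasets)
    }
```

Calling `best_per_dataset(F1_SCORES, DATASETS)` selects the DNN for every dataset except Stratosphere, where HELAD has the highest F1 score.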

These results show that the DNN [18] is the most versatile across the datasets tested, achieving the highest average F1 score amongst all tested IDSs. However, it performed considerably worse on the Stratosphere dataset, indicating that it may not be the optimal solution depending on user requirements. Given that both the Stratosphere and BoT-IoT datasets focus on IoT network traffic, the performance difference between the two requires further investigation of the dataset differences to better understand its cause. However, this is outside the scope of this paper. On the other hand, HELAD achieved the highest average accuracy, but given that the datasets are not fully balanced, accuracy is not the best indicator of which IDS performs best. Hence, when proposing IDSs, authors should present a range of metrics to ease comparison against other systems.
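The caution about accuracy on unbalanced data can be illustrated with a small, entirely hypothetical example: a detector that never raises an alert on a sample containing 10 attacks among 1000 records. The numbers and the helper `accuracy_and_f1` are made up for illustration and are not drawn from the evaluated datasets.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 marks an attack."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Equivalent form of F1: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical, heavily imbalanced sample: 10 attacks among 1000 records.
y_true = [1] * 10 + [0] * 990
# A detector that never raises an alert.
y_pred = [0] * 1000
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

Here `acc` is 0.99 while `f1` is 0.0: the detector misses every attack yet still scores 99% accuracy, which is why the F1 score is the more informative headline metric when classes are unbalanced.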

Several factors identified throughout the testing process can explain the sub-optimal results observed in some cases:

1. Inconsistent Performance Across Datasets: The results show significant variation in the performance of different IDS models across datasets. For instance, Kitsune demonstrated high accuracy in the IoT-focused datasets BoT-IoT and Stratosphere (0.9923 and 0.9921 respectively), but significantly lower accuracy in CICIDS2017 (0.5540). This disparity suggests that the effectiveness of an IDS model can be highly dependent on the specific characteristics of the dataset.
2. Overfitting to Specific Dataset Characteristics: Kitsune’s high accuracy in IoT datasets (BoT-IoT and Stratosphere) versus its lower performance in CICIDS2017 may also suggest overfitting to certain dataset traits, impairing its effectiveness in more diverse network conditions.
3. High False Positives/Negatives in Certain Scenarios: The performance metrics, particularly the precision and recall values, indicate potential issues with false positives and negatives. For example, in the CICIDS2017 dataset, HELAD showed a high precision of 0.9682, but a much lower recall of 0.3706, implying a tendency to miss actual attacks (high false negatives).
4. Dataset and IDS Model Compatibility: The specific structure of individual datasets could have impacted the IDS models differently. The DNN showed exceptional performance on UNSW-NB15, BoT-IoT, and CICIDS2017 (with F1 scores above 0.98) but struggled significantly with the Stratosphere dataset (F1 score of 0.3485). This suggests that specific features of Stratosphere may not be well-handled by this particular model and may warrant further exploration.
5. Impact of Preprocessing on Model Efficacy: The preprocessing steps necessary for compatibility could have differentially impacted the models. For instance, the poor performance of the DNN on Stratosphere (0.2110 accuracy) compared to its high performance on other datasets may also indicate preprocessing issues specific to this dataset.
6. Lack of Representative Benign Traffic: Some datasets do not explicitly provide a benign traffic baseline. High precision but low recall in models like HELAD on datasets such as CICIDS2017 may indicate insufficient representation of benign traffic. This can lead to higher false positives, as the model is unable to recognise normal network behaviours effectively.

Through this analysis, we aimed to provide not just an empirical comparison of different IDS solutions but also a deeper understanding of the factors influencing their performance, which is essential for advancing network security and guiding the development of more effective IDSs.

VI Discussion

Our evaluation surfaced a series of insights and challenges that are pivotal for understanding the current state and future direction of IDS technologies. This section discusses the key findings, their implications, and the broader context of network security.

VI-A Interpretation of Results

Our results indicated a significant variation in the performance of different IDS solutions across various datasets. Notably:

1. Performance Variability: The effectiveness of the IDSs varied significantly depending on the dataset used. This variability underscores the importance of selecting diverse datasets for IDS evaluation, so that a broad set of attack scenarios is considered and the different areas of strength of each IDS and its configuration options are revealed.
2. Dataset-Specific Challenges: The performance of IDS solutions may be adversely affected by datasets that lack a representative benign traffic sample. The evaluation of the system may also be affected by the variety of attack types present in the dataset. This highlights the need for diverse and comprehensive datasets in IDS testing.
3. Preprocessing Impact: The preprocessing steps necessary for formatting and compatibility may play a crucial role in IDS performance. In some cases, preprocessing could lead to data loss or introduce errors, reducing the accuracy of the systems.
4. Challenges in Handling Real-World Network Dynamics: The struggle of some models with datasets that include a wider range of traffic patterns highlights the difficulty of adapting to the complexities of real-world network dynamics, especially without custom configuration of IDS solutions.
5. Adaptability to Evolving Threats: The results also highlight the challenge for IDS solutions of adapting to broad and evolving threats. Static datasets may not encompass emerging attack vectors, necessitating continual updates and testing against new threat scenarios. Including UNSW-NB15 and CICIDS2017 in our evaluation exposed potential gaps in the capabilities of the IDSs; without these datasets, we may have presented inflated results.
6. Need for More Comprehensive Testing: The varied results across different datasets indicate the need for more comprehensive testing conditions to evaluate the robustness of IDS models.

VI-B Insights from IDS Performance on Specific Datasets

1. Dataset Compatibility: The IDS solutions demonstrated high effectiveness when tested on datasets they were either designed for or shipped with. This suggests that these systems can be highly effective when the data characteristics align closely with their underlying detection algorithms.
2. Role of Benign Traffic Profile: The strong results on the Stratosphere CTU IoT Dataset underscore the importance of having a well-defined benign traffic profile. This dataset, containing a realistic representation of IoT network traffic, including both normal and malicious activities, provided an ideal environment for an IDS to distinguish accurately between benign and malicious behaviours.
3. Model Overfitting and Generalisation: The results suggest a potential issue of overfitting in some IDS models. For instance, a model showing high accuracy on specific datasets (like Kitsune on BoT-IoT) but performing poorly on others (like CICIDS2017) may indicate a lack of generalisation capability.
4. False Positives/Negatives Concerns: Disparities in precision and recall metrics across datasets indicate issues with false positives or negatives. For example, HELAD’s high precision but low recall in CICIDS2017 points to a tendency to miss attacks, emphasising the need for a balanced approach to anomaly detection.
5. Customisation and Tuning: These observations suggest that IDS solutions, when appropriately customised and tuned to specific network environments and traffic profiles, can offer robust detection capabilities. Such customisation is, however, crucial to the eventual performance of the system: plug-and-play IDSs may not offer adequate performance in certain circumstances.

VI-C Practical Implications for Deployment

Our findings have several practical implications for the deployment of IDS solutions in real-world scenarios:

1. Customisation for Network Environment: The variability in performance across different datasets, such as Kitsune’s high F1 score of 0.9480 on Stratosphere versus its low F1 score of 0.0216 on CICIDS2017, underscores the need to tailor or optimise IDS solutions for specific network environments.
2. Need for Dynamic Testing Datasets: The disparities in IDS performance across datasets, such as HELAD’s high performance on a narrower dataset such as Stratosphere (F1 score of 0.9902) versus its lower performance on the more general UNSW-NB15 (F1 score of 0.0140), support calls for developing dynamic testing datasets that accurately simulate real-world conditions, including a mix of known threats and benign activities [9].
3. Ongoing Adaptation and Update: The landscape of network threats is continually evolving, and both datasets and IDSs can quickly fall behind current trends. This necessitates regular updates and adaptations of both IDS solutions and testing datasets to new threats and network conditions [1].
4. Holistic Security Approach: The varying effectiveness of different IDS solutions (e.g., Slip’s low average F1 score of 0.1303) suggests that relying solely on IDS is insufficient. Instead, IDS should be integrated into a multi-layered security approach for robust protection.
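The dataset dependence described in these points can be summarised numerically. The following minimal sketch uses only the F1 scores quoted in this discussion (so the table is deliberately partial) to rank the IDSs on a single dataset and to measure how far each IDS's F1 score swings between datasets. The helper names `rank_on` and `f1_spread` are hypothetical and introduced purely for illustration.

```python
from collections import defaultdict

# F1 scores quoted in the text; this is a partial view, not the full results table.
REPORTED_F1 = {
    ("Kitsune", "Stratosphere"): 0.9480,
    ("Kitsune", "CICIDS2017"): 0.0216,
    ("HELAD", "Stratosphere"): 0.9902,
    ("HELAD", "UNSW-NB15"): 0.0140,
    ("DNN", "Stratosphere"): 0.3485,
}


def rank_on(scores, dataset):
    """Return (ids_name, f1) pairs for one dataset, best F1 first."""
    rows = [(name, f1) for (name, ds), f1 in scores.items() if ds == dataset]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def f1_spread(scores):
    """Max minus min F1 per IDS, for IDSs with at least two quoted scores."""
    by_ids = defaultdict(list)
    for (name, _dataset), f1 in scores.items():
        by_ids[name].append(f1)
    return {n: max(v) - min(v) for n, v in by_ids.items() if len(v) >= 2}


stratosphere_ranking = rank_on(REPORTED_F1, "Stratosphere")
spread = f1_spread(REPORTED_F1)

print(stratosphere_ranking)
print(spread)
```

A large spread for a given IDS signals that a single-dataset benchmark would be a poor predictor of its behaviour in a different network environment.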

VI-D Future Directions and Recommendations

Based on our study, we recommend the following directions for future research and development in IDS:

1. Code availability and documentation: One critical issue was the inability to access all the code necessary to run an IDS, or the lack of documentation for troubleshooting or customising the IDS. As shown in Table I, many IDSs had to be excluded from the analysis because of errors we could not resolve. We attempted to contact the authors in all of these cases without success, which prevented us from using these IDSs despite the strong performance reported in their papers.
2. Development of comprehensive datasets: There is a growing need for more comprehensive and diverse datasets that accurately reflect current network environments and attack vectors, as many recent IDSs are based on machine learning techniques. As shown in Table IV, the IDSs performed differently even when the datasets captured the same type of attacks (e.g., Stratosphere and BoT-IoT). Moreover, the standardisation of datasets and IDS input data is also of paramount importance, as dataset creators do not all provide the same information (e.g., pcaps only, flows only, unlabelled pcaps, etc.).
3. Virtualisation: The analysis is difficult to perform because IDSs have different dependency requirements. Providing a virtual environment could significantly reduce the code-related errors encountered when performing an evaluation. However, we have not seen any IDS solutions provided in a virtualised environment, which hinders the evaluation process.
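One way to picture the standardisation problem raised above is as a compatibility check between what a dataset ships and what an IDS expects as input. The sketch below is purely illustrative: the `DatasetSpec` and `IdsSpec` types, the `compatibility_issues` helper, and the example entries are all hypothetical and do not describe any specific dataset or IDS from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    name: str
    formats: frozenset  # which of {"pcap", "flow"} the dataset ships
    labelled: bool      # whether ground-truth labels are provided


@dataclass(frozen=True)
class IdsSpec:
    name: str
    input_format: str   # "pcap" or "flow"
    needs_labels: bool  # e.g. for supervised training or scoring


def compatibility_issues(ids: IdsSpec, dataset: DatasetSpec) -> list:
    """List the mismatches that would force extra preprocessing."""
    issues = []
    if ids.input_format not in dataset.formats:
        issues.append(
            f"{dataset.name} ships no {ids.input_format} data; conversion needed"
        )
    if ids.needs_labels and not dataset.labelled:
        issues.append(f"{dataset.name} is unlabelled but {ids.name} needs labels")
    return issues


# Hypothetical examples, not real datasets or IDSs.
flow_ids = IdsSpec("example-flow-ids", "flow", needs_labels=True)
pcap_only = DatasetSpec("example-unlabelled-pcaps", frozenset({"pcap"}), labelled=False)
full = DatasetSpec("example-full", frozenset({"pcap", "flow"}), labelled=True)

issues_pcap_only = compatibility_issues(flow_ids, pcap_only)
issues_full = compatibility_issues(flow_ids, full)
print(issues_pcap_only)
print(issues_full)
```

Each reported mismatch corresponds to a preprocessing step that an evaluator would otherwise have to discover by trial and error.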

VII Conclusion

Creating a versatile IDS is a challenging task due to the complex nature of intrusions and of the environment in which the IDS operates. Our IDS analysis process provides an overview of how different recent IDSs performed across various datasets. Amongst them, the DNN solution [18] was the most versatile IDS, achieving the highest average F1 score of 0.8537 (followed by HELAD with an average F1 score of 0.6511). However, HELAD performed better on the Stratosphere dataset, indicating that the best IDS to choose will differ depending on the use case. Further, our results and experience showed a gap between research and practical applications: we faced many challenges when trying to use IDSs from both academia and public projects, as well as discrepancies in the datasets used for evaluating IDSs. Therefore, we also suggest several aspects for future IDS developers to consider in order to better standardise the evaluation and adoption of IDSs, which will help users to better understand the capabilities and limitations of IDSs and to make better use of them.

References

[1] A. Khraisat, I. Gondal, P. Vamplew, and J. Kamruzzaman, “Survey of intrusion detection systems: techniques, datasets and challenges,” Cybersecurity, vol. 2, no. 1, pp. 1–22, 2019.
[2] Z. Yang, X. Liu, T. Li, D. Wu, J. Wang, Y. Zhao, and H. Han, “A systematic literature review of methods and datasets for anomaly-based network intrusion detection,” Computers & Security, vol. 116, p. 102675, 2022.
[3] Z. Azam, M. M. Islam, and M. N. Huda, “Comparative analysis of intrusion detection systems and machine learning based model analysis through decision tree,” IEEE Access, 2023.
[4] R. Ahmad, I. Alsmadi, W. Alhamdani, and L. Tawalbeh, “A comprehensive deep learning benchmark for iot ids,” Computers & Security, vol. 114, p. 102588, 2022.
[5] Y. Song, S. Hyun, and Y.-G. Cheong, “Analysis of autoencoders for network intrusion detection,” Sensors, vol. 21, no. 13, p. 4294, 2021.
[6] L. Liu, G. Engelen, T. Lynar, D. Essam, and W. Joosen, “Error prevalence in nids datasets: A case study on cic-ids-2017 and cse-cic-ids-2018,” in 2022 IEEE Conference on Communications and Network Security (CNS). IEEE, 2022, pp. 254–262.
[7] Y. Mirsky, T. Doitshman, Y. Elovici, and A. Shabtai, “Kitsune: an ensemble of autoencoders for online network intrusion detection,” in Proceedings 2018 Network and Distributed System Security Symposium, 2018.
[8] A. Binbusayyis and T. Vaiyapuri, “Identifying and benchmarking key features for cyber intrusion detection: An ensemble approach,” IEEE Access, vol. 7, pp. 106 495–106 513, 2019.
[9] S. Layeghy, M. Gallagher, and M. Portmann, “Benchmarking the benchmark-comparing synthetic and real-world network ids datasets,” Available at SSRN 4141050, 2022.
[10] M. Ghurab, G. Gaphari, F. Alshami, R. Alshamy, and S. Othman, “A detailed analysis of benchmark datasets for network intrusion detection system,” Asian Journal of Research in Computer Science, vol. 7, no. 4, pp. 14–33, 2021.
[11] A. A. Cárdenas, J. S. Baras, and K. Seamon, “A framework for the evaluation of intrusion detection systems,” in 2006 IEEE Symposium on Security and Privacy (S&P 2006), 21-24 May 2006, Berkeley, California, USA. IEEE Computer Society, 2006, pp. 63–77.
[12] A. Milenkoski, M. Vieira, S. Kounev, A. Avritzer, and B. D. Payne, “Evaluating computer intrusion detection systems: A survey of common practices,” ACM Comput. Surv., vol. 48, no. 1, pp. 12:1–12:41, 2015.
[13] S. Ayoubi, G. Blanc, H. Jmila, T. Silverston, and S. Tixeuil, “Data-driven evaluation of intrusion detectors: A methodological framework,” in Foundations and Practice of Security - 15th International Symposium, FPS 2022, Ottawa, ON, Canada, December 12-14, 2022, Revised Selected Papers, ser. Lecture Notes in Computer Science, G. Jourdan, L. Mounier, C. M. Adams, F. Sèdes, and J. García-Alfaro, Eds., vol. 13877. Springer, 2022, pp. 142–157.
[14] Z. K. Maseer, R. Yusof, N. Bahaman, S. A. Mostafa, and C. F. M. Foozy, “Benchmarking of machine learning for anomaly based intrusion detection systems in the cicids2017 dataset,” IEEE Access, vol. 9, pp. 22 351–22 370, 2021.
[15] M. Antunes, L. Oliveira, A. Seguro, J. Veríssimo, R. Salgado, and T. Murteira, “Benchmarking deep learning methods for behaviour-based network intrusion detection,” in Informatics, vol. 9, no. 1. MDPI, 2022, p. 29.
[16] M. Sarhan, S. Layeghy, N. Moustafa, and M. Portmann, “Netflow datasets for machine learning-based network intrusion detection systems,” in Big Data Technologies and Applications: 10th EAI International Conference, BDTA 2020, and 13th EAI International Conference on Wireless Internet, WiCON 2020, Virtual Event, December 11, 2020, Proceedings 10. Springer, 2021, pp. 117–135.
[17] Y. Zhong, W. Chen, Z. Wang, Y. Chen, K. Wang, Y. Li, X. Yin, X. Shi, J. Yang, and K. Li, “Helad: A novel network anomaly detection model based on heterogeneous ensemble learning,” Computer Networks, vol. 169, p. 107049, 2020.
[18] R. K. Vigneswaran, R. Vinayakumar, K. Soman, and P. Poornachandran, “Evaluating shallow and deep neural networks for network intrusion detection systems in cyber security,” in 2018 9th International conference on computing, communication and networking technologies (ICCCNT). IEEE, 2018, pp. 1–6.
[19] S. Garcia, “Stratosphere ips,” https://github.com/stratosphereips/StratosphereLinuxIPS/tree/master, 2023, GitHub repository.
[20] A. Shah, S. Clachar, M. Minimair, and D. Cook, “Building multiclass classification baselines for anomaly-based network intrusion detection systems,” in 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2020, pp. 759–760.
[21] F. Mosaiyebzadeh, L. G. A. Rodriguez, D. M. Batista, and R. Hirata, “A network intrusion detection system using deep learning against mqtt attacks in iot,” in 2021 IEEE Latin-American Conference on Communications (LATINCOM). IEEE, 2021, pp. 1–6.
[22] K. Cao, J. Zhu, W. Feng, C. Ma, M. Liu, and T. Du, “Network intrusion detection based on dense dilated convolutions and attention mechanism,” in 2021 International Wireless Communications and Mobile Computing (IWCMC). IEEE, 2021, pp. 463–468.
[23] P. Bedi, N. Gupta, and V. Jindal, “I-siamids: an improved siam-ids for handling class imbalance in network-based intrusion detection systems,” Applied Intelligence, vol. 51, pp. 1133–1151, 2021.
[24] O. Foundation, “Securetea-project,” https://github.com/OWASP/SecureTea-Project, 2023, GitHub repository.
[25] L. Yang and A. Shami, “Iot data analytics in dynamic environments: From an automated machine learning perspective,” Engineering Applications of Artificial Intelligence, vol. 116, p. 105366, 2022.
[26] O. Belarbi, A. Khan, P. Carnelli, and T. Spyridopoulos, “An intrusion detection system based on deep belief networks,” in International Conference on Science of Cyber Security. Springer, 2022, pp. 377–392.
[27] R. Saini, D. Halder, and A. M. Baswade, “Rids: Real-time intrusion detection system for wpa3 enabled enterprise networks,” in GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 2022, pp. 43–48.
[28] L. Yang and A. Shami, “Ids-ml: An open source code for intrusion detection system development using machine learning,” Software Impacts, vol. 14, p. 100446, 2022.
[29] F. Wei, H. Li, Z. Zhao, and H. Hu, “Xnids: Explaining deep learning-based network intrusion detection systems for active intrusion responses,” in 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 2023.
[30] O. I. S. Foundation, “Suricata,” https://github.com/OISF/suricata, 2023, GitHub repository.
[31] I. Sharafaldin, A. Gharib, A. H. Lashkari, and A. A. Ghorbani, “Towards a reliable intrusion detection benchmark dataset,” Software Networking, vol. 2017, no. 1, pp. 177–200, 2017.
[32] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” ICISSP, vol. 1, pp. 108–116, 2018.
[33] N. Moustafa and J. Slay, “Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set),” in 2015 military communications and information systems conference (MilCIS). IEEE, 2015, pp. 1–6.
[34] S. Garcia, A. Parmisano, and M. J. Erquiaga, “Iot-23: A labeled dataset with malicious and benign iot network traffic,” 2020.
[35] J. Ashraf, M. Keshk, N. Moustafa, M. Abdel-Basset, H. Khurshid, A. D. Bakhshi, and R. R. Mostafa, “Iotbot-ids: A novel statistical learning-enabled botnet detection framework for protecting networks of smart cities,” Sustainable Cities and Society, vol. 72, p. 103041, 2021.
[36] N. Moustafa, “A new distributed architecture for evaluating ai-based security systems at the edge: Network ton_iot datasets,” Sustainable Cities and Society, vol. 72, p. 102994, 2021.
[37] I. UCI, “Kdd cup 1999,” https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2020.
[38] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the kdd cup 99 data set,” in 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE, 2009, pp. 1–6.
[39] “Anonymized Internet Traces April 2008 - January 2019,” https://doi.org/10.21986/CAIDA.DATA.ANONYMIZED-INTERNET-TRACES, dates used: <date(s) used>. Accessed: <date accessed>.
[40] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, and A. Hotho, “Flow-based benchmark data sets for intrusion detection,” in Proceedings of the 16th European Conference on Cyber Warfare and Security (ECCWS). ACPI, 2017, pp. 361–369.
[41] ——, “Creation of flow-based data sets for intrusion detection,” Journal of Information Warfare, vol. 16, pp. 40–53, 2017.
[42] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, vol. 31, no. 3, pp. 357–374, 2012.
[43] I. Sharafaldin, A. H. Lashkari, S. Hakak, and A. A. Ghorbani, “Developing realistic distributed denial of service (ddos) attack dataset and taxonomy,” in 2019 International Carnahan Conference on Security Technology (ICCST). IEEE, 2019, pp. 1–8.
[44] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, and K. Nakao, “Statistical analysis of honeypot data and building of kyoto 2006+ dataset for nids evaluation,” in Proceedings of the first workshop on building analysis datasets and gathering experience returns for security, 2011, pp. 29–36.
[45] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney, “A first look at modern enterprise traffic,” in Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement, 2005, pp. 2–2.
[46] I. Homoliak, K. Malinka, and P. Hanacek, “Asnm datasets: A collection of network attacks for testing of adversarial classifiers and intrusion detectors,” IEEE Access, vol. 8, pp. 112 427–112 453, 2020.
[47] I. Ullah and Q. H. Mahmoud, “A scheme for generating a dataset for anomalous activity detection in iot networks,” in Canadian conference on artificial intelligence. Springer, 2020, pp. 508–520.
[48] H. H. Jazi, H. Gonzalez, N. Stakhanova, and A. A. Ghorbani, “Detecting http-based application layer dos attacks on web servers in the presence of sampling,” Computer Networks, vol. 121, pp. 25–36, 2017.