Introduction

Keeping track of network security is a major concern in modern communication networks. Software-Defined Networking (SDN) provides a new network architecture that separates the data forwarding functions from the control logic. SDN enables dynamic changes to network policies and facilitates network programmability1. It also allows network operators to develop different applications on top of the logically centralized control view of the network, which has increased the flexibility and programmability of the network. Unfortunately, the new architecture brings new attack vectors2, notably in the SDN controller3. Securing such an architecture requires addressing several aspects; one of them is the SDN controller, which serves as the brain of the SDN network4 and is the primary concern of this paper.

To enhance overall SDN network security for future deployments, we must address the security issues introduced by the new architecture5. One of the key strategies for preventing cyber-attacks is prompt identification of the attack process using an intrusion detection system (IDS). A network IDS is employed to identify unauthorized activities on a digital network6. Thanks to the programmability of SDN, an IDS can be developed as an SDN application. There are primarily two types of IDS: signature-based and anomaly-based. The main difference between them is that an anomaly-based IDS can detect unknown cyber-attacks, whereas a signature-based IDS can only identify well-known patterns already present in the system.

Machine Learning (ML) offers powerful techniques that can be applied within the SDN architecture to enhance both functional and non-functional aspects of the network, such as security and performance7. Deep learning (DL), a subset of ML, can handle large-scale data and has demonstrated success across various fields8. However, improving a network IDS with ML/DL techniques is difficult, because it requires training and assessing the model with correctly labeled normal and malicious samples9. Researchers applying ML/DL-based IDS in SDN networks have found that the scalability of both SDN and ML/DL algorithms has significantly improved IDS accuracy10. ML and DL approaches can detect patterns and anomalies that traditional methods might miss. A well-trained ML/DL model can distinguish between normal and malicious traffic, providing a robust defense mechanism against cyber threats. Various ML/DL techniques such as Naïve Bayes, Support Vector Machine (SVM), Logistic Regression, Neural Networks (NN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and, more recently, Transformers can be used to build effective models.

In this paper, we have created two DL models for constructing intrusion detection systems, utilizing state-of-the-art techniques to enhance detection accuracy and reduce false alarm rates. We evaluated our models’ performance using accuracy, precision, recall, and F1 score.

To the best of our knowledge, we are the pioneers in utilizing the Transformer model along with the InSDN dataset for SDN security. The rest of this paper is organized as follows. Section 2 covers related works. Section 3 provides a brief background on LSTM and Transformers. Section 4 introduces our models’ architectures. Section 5 presents and compares the results obtained. Section 6 contains the conclusion of the paper. Finally, Section 7 outlines potential future work.

Related works

This study focuses on deep learning (DL) approaches and sets aside algorithms such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), Decision Trees (DTs), and Naive Bayes (NB) classifiers, because these algorithms rely on pre-defined features, which can limit the models’ capabilities, and are known for their high false alarm rates11. In contrast, DL does not require manual feature extraction. It is capable of investigating the deep structure of raw data and automatically extracting relationships between different data records. Over the past years, DL techniques have been widely used for building Intrusion Detection System (IDS) models. This section introduces the most recent DL approaches for that purpose.

Tuan Anh Tang et al.12 employed a Gated Recurrent Unit Recurrent Neural Network (GRU-RNN). They tested their model using the NSL-KDD and CICIDS2017 datasets, achieving classification accuracies of 89% and 99%, respectively. Both datasets are considered old and were collected from traditional networks, so they do not fairly represent modern SDN networks.

S. S. Volkov et al.13 used a Long Short-Term Memory (LSTM) based neural network to build their model. They used the CSE-CICIDS2018 dataset for training and testing. CSE-CICIDS2018 contains several different attack scenarios such as brute force, botnet, web attacks, DoS, and DDoS; however, it was also generated from traditional network flows.

A. Soliman et al.14 introduced models relying on Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU). They used the InSDN dataset for training and testing with two different feature sets: one including 48 features, which achieved accuracies between 98.0% and 98.85%, and another including 6 features, which achieved accuracies ranging between 91.1% and 92.5%.

Wang et al.15 designed a hybrid structure called DDosTC that combines the Transformer attention technique and a Convolutional Neural Network (CNN) to detect distributed denial-of-service (DDoS) attacks on SDN. The model was tested and evaluated using the recent CICDDoS2019 dataset.

M. S. Elsayed et al.11 developed another hybrid model, this time combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). They achieved an accuracy of 96.32% on the InSDN dataset.

Zihan Wu et al.16 formulated a Robust Transformer-based Intrusion Detection System (RTIDS). The model utilizes positional embedding to capture sequential information between features, followed by a stacked encoder-decoder neural network to learn low-dimensional features, in addition to a self-attention mechanism to improve classification. They demonstrated the effectiveness of RTIDS on the CICIDS2017 and CICDDoS2019 datasets, achieving F1 scores of 99.17% and 98.48%, respectively.

In Z. Long et al.17, a simple encoder-decoder Transformer was used for classifying CICIDS2018 attacks; the model achieved F1 scores of 93.9% and 48.9% for DoS and DDoS, respectively.

V. Hnamte et al.18 used Convolutional Neural Networks (CNN) along with Bidirectional Long Short-Term Memory (BiLSTM) to create a model that, according to their experiments, achieves 100% and 99.64% accuracy on the CICIDS2018 and Edge_IIoT datasets, respectively.

Ivandro O. Lopes et al.19 designed and implemented a set of temporal convolutional models. They evaluated their models against the CICDDoS2019 and CSE-CIC-IDS2018 datasets. According to their experiments, the models achieved performance between 98.07% and 99.95% for the majority of the considered metrics.

Ganesh Khekare et al.20 proposed a hybrid Generative Adversarial Network-Recurrent Neural Network (GAN-RNN) model. The model consists of two parts: a GAN that creates custom synthetic traffic patterns used to enhance the security analysis, and an RNN that predicts upcoming network attacks. On the CICIDS2017 dataset, the model achieved 99.4% for both accuracy and F1 score.

A. Meliboev et al.21 tried several models, including a Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Recurrent Neural Network (RNN), and a CNN-LSTM model. The KDD_cup99, NSL-KDD, and UNSW_NB15 datasets were used to evaluate the proposed models. The lowest accuracy was observed for the RNN on UNSW_NB15 (71.9%), and the highest for the CNN on KDD_cup99 (95.2%).

V. Hnamte et al.22 proposed a hybrid deep learning technique that consists of two stages, Long Short-Term Memory (LSTM) and an Auto-Encoder (AE). Performance was evaluated using the CICIDS2017 and CSE-CICIDS2018 datasets, achieving 99.99% and 99.1% accuracy, respectively.

M. Mahmoud et al.23 employed an Autoencoder combined with LSTM to perform intrusion detection in an IoT environment. They used the NSL-KDD dataset and achieved 98.88% accuracy. The authors applied only standard scaling, with no other preprocessing steps.

A. S. Issa et al.24 proposed a distributed denial-of-service (DDoS) attack mitigator based on a CNN-LSTM model. The NSL-KDD dataset was again used for training and testing, and they achieved 99.02% accuracy. However, due to the dataset’s age, it cannot be relied upon for detecting modern network attacks.

James Dzisi et al.25 proposed a hybrid model consisting of an RNN and an LSTM for mitigating DDoS attacks on SDN controllers. The authors built their dataset using a virtual environment. They investigated various split ratios (80/20, 70/30, and 60/40); their best accuracy was 89.63%, achieved with the 70/30 split.

T. Zhang et al.26 introduced a Transformer-based model that mitigates the relay link forgery attack. The authors built their dataset using a simulated environment based on Mininet and the Ryu controller. After a month of data collection, they ended up with 334,000 records. Their model achieves around 96.5% accuracy and a 96.2% F1 score.

R. A. Elsayed et al.27 built a two-level intrusion detection system using a Long Short-Term Memory (LSTM) network. They used both the ToN-IoT and InSDN datasets for training and testing. Their experiments yielded 96.56% and 99.39% accuracy, and 97.35% and 98.45% F1 scores, for the ToN-IoT and InSDN datasets, respectively.

Y. Li et al.28 proposed a method combining a Transformer, federated learning, and the Paillier cryptosystem. The Transformer was used for detection and deployed on edge nodes. The proposed approach enables collaborative training of a detection model using data from all edge nodes through a federated learning framework. The authors compared their results with CNN and LSTM models. The dataset was generated using the IEEE 14-bus and IEEE 118-bus systems. The proposed model achieved an accuracy above 90% for both strong and weak attacks.

After investigating a wide range of research, we noted that most of the datasets used were built on conventional networks and then adapted to serve as SDN datasets. The InSDN dataset is an exception, as it was built from SDN traffic, so it is our focus. Our first model is based on the CNN-LSTM technique, which appears promising, and the second is based on the Transformer architecture, the most significant and popular DL technique today.

Background

Deep learning (DL) approaches have evolved from machine learning (ML) ones. In traditional ML, developers must manually choose the most important features or apply further analysis; the models then automatically learn how to map these features to outputs. DL, on the other hand, constructs features automatically in a multi-level process, starting from some chosen features or the whole possible set of features. At each level, the network constructs more abstract features and feeds them to the next level, until reaching the output.

Convolutional Neural Network Long Short-Term Memory (CNN-LSTM)

First, we consider convolutional neural networks (CNN), which are regarded as one of the most important ML techniques in image processing and, later, in almost any classification problem. One advantage of CNNs is their ability to learn features from the data without any separate feature extraction method. Another is their ability to derive higher-level features from the features already produced. The network learns to identify features through back-propagation. A typical CNN architecture (Fig. 1) consists of a series of layers (a minimal code sketch follows Fig. 1):

  • Input Layer: Represents the received inputs; it has a neuron for each value in the input data.

  • Convolution (CONV) Layer: This layer applies a set of learnable filters to the input. Each filter activates certain features from the input, and by changing the number of filters and filter size in each layer we can construct different features.

  • Activation Layer: Often we use Rectified Linear Unit (ReLU) as an activation function. Its mission is to introduce non-linearity into the network, allowing it to learn more complex features.

  • Pooling (POOL) Layer: This layer reduces the spatial size of the CONV layer outputs, which decreases the number of parameters and, in turn, the model’s computations.

  • Fully Connected Layer: It’s the final layer before the output layer. It is built using a fully connected network of neurons. It’s responsible for classifying the features learned by the CONV layers into the output classes.

  • Output Layer: This layer is crucial for producing the final results after processing the input through the previous layers. It typically consists of neurons that represent the classes for a classification task.

Fig. 1

CNN architecture.
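To make the layer stack above concrete, the following is a minimal Keras sketch of a generic CNN classifier containing the listed layer types (input, convolution, ReLU activation, pooling, fully connected, and softmax output). The layer counts, filter sizes, and input shape are illustrative assumptions, not the configuration of our proposed models.

```python
# A minimal sketch of a generic CNN classifier; all sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_generic_cnn(input_shape=(6, 8, 1), num_classes=8):
    return models.Sequential([
        layers.Input(shape=input_shape),                   # input layer
        layers.Conv2D(32, (3, 3), padding="same",
                      activation="relu"),                  # CONV layer + ReLU activation
        layers.MaxPooling2D((2, 2)),                       # POOL layer: shrink feature maps
        layers.Flatten(),
        layers.Dense(64, activation="relu"),               # fully connected layer
        layers.Dense(num_classes, activation="softmax"),   # output layer with class scores
    ])
```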

Second, Long Short-Term Memory (LSTM), which is another type of deep learning network. Unlike a CNN, it has a feedback architecture. It is an evolution of recurrent neural networks (RNN) that solves the vanishing-gradient problem. It can also learn long-term dependencies between different data records during training.

The building blocks of an LSTM network are called cells (Fig. 2). Each cell has three types of gates: input, output, and forget gates. The purpose of the gates is to regulate the flow of information, choosing whether to keep or discard it. The forget gate (Eq. 1) decides which information is discarded from the cell, where σ represents the sigmoid function, W_f is the weight matrix for the forget gate, b_f is its bias term, h_{t-1} is the previous hidden state, and x_t is the current input vector. The input gate (Eq. 2) decides which values are updated in the candidate cell state (Eq. 3), where W_i and W_C are the weight matrices and b_i and b_C are the bias terms for the input gate and candidate cell state, respectively. The cell state is updated by forgetting part of the old state and adding the new candidate values (Eq. 4). Finally, the output gate (Eq. 5) determines the next hidden state (Eq. 6), where W_o is the weight matrix and b_o is the bias term of the output gate. The * operator denotes element-wise multiplication, and the tanh function outputs values between −1 and 1 to help regulate the network’s internal state.

$$\:{f}_{t}=\sigma\:({W}_{f}\cdot\:[{h}_{t-1},{x}_{t}]+{b}_{f})$$
(1)
$$\:{i}_{t}=\sigma\:({W}_{i}\cdot\:[{h}_{t-1},{x}_{t}]+{b}_{i})$$
(2)
$$\:{\stackrel{\sim}{C}}_{t}=\text{t}\text{a}\text{n}\text{h}({W}_{C}\cdot\:[{h}_{t-1},{x}_{t}]+{b}_{C})$$
(3)
$$\:{C}_{t}={f}_{t}*{C}_{t-1}+{i}_{t}*{\stackrel{\sim}{C}}_{t}$$
(4)
$$\:{o}_{t}=\sigma\:({W}_{o}\cdot\:[{h}_{t-1},{x}_{t}]+{b}_{o})$$
(5)
$$\:{h}_{t}={o}_{t}*\text{t}\text{a}\text{n}\text{h}\left({C}_{t}\right)$$
(6)
Fig. 2

An LSTM cell.
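For illustration, the NumPy sketch below computes one LSTM cell step following Eqs. (1)–(6); the weight matrices, bias vectors, and dimensions are placeholders rather than trained values.

```python
# One LSTM cell step implementing Eqs. (1)-(6); shapes are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # Eq. (1): forget gate
    i_t = sigmoid(W_i @ z + b_i)             # Eq. (2): input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # Eq. (3): candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # Eq. (4): updated cell state
    o_t = sigmoid(W_o @ z + b_o)             # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)                 # Eq. (6): new hidden state
    return h_t, c_t
```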

Transformer

Transformer-based models have had a great impact on the field of deep learning. At their heart lies the self-attention mechanism, which provides a novel architecture that avoids traditional sequential processing in favor of parallelized processing. The Transformer architecture was first introduced in the Vaswani et al. paper “Attention Is All You Need”29. Unlike traditional sequence-modeling algorithms such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), it processes the whole input in parallel.

The self-attention mechanism, the core of the Transformer architecture, enables Transformers to capture long-range dependencies and patterns. Transformers typically consist of encoder and decoder components; by combining these components we can obtain different Transformer architectures, such as encoder-only, encoder-decoder, and decoder-only Transformers. In this paper, we chose to work with the encoder-only architecture, which is considered the most suitable one for classification. The first step encodes the input: it summarizes the input information by representing each feature as a high-dimensional vector through an embedding layer. Transformer-based models often employ multi-head attention to reinforce the power of the self-attention mechanism; the input is projected into multiple subspaces, allowing the model to attend to different parts of the input simultaneously. The Transformer architecture also includes feed-forward neural networks (FFNNs) in each layer. These FFNNs introduce non-linearity into the model and enable it to learn complex relationships between features. The outputs of the FFNNs are then passed through another layer-normalization step. Finally, the Transformer workflow generates the output classes, typically using a softmax layer.
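As a concrete illustration of these components, the sketch below builds a single encoder block in Keras with multi-head self-attention, residual connections, layer normalization, and a position-wise feed-forward network; the head count and dimensions are illustrative assumptions and do not describe our final model.

```python
# A sketch of one encoder-only Transformer block; sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder_block(x, num_heads=2, key_dim=16, ff_dim=64):
    # Multi-head self-attention lets every position attend to every other position.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)       # residual + normalization
    # Position-wise feed-forward network introduces non-linearity.
    ffn = layers.Dense(ff_dim, activation="relu")(x)
    ffn = layers.Dense(x.shape[-1])(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(x + ffn)     # residual + normalization

# Example usage on a 48-step sequence of 16-dimensional embeddings (assumed shape):
# out = transformer_encoder_block(layers.Input(shape=(48, 16)))
```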

Proposed models

In this paper, we propose and compare two distinct models for network intrusion detection. The first is a hybrid architecture combining Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks; the second utilizes an encoder-only Transformer architecture. Our contributions are as follows:

  • Build the CNN-LSTM model, which allowed us to capture local patterns using CNN layers and long-term dependencies using LSTM layers.

  • Build the Transformer model, which allowed us to efficiently process inputs in parallel and capture long-range dependencies.

  • Train the models for both multi-class and binary classification.

  • Efficiently reduce the training and testing feature set to only 4 features.

CNN-LSTM proposed model

In the development of the CNN-LSTM model for network intrusion detection, several preprocessing steps were undertaken. Initially, the dataset was cleaned to remove any inconsistencies or outliers that could affect the model’s performance. Following cleaning, the data was shuffled to randomize the order of samples. Finally, the data was normalized using standard scaling, which subtracts the mean and divides by the standard deviation for each feature, effectively transforming the data to have a mean of zero and a standard deviation of one (Eq. 7). After normalization, the input data was reshaped from its original 48 × 1 format to a 6 × 8 format compatible with the subsequent CNN layers.

$$\:{x}_{scaled}=\frac{x-mean\left(x\right)}{std\left(x\right)}$$
(7)
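A minimal sketch of these preprocessing steps (shuffling, standard scaling per Eq. 7, and reshaping each 48-feature record to 6 × 8) is shown below; X and y stand for the cleaned InSDN feature matrix and label vector.

```python
# Shuffle, scale (Eq. 7), and reshape the cleaned data; X is (n_samples, 48).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

def preprocess(X, y):
    X, y = shuffle(X, y, random_state=42)        # randomize sample order
    X = StandardScaler().fit_transform(X)        # (x - mean) / std per feature, Eq. (7)
    # In practice the scaler should be fit on the training split only.
    X = X.reshape(-1, 6, 8, 1)                   # 48 x 1 record -> 6 x 8 grid for the CNN
    return X, y
```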

The CNN-LSTM model architecture (Fig. 3) comprises two convolutional layers, each followed by a max-pooling layer. The convolutional layers capture local patterns and features by convolving learnable filters over the input data, while the max-pooling layers downsize the resulting feature maps. Following the convolutional layers, a single LSTM layer was incorporated into the architecture; it plays a crucial role in learning long-term dependencies. After the LSTM layer, the network includes two dense layers, whose weights are used for the classification step. Finally, a softmax layer produces probabilistic predictions across the different classes.

Fig. 3

Proposed CNN-LSTM model architecture.
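The Keras sketch below mirrors this architecture: two convolution/max-pooling stages, one LSTM layer, two dense layers, and a softmax output. Filter counts, kernel sizes, and unit sizes are illustrative assumptions rather than the exact hyperparameters of our model.

```python
# A sketch of the proposed CNN-LSTM architecture (Fig. 3); sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(6, 8, 1), num_classes=8):
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)   # local patterns
    x = layers.MaxPooling2D((2, 2))(x)                                      # downsize feature maps
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    x = layers.Reshape((-1, 64))(x)               # treat remaining cells as a short sequence
    x = layers.LSTM(64)(x)                        # long-term dependencies
    x = layers.Dense(64, activation="relu")(x)    # two dense layers before the output
    x = layers.Dense(32, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)   # probabilistic predictions
    return models.Model(inp, out)
```

Such a network can then be compiled, for example with categorical cross-entropy loss, and trained on the 6 × 8 inputs produced by the preprocessing step.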

Transformer model

In the development of the Transformer model, preprocessing steps similar to those of the CNN-LSTM model were undertaken: the dataset was cleaned to remove inconsistencies and outliers, shuffled to randomize the order of samples, and normalized using standard scaling. Unlike the CNN-LSTM model, reshaping the input data was unnecessary, because the Transformer’s self-attention mechanism can process input sequences without reshaping.

The Transformer model architecture (Fig. 4) uses an encoder-only design for our classification task. The architecture begins with a multi-headed self-attention layer with two heads, which enables the model to capture dependencies between different parts of the input sequence. Following the self-attention layer, a normalization layer is applied to stabilize the learning process. The architecture then includes five one-dimensional convolutional layers (Conv1D), each followed by its own normalization layer. These convolutional layers further extract features from the input sequence, enhancing the model’s ability to discern patterns relevant to network intrusion detection. After the convolutional layers, a global average pooling 1D layer is incorporated, and the model concludes with a softmax dense layer, just like the CNN-LSTM model.

Fig. 4

Proposed Transformer model architecture.
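The sketch below reflects the described pipeline: a two-head self-attention layer with normalization, five Conv1D layers each followed by normalization, global average pooling, and a softmax dense output. The initial feature embedding, filter sizes, and other hyperparameters are illustrative assumptions.

```python
# A sketch of the proposed encoder-only Transformer IDS (Fig. 4); sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transformer_ids(num_features=48, num_classes=8, num_heads=2):
    inp = layers.Input(shape=(num_features, 1))      # each feature as one sequence step
    x = layers.Dense(16)(inp)                        # assumed embedding of each scalar feature
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=16)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)       # stabilize learning
    for filters in (64, 64, 32, 32, 16):                         # five Conv1D layers
        x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.LayerNormalization(epsilon=1e-6)(x)           # normalization after each
    x = layers.GlobalAveragePooling1D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```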

Results and evaluation

In this section, we introduce a comprehensive evaluation of our models, the dataset we chose for training and testing, and the results we obtained.

Dataset

Numerous datasets have been developed for training network-based machine learning models. The InSDN30 dataset is a crucial resource in Software Defined Networking (SDN). It offers valuable insights into network traffic behavior. As the authors mention, this dataset is derived from real-world network traffic in an SDN environment. It contains a comprehensive repository of packet-level information, covering various network activities such as host communications, flow characteristics, and potential security incidents. The dataset effectively captures the complexities of SDN infrastructures and facilitates in-depth exploration of SDN deployments.

The InSDN dataset includes a wide range of network attacks along with normal traffic records, with a total size of 343,889 records, as detailed in Table 1. Each record originally consisted of 77 raw features after removing socket information such as IP addresses, flow IDs, etc. These features were subsequently reduced to 48 by other researchers, and further to 9 features. In our study, we refined the feature set even further, reducing it to 6 features, as detailed in Table 2, and finally to only 4 features, as shown in Table 3.

Table 1 InSDN classes list.
Table 2 InSDN 6-features.
Table 3 InSDN 4-features.
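Reducing the feature set amounts to keeping only the columns named in Tables 2 and 3. The sketch below illustrates this with placeholder column names and a hypothetical file path; the real names are those listed in the tables.

```python
# Illustrative column selection; names and path are placeholders, not the real ones.
import pandas as pd

SIX_FEATURES = ["feature_1", "feature_2", "feature_3",
                "feature_4", "feature_5", "feature_6"]   # stand-ins for Table 2
FOUR_FEATURES = SIX_FEATURES[:4]                          # stand-ins for Table 3

df = pd.read_csv("insdn_merged.csv")                      # hypothetical path
X6, X4, y = df[SIX_FEATURES], df[FOUR_FEATURES], df["Label"]   # "Label" column assumed
```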

Evaluation metrics

To assess our DL models, we used some of the most common evaluation metrics: accuracy, precision, recall, and F1 score. These metrics provide a comprehensive understanding of our models’ performance in classifying both malicious and benign network traffic. The calculation of these metrics involves the four values of the confusion matrix:

  • True Positives (TP): Represent the correctly identified malicious records.

  • True Negatives (TN): Represent the correctly identified benign records.

  • False Positives (FP): Represent benign records incorrectly identified as malicious.

  • False Negatives (FN): Represent malicious records incorrectly identified as benign.

Accuracy measures the proportion of correctly predicted records out of the total instances.

$$\:Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(8)

Precision measures the proportion of correctly predicted positive records out of the total predicted positive records.

$$\:Precision=\frac{TP}{TP+FP}$$
(9)

Recall measures the proportion of correctly predicted positive records out of the total actual positive records.

$$\:Recall=\frac{TP}{TP+FN}$$
(10)

The F1 score provides a single metric that balances the model’s precision and recall.

$$\:F1\:Score=\frac{2\:\times\:\:Precision\:\times\:\:Recall}{Precision\:+\:Recall}$$
(11)
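In practice, Eqs. (8)–(11) can be computed directly from the true and predicted labels; the sketch below uses scikit-learn, with weighted averaging assumed for the multi-class case.

```python
# Compute Eqs. (8)-(11) from ground-truth and predicted labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "accuracy":  accuracy_score(y_true, y_pred),                        # Eq. (8)
        "precision": precision_score(y_true, y_pred, average="weighted"),   # Eq. (9)
        "recall":    recall_score(y_true, y_pred, average="weighted"),      # Eq. (10)
        "f1":        f1_score(y_true, y_pred, average="weighted"),          # Eq. (11)
    }
```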

Experimental results

The results presented in this section were obtained using a laptop with Windows 10 Pro, a Core i5-9300H CPU, a GTX 1650 GPU, and 8 GB of RAM. This hardware configuration was employed to train and evaluate our models on the InSDN dataset. We used accuracy, precision, recall, and F1 score to assess the ability of each model to classify network traffic into the 8 classes listed in Table 1. We also experimented with different numbers of features: 48, 6, and 4. Table 4 presents the summarized results of our experiments.

Table 4 Obtained results for both the CNN-LSTM and Transformer using 8 classes.

The Transformer model with 48 features achieved the highest accuracy at 99.02%. The CNN-LSTM model with the same number of features followed closely at 99.01%; with identical F1 scores, we cannot prefer one model over the other. These results indicate that the models perform well with a comprehensive feature set. When the number of features was reduced to 6, the Transformer model maintained a high accuracy of 98.90% against 98.80% for the CNN-LSTM. With a further reduction to only 4 features, the Transformer model proved its superiority with 98.50% accuracy, slightly lower but still remarkably high compared with the full feature set. Detailed results are shown in Table 5.

Table 5 Detailed results for CNN-LSTM and Transformer using 8 classes.

When evaluating the performance of the CNN-LSTM and Transformer models with 48 features for each attack type, we observed that the results for normal traffic, DDoS, DoS, and probe attacks were satisfactory across all four metrics. The models achieved good accuracy, precision, recall, and F1 scores for these attack types, as shown in Fig. 5.

Fig. 5

Comparing results for both models using 8 classes.

However, the performance for brute force attack (BFA), web attack, botnet, and user-to-root (U2R) attacks was significantly lower, according to the results in Table 5. This poor performance can be attributed to the limited number of records available for these attack types in the dataset: only 17 records for U2R, 164 for botnet, 192 for web attack, and 1405 for BFA. The scarcity of training data for these attacks likely hindered the models’ ability to learn their patterns, causing many misclassifications in these categories, as seen in the confusion matrices (Fig. 6).

Fig. 6

Confusion matrices for both CNN-LSTM and Transformer.

To address the low accuracy for BFA, U2R, web attacks, and botnet attacks, we grouped them into one class called Others, leaving 5 classes. The results of this experiment showed an improvement at all levels except for the 4-feature set, as shown in Tables 6 and 7. Furthermore, we upsampled U2R, web attacks, and botnet attacks to 1405 records each, the same as BFA, and then built a dedicated classifier for these 4 attacks (Fig. 7). This classifier was a simple CNN model, and its results were excellent and beyond expectations, as shown in Table 8.

Table 6 Obtained results for both the CNN-LSTM and Transformer using 5 classes.
Table 7 Detailed results for CNN-LSTM and Transformer using 5 classes.
Fig. 7

Two stages model for classifying others.

Table 8 Detailed results for others classes using CNN model.
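A sketch of the upsampling step used before training the dedicated Others classifier is shown below, resampling each rare class to 1405 records to match BFA. The dataframe, label column, and class-name spellings are assumptions based on Table 1.

```python
# Upsample the rare attack classes to 1405 records each; naming is assumed.
import pandas as pd
from sklearn.utils import resample

RARE_CLASSES = ["BFA", "U2R", "Web-Attack", "BOTNET"]    # spellings assumed from Table 1
TARGET = 1405                                            # match the BFA record count

def build_others_training_set(df, label_col="Label"):
    parts = [
        resample(df[df[label_col] == cls], replace=True,
                 n_samples=TARGET, random_state=42)
        for cls in RARE_CLASSES
    ]
    return pd.concat(parts).sample(frac=1.0, random_state=42)   # shuffle combined records
```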

The improvement in the 5-class classification results inspired us to perform binary classification by merging all attack types into a single class; the results are shown in Table 9. This approach improved results for both models. Using 48 features, accuracy increased to 99.08% for the CNN-LSTM model and to 99.16% for the Transformer. Remarkably, with 6 features the CNN-LSTM model achieved 99.19% accuracy, surpassing its 48-feature performance. This result was unexpected and suggests that the model may be more efficient with a streamlined feature set. The Transformer model showed no improvement with 6 features. With 4 features, the CNN-LSTM model reached 98.86% accuracy, while the Transformer improved significantly to 98.93%.

Table 9 Experimental results for both the CNN-LSTM and Transformer using binary classification.
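Merging all attack types into a single class reduces the task to a binary Normal-versus-Attack problem; a minimal sketch of the label mapping is shown below, assuming the benign class is labeled "Normal".

```python
# Collapse multi-class labels into a binary target; the "Normal" label name is assumed.
import numpy as np

def to_binary(labels):
    labels = np.asarray(labels)
    return np.where(labels == "Normal", 0, 1)   # 0 = benign traffic, 1 = any attack
```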

Binary classification improved the accuracy and F1 score for both models (Fig. 8). The higher metrics come from reducing the complexity of distinguishing between multiple attack types. However, multi-class classification is more valuable for most security applications, where specifying the type of attack allows for more targeted mitigation strategies.

Fig. 8

Comparing results for both models using binary classification.

Results discussion

In this section, we compare the performance of our proposed CNN-LSTM and Transformer models with those from related works, examining the accuracy and F1 score metrics to gain insight into these different models. Setting up such a comparison is not easy, because most related works use different datasets and different feature sets. Nevertheless, we compared our CNN-LSTM with the works that use the InSDN dataset (Fig. 9). The comparison in Table 10 shows that our CNN-LSTM model with 6 features is only 0.2% lower in accuracy than27, which was trained using 48 features, while being 0.73% higher in F1 score, and it beats all the others. For the Transformer model, we are the first group to apply this technique to the InSDN dataset.

Table 10 Comparing our proposed models’ results with other state-of-the-art approaches.
Fig. 9

Comparing our results with related works that use the InSDN dataset.

Conclusion

In this study, we developed and evaluated two distinct models for network intrusion detection: a hybrid CNN-LSTM model and an encoder-only Transformer model. Both models were trained and tested using the InSDN dataset, with different numbers of features used to assess their performance. Our evaluation metrics included accuracy, precision, recall, and F1 score, comprehensively assessing each model’s classification capabilities.

The Transformer model with 48 features achieved the highest accuracy, and the CNN-LSTM model with 48 features performed almost as well. However, the CNN-LSTM model with 6 features outperformed the Transformer model, and both models maintained high accuracy even with fewer features, so both can be considered robust under a reduced feature set. Another crucial metric is the training/testing time: the Transformer model takes significantly longer, which must be considered when choosing between the models. This suggests that for our IDS application, where feature reduction and short testing time are necessary, the CNN-LSTM model might be the more suitable choice.

When we evaluated the performance of these models for individual attack types using 48 features, we found that both models performed well for normal traffic, DDoS, DoS, and probe attacks. However, performance was notably poorer for web attacks, botnet, BFA, and U2R attacks, which we attribute to the limited number of records available for these attack types. This pushed us to group all these poorly represented attacks into one class called Others and rerun the models, which produced better results, especially for the Transformer. In addition, we created a dedicated CNN model for those 4 attacks to correctly reclassify a record whenever one of the original models labels it as Others.

Furthermore, we explored binary classification by merging all attack types into a single class. This approach yielded improved results for both models, but it may not provide the granularity needed to distinguish between different types of attacks.

Future work

In the near future, we plan to investigate several avenues. We may explore advanced feature selection techniques on the raw InSDN Wireshark captures. We will include evaluations on larger datasets and consider metrics such as training and testing time, computational cost, and learning curves. We can also include real-world network traffic to provide a more comprehensive assessment. Another promising direction is ensemble learning, which could help combine the strengths of both the CNN-LSTM and Transformer models. We will do our best to develop reliable security solutions for software-defined networks.