Article

Unknown Traffic Recognition Based on Multi-Feature Fusion and Incremental Learning

1 Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
2 School of Nuclear Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
3 China Spallation Neutron Source Science Center, Dongguan 523803, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7649; https://doi.org/10.3390/app13137649
Submission received: 30 May 2023 / Revised: 23 June 2023 / Accepted: 24 June 2023 / Published: 28 June 2023
(This article belongs to the Special Issue Data-Driven Cybersecurity and Privacy Analysis)

Abstract:
Accurate classification and identification of Internet traffic are crucial for maintaining network security. However, unknown network traffic in the real world can affect the accuracy of current machine learning models, reducing the efficiency of traffic classification. Existing unknown traffic classification algorithms are unable to optimize traffic features and require the entire system to be retrained each time new traffic data are collected. This results in low recognition efficiency, making the algorithms unsuitable for real-time application detection. To solve the above issues, we suggest a multi-feature fusion-based incremental technique for detecting unknown traffic in this paper. The approach employs a multiple-channel parallel architecture to extract temporal and spatial traffic features. It then uses the mRMR algorithm to rank and fuse the features extracted from each channel to overcome the issue of redundant encrypted traffic features. In addition, we combine the density-ratio-based clustering algorithm to identify the unknown traffic features and update the model via incremental learning. The classifier enables real-time classification of known and unknown traffic by learning newly acquired class knowledge. Our model can identify encrypted unknown Internet traffic with at least 86% accuracy in various scenarios, using the public ISCX-VPN-Tor datasets. Furthermore, it achieves 90% accuracy on the intrusion detection dataset NSL-KDD. In our self-collected dataset from a real-world environment, the accuracy of our model exceeds 96%. This work offers a novel method for identifying unknown network traffic, contributing to the security preservation of network environments.

1. Introduction

Human society has now entered an era of big data as a result of the quick uptake of mobile Internet and the digital revolution. Particularly in recent years, with the widespread adoption of cloud technology, 5G, and Internet of Things (IoT), the number of people using the Internet has increased; and more and more smart devices, such as smartphones and mobile homes, are connected to the Internet [1]. Additionally, various applications and services, such as social media and online videos [2], have become available to users. All of these factors have contributed to the tremendous growth in Internet traffic in the big data era [3].
Network security and performance are challenged by massive amounts of network traffic. The rise in traffic volume has led to an increase in the amount of sensitive data being transmitted, which in turn has necessitated the use of security measures such as encryption technology [4]. However, an increase in traffic also means an increase in potential attacks such as spam and malware [5], which can cause network failure. With the widespread adoption of encryption protocols such as SSL/TLS, it has become increasingly challenging to identify malicious activities hidden within massive network traffic. Moreover, in the past couple of years, the impact of the COVID-19 pandemic has led to a significant increase in online activities, resulting in a higher frequency of VPN and Tor tunneling encryption technologies being used. The need for automated traffic analysis tools and techniques has increased in response to the complex network environment [6].
Traffic classification, as one of the key functions of an automated network intrusion detection system, is essential for maintaining the security and stability of a network environment. Traffic classification enables network administrators to control specific network traffic by identifying its sources and destinations, which can prevent the inappropriate use of the network and the leakage of sensitive information [7]. Furthermore, this technology can assist with identifying potential security risks and block or restrict malicious traffic [8,9], thereby ensuring the integrity and dependability of the network.
Internet traffic identification techniques can be broadly categorized into four types: port-based, packet-payload-based, behavior-based, and flow-based [10]. The port-based approach [11] used to be one of the common measures of traffic identification. This method of identifying applications by looking up the corresponding port number in the list of port numbers published by the Internet Assigned Numbers Authority (IANA) is gradually losing its usefulness in scenarios of incomplete port assignments or dynamic port allocation. Packet-payload-based traffic classification techniques [12] gather payload features from network packets and match them with existing attribute identification databases to classify the traffic. However, the payload may contain limited information, making it challenging to accurately classify all forms of traffic. Additionally, the utilization of encryption hinders the examination of network packet contents. Behavior-based network traffic identification [13] utilizes the examination of device or user behavior to identify and categorize network traffic. This method offers a more in-depth understanding of network traffic compared with other techniques such as port-based or packet-payload-based identification. Additionally, it faces challenges in managing the complexities of various endpoints and users and requires more computational power. Flow-based traffic classification techniques [14] concentrate more on the network traffic itself. Currently, the common approach is to utilize feature engineering, which involves preprocessing and organizing raw data, extracting various features to meet specific task objectives, and utilizing specially designed algorithms for recognition. The two main steps in flow-based traffic classification techniques involve feature selection and extraction, followed by model designing and training [15]. As encrypted traffic becomes increasingly prevalent, many approaches are now utilizing statistical features to train classifiers for classification using machine learning and deep learning. Statistical features, such as flow length, average packet length, flow start and end times, etc., are often calculated for the entire data stream [16]. Relationships between multiple packets, such as minimum interpacket delay and packet up/down correlation properties are also considered [16]. The design and construction of these statistical features require expertise in related fields. Deep learning approaches [17,18], on the other hand, use the concept of representation learning to eliminate the need for manual feature engineering. They automatically extract features that are closer to the actual traffic from the encrypted raw data, allowing for better differentiation of traffic classes. Therefore, deep-learning-based feature extraction techniques have been extensively deployed for the identification of traffic in networks.
Numerous studies have been conducted on deep-learning-based traffic identification, and the identification accuracy of these systems is satisfactory. However, as the network environment has changed so rapidly, more and more flaws in the current traffic identification systems have come to light. Currently, the biggest issue is that network traffic identification is carried out in closed sets, meaning that all potential classes in a classification task are known at the time of training. Existing detection systems cannot correctly identify new types of traffic when they arrive. These new unknown types of traffic are often misclassified as known traffic categories, resulting in a high rate of false positives in the detection system. Many unknown attacks, such as zero-day attacks and new variants of malware, generate malicious traffic that can exploit the vulnerability to evade detection by traffic monitoring systems, posing a serious threat to network security. Therefore, the unknown traffic identification problem, also known as the open-world traffic identification or zero-day application identification problem [19], needs to be studied.
The process of identifying unknown traffic in a detection system can generally be divided into three stages. Firstly, the system separates known traffic from unknown traffic, thus achieving an accurate classification of known traffic. Secondly, the system detects new classes from the separated unknown traffic, labels the recognized new classes, and adds them to the known classes. Finally, there is a phase of incremental learning, where the previous model is updated based on the updated known class dataset. Due to the unlabeled characteristics of unknown traffic, most current research in this area tends to use unsupervised machine learning models for the recognition of unknown traffic.
Existing unknown traffic recognition algorithms have the following limitations: In terms of traffic feature selection, the existing methods usually need to undergo complex and time-consuming feature engineering. Feature selection is highly dependent on domain expert experience, and the types of features selected are relatively limited, resulting in low identification efficiency. Another problem is that the entire system needs to be retrained every time new traffic data are collected, which requires a lot of time and effort, causing the system to have poor real-time performance and utility. As a result, these methods are not well suited for real-time applications and are primarily used for offline data classification, making it difficult to meet the demands of modern intelligent network supervision.
In this study, we combined the benefits of automatic feature extraction and deep learning to design and implement an algorithm for classifying unknown traffic. This algorithm is tailored to meet the requirements of current network identification tasks and not only identifies known traffic but also distinguishes unknown traffic in real time with high identification accuracy. The main contributions of this paper are as follows:
  • We propose an intelligent feature processing method that uses a multiple-channel parallel neural network to extract temporal and spatial features from raw network traffic data, combined with feature fusion based on the mRMR algorithm, which achieves a good balance between accuracy and time consumption in traffic detection.
  • We present a clustering approach based on the density ratio to distinguish between known and unknown traffic features and construct new classes for unknown traffic. The method can dynamically expand the clustering results by adding new data, thus improving efficiency and handling noise.
  • We improve an incremental learning multi-class SVM classifier that autonomously learns based on the features of known traffic and detected unknown traffic, without the need to train the classifier from scratch each time.
  • We establish an incremental learning model for unknown traffic identification and validate its performance on the public datasets ISCX-VPN-Tor and NSL-KDD and a self-collected dataset, SelfDataset.
The remainder of this paper is structured as follows: The study background and related studies are given in Section 2, with an emphasis on the commonly used techniques for identifying unknown network activity. Our suggested framework for an incremental unknown traffic detection model is detailed in Section 3. The setup and results of the experiment are described and discussed in Section 4. Finally, the entire paper is summarized in Section 5.

2. Related Work

In the context of encrypted traffic analysis [20], we briefly introduce the machine learning methods used for traffic classification in this section, which include methods for detecting known traffic and methods for unknown traffic.

2.1. Known Internet Traffic Identification

As we already stated in Section 1, the traditional port-based and packet-payload-based traffic detection methods are not suitable for encrypted traffic classification because of simple matching rules. Current research on encrypted traffic has mainly focused on the deep learning field. Network structures such as convolutional neural networks (CNNs), with outstanding performance in image classification; recurrent neural networks (RNNs), suitable for natural language processing; and knowledge graphs, with excellent visualization performance, have been applied in the field of traffic processing.
Ref. [21] first presented the application of an end-to-end method in the field of encrypted traffic classification. The method utilizes a one-dimensional convolutional neural network (1D-CNN) to extract and select traffic features, integrating feature extraction, feature selection, and classifier into a unified end-to-end framework. The paper demonstrated that 1D-CNN is more suitable for encrypted traffic classification tasks than 2D-CNN as network traffic is essentially one-dimensional and continuous data streams, taking full advantage of the strengths of 1D-CNN.
Tree-RNN [22] integrates multiple end-to-end small classifiers to form a large tree-form classifier. The individual small classifiers complement each other in terms of performance, overcoming the problem where a single classifier is less accurate in some categories. Tree networks can automatically learn the nonlinear relationships between input and output data without extracting features.
CNNs and RNNs can also be used in combination. App-Net, proposed in [23], learns a joint flow-app embedding by feeding TLS traffic into a 1D-CNN and a bi-LSTM in parallel to describe traffic sequence patterns and unique application signatures. The method achieved 96.41% accuracy and a 91.89% macro-F1 score on a real-world dataset covering 65 applications.
Since both CNN and RNN learning and training depend on labels, semisupervised or unsupervised learning methods have been proposed to cope with poorly labeled or unlabeled data. For example, ByteSGAN [24] only needs to use a small amount of tagged traffic to achieve the goal of traffic classification in a fine-grained manner. MT-FlowFormer [25] is an efficient classifier with an attention mechanism that extracts features from flow sequences at low computational cost and a mean-teacher-style semisupervised framework to exploit unlabeled flow data. Ref. [26] applies an improved k-means algorithm to traffic classification, effectively overcoming the problem of clustering results depending on the value of k, and analyzes the impact factors of clusters. The clustering-based quadratic network (CQNet) proposed in [27] applies metric learning to the classification problem of encrypted traffic of decentralized applications (DApps), reducing the redundancy in the training dataset and thus improving the efficiency of the classifier. Ref. [28] implemented an unsupervised anomaly detection method using a three-layer autoencoder that achieved an F1-measure of 95%, which was also competitive with supervised learning algorithms.
The above methods provide effective solutions for encrypted traffic detection from different perspectives. However, considering the more complex network environment in the real world, the traffic classes to be identified are more abundant and various, and there are often unknown traffic types that do not exist in the training set. When testing traffic samples of unknown categories, the above methods might misclassify them into known categories, resulting in a high false positive rate, significantly affecting the accuracy of the classifier.

2.2. Unknown Internet Traffic Identification

To address the challenges posed by the closed-set assumption described above, researchers have combined the encrypted traffic classification problem with open-set recognition to detect unknown encrypted traffic.
There are currently several main ideas to solve the unknown traffic identification problem: supervised, unsupervised, and semisupervised learning methods.
Traditional supervised neural networks are unable to directly handle unknown traffic categories due to limitations in their model structure and training methods, which restrict their generalization ability. The classifier’s fixed number of categories makes it impossible to reserve input space for unknown categories. During testing, when unknown samples appear, the SoftMax layer of the model assigns a high probability to the closest known training class, resulting in the unknown sample being classified to the nearest known category. Some researchers [29] attempted to add an unknown category node in the SoftMax layer, but this requires providing some unknown samples as negative examples during training. Doing so may affect the model’s classification performance on known categories, and it cannot cover all possible unknown samples. Some researchers have proposed identifying unknown traffic by improving the output layer of the neural network, mainly based on confidence or backpropagation gradients. For example, Zhang et al. [30] proposed a network intrusion detection method based on open-set identification and investigated how extreme value theory (EVT) can be applied to unknown attack detection systems. The method establishes an Open-CNN model through introducing an OpenMax layer instead of the traditional SoftMax layer, which fits the postidentification activation of known classes to a Weibull distribution. Then, the pseudo-probability of unknown classes can be estimated from the activation scores of known classes by recalculating the activation in the penultimate layer, enabling the detection of unknown attacks. Ref. [31] also used the Open-CNN model. Meta-recognition theory was used to recalculate the activation vectors of the second-to-last layer, thus estimating the probability of unknown categories. In addition, active learning methods were combined, and a minimum confidence query strategy was used to select a subset of detected unknown attack samples for labeling and retraining. This method improves the ability to identify unknown attacks by more than 9% compared with the previous method. Xia et al. [32] proposed a gradient-based additive angular margin loss (ArcFace) model, GMAF, which uses the first backpropagation of the ArcFace loss layer weight gradient to distinguish known from unknown traffic. In addition, the additive angular margin was added to the category balance focal loss, which improved the separability of categories. Ref. [33] integrated random forest, Adaboost, gradient boost, and XGBoost to build an ensemble classification framework Cor-ENTC to classify VPN traffic entering the SDN environment. The scheme is highly efficient in communication, significantly more accurate than a single classifier, and can also handle unknown applications. Although these improved methods can identify unknown traffic to a certain extent, they all ignore the key part of unknown traffic during training, which, to some extent, reduces performance, and the overall accuracy is poor in the mixed detection of known and unknown traffic. In addition to this, existing methods have a drawback in that most of the proposals are burdened with unnecessary difficulty in algorithm design to synchronize the classification of known traffic and the recognition of unknown traffic. In reality, we already have many well-performing network traffic classification methods. Therefore, separating the classification task from the recognition task would be an improvement. 
The existing classifiers can be used to separate known traffic from unknown traffic, focusing on the classification of known traffic to improve classification accuracy. However, for unknown traffic, only preliminary extraction of unknown samples can be performed, without fine-grained classification. Further operations are required to classify the separated unknown traffic in more detail.
Unknown encrypted traffic in the real world does not contain prior information such as labels. In this case, unsupervised learning methods are more applicable for detection. Ref. [34] used a lightweight structure with a two-layer architecture of random forest and SVM and a voting mechanism to achieve mixed-traffic classification including a large amount of unknown traffic with a low model update cost. Clustering algorithms are also commonly used to identify unknown Internet traffic. Zhao et al. [35] proposed an unsupervised unknown traffic identification framework based on n-gram embeddings, a deep autoencoder, and clustering algorithms. The n-gram embeddings strategy converts traffic data into high-dimensional vector representations, capturing both local and global features of traffic. The use of deep neural networks enhances the discriminative power of traffic. Additionally, the unsupervised classification of traffic vectors is achieved using algorithms such as k-means and spectral clustering, resulting in high clustering purity. Zhang et al. [36] proposed another unsupervised framework based on the MPCKMeans algorithm, called DePCK, to improve the classification of mixed unknown traffic. This framework fully utilizes traffic correlations to guide the process of pairwise constrained clustering. Extracting and learning unknown features from unlabeled data is the main advantage of unsupervised methods for unknown traffic classification. However, mapping the extracted clusters to limited application types is a challenging task in the absence of prior information.
Semisupervised clustering adds a portion of labeled data to the training process to solve the mapping problem of traffic clustering to application categories. For example, in ref. [37], the authors extracted features from encrypted packet headers and payloads using a CNN and an RNN and employed a self-organizing map (SOM) to cluster the extracted features into different classes. Another example of semisupervised learning method for unknown traffic detection was proposed by [38], which uses a variational autoencoder (VAE) to learn latent representations from encrypted packet data. A Gaussian mixture model (GMM) was used to cluster the latent representations into different components based on their probability distribution. Additionally, they developed an open-source platform called OpenCBD that can identify and analyze unknown protocols in network flows. The model proposed in [39] also comes with a self-labeling process after clustering, which can provide accurate labels for packets of unknown categories, thus creating new datasets and supporting autonomous learning updates of the classifier. Ref. [40] proposed a data flow classification method based on an improved Harris Eagle algorithm combined with fuzzy C-means clustering. The method maps data flow samples to Harris Eagle population individuals, finds the optimal position through several iterations of the improved Harris Eagle optimization (IHHO) algorithm, and uses it as the initial clustering center to guide data flow classification by clustering according to the maximum membership principle, which improves the accuracy and stability of classification to some extent. Ref. [41] classified traffic flows under the protocol level. The researchers designed a dual-path autoencoder combining a convolutional autoencoder and a depth autoencoder for feature extraction and aggregation, then aggregating unknown traffic flows into multiple high-purity classes via correlation-adjusted clustering. The method achieves more than 98% classification accuracy on a predetermined number of 60 clusters. However, general clustering methods require knowledge of the number of clusters and need to be reclustered whenever a new class appears, which does not make good use of the previous clustering results. This leads to low computational efficiency and limits scalability.
To overcome the shortcomings of existing unknown traffic identification methods, we aimed to establish a hybrid detection model in this study that not only enables closed-world traffic classification but also detects unknown application traffic. The model autonomously learns based on the knowledge of existing datasets and detects unknown traffic categories without requiring training of the classifier from scratch each time.

3. Design of Framework

This paper suggests a framework that combines a data preprocessing algorithm, a feature extraction and fusion algorithm, unknown traffic recognition, and an incremental learning algorithm to achieve the identification of unknown types of traffic in a closed set. Figure 1 depicts the model architecture suggested in this paper.

3.1. Data Preprocessing

The flow–image transformation method [42], which was used in our previous study for raw traffic features, is utilized in the data preprocessing stage. Data preprocessing converts raw traffic data from PCAP packages to IDX file format through three steps: traffic splitting, traffic cleaning, and image conversion. We split traffic packets containing the same five-tuple information into a flow using SplitCap so that each small PCAP file contains a single TCP or UDP session. In order to lessen the subsequent workload, the traffic cleaning step removes packets that are not useful for traffic classification, such as short sessions with insufficient payload size and auxiliary packets such as Domain Name Service (DNS) packets for host name resolution. As MAC and IP addresses cannot be used as training features for traffic classification and can obstruct the feature extraction process of the neural network, which can result in model overfitting, traffic anonymization is also carried out in this step. Each data-cleaned PCAP file is cropped to a group of 1024 (32 × 32) bytes, each of which is represented as a pixel, and is then converted into a single-channel grayscale image of size 32 × 32. These pictures show the natural characteristics of the traffic, and it is evident that the various traffic image types can be easily distinguished from one another. To provide data to the deep learning model, the file containing all pixel sequences is transformed into an IDX format file. Through this preprocessing process, we convert the raw traffic data into images, facilitating the application of advanced image processing techniques to traffic processing.
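As an illustration of the byte-to-image step, the following minimal sketch converts one cleaned session into a 32 × 32 grayscale array; the function name and the dummy payload are ours, not part of the paper's toolchain:

```python
import numpy as np

def session_to_image(payload: bytes, size: int = 1024) -> np.ndarray:
    """Crop or zero-pad a cleaned session to 1024 bytes and reshape it into a
    32 x 32 single-channel grayscale image (one byte per pixel)."""
    buf = payload[:size].ljust(size, b"\x00")
    pixels = np.frombuffer(buf, dtype=np.uint8)
    return pixels.reshape(32, 32)

# Example: a dummy session becomes a 32 x 32 uint8 image ready for IDX packing.
img = session_to_image(b"\x16\x03\x01" * 500)
print(img.shape)  # (32, 32)
```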

3.2. Feature Extraction and Fusion

In this study, we extracted features from the preprocessed traffic using three pretrained models. In order to extract spatial information, we use two CNN models, AlexNet and VGG16, and an LSTM architecture is used to learn temporal features. For each neural network channel, the top 100 deep features were chosen after the collected deep features were downscaled using the minimum redundancy maximum relevance (mRMR) algorithm. The features produced by the mRMR feature selection method are then combined and input into the classifier.

3.2.1. Spatial and Temporal Feature Extraction

CNNs, as a biologically inspired model, have a classical structure consisting of convolutional, pooling, and fully connected layers, allowing them to learn sophisticated representations of the input data and extract significant features from them, reflecting the essential patterns in the data. This is why they are frequently utilized and achieve outstanding results in demanding tasks such as object detection [43] and semantic segmentation [44]. AlexNet and VGG16 have both achieved state-of-the-art results in image classification tasks and have subsequently become popular benchmarks for evaluating new CNN architectures. The two different CNN networks were employed in this study with the network structure shown in Figure 2, and they were used to automatically extract the spatial features of the original traffic from the input images.
AlexNet [45] is a seminal deep learning model, introduced in 2012, that earned the top place in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark contest for image classification, in that year. AlexNet uses a relatively simple convolutional and pooling layer structure, which makes the structure and parameters of the network easy to understand and interpret, and the deep network structure in turn makes it excellent for image feature extraction. We trained an AlexNet model without a local response normalization layer in this study. It had 5 convolutional layers and 3 fully connected layers with an input image size of 32 × 32 × 1. Each convolutional layer convolves the input with the convolutional kernel and then maps the output features via the activation function. In our model, the filter size of the convolutional layers is 5 × 5 pixels, and the stride is 1. The ReLU function is the activation function of the convolutional layers. The first, second, and fifth convolutional layers are each followed by a max pooling layer, which effectively reduces the number of parameters and the complexity of network operation. The filter size of the max pooling layer was designed to be 3 × 3 pixels with a stride of 3. The third and fourth convolutional layers are directly connected. As for the fully connected layers, the first two have 4096 neurons each, and the dropout technique is added to avoid overfitting during training. The third fully connected layer has 1000 neurons, corresponding to the extraction of 1000 feature vectors F_AlexNet, where each feature vector has 256 dimensions. Each convolutional layer and fully connected layer is followed by a batch normalization layer to solve the problem of unstable output in deep neural networks.
$F_{AlexNet} = \{ F_1, F_2, \ldots, F_{1000} \}$  (1)
$F_i = ( f_1, f_2, \ldots, f_{256} )^{T}, \quad i \in [1, 1000]$  (2)
VGG16 [46] is another popular CNN architecture, consisting of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. This deep structure allows VGG16 to better capture details and complex features in images, thus improving the accuracy of feature extraction. Like AlexNet, VGG16 also has 5 sets of convolutions, but each set contains more convolutional layers, and all convolutional layers use 3 × 3 convolution kernels. Each of the first two convolutional sets contains two convolutional layers, and each of the third, fourth, and fifth sets contains three. The difference from AlexNet is that a 2 × 2 max pooling is performed after each set of convolutions in VGG16. By continuously using multiple stacked small-sized kernel filters instead of large-sized filters, VGG16 is able to learn more complex features but, at the same time, the multiple nonlinear layers increase the depth of the network and the number of parameters. Therefore, to alleviate the issue of excessive parameter count, VGG16 eliminates the batch normalization (BN) layers after the convolutional layers. Additionally, the parameters are shared between different layers, which reduces the number of parameters in the network, makes the model more compact and efficient, and helps prevent the problem of overfitting. As for the fully connected layers used for deep feature extraction, they are configured the same as in AlexNet. VGG16 gradually reduces the size of the feature map by stacking convolutional and pooling layers several times, thus producing a smooth feature representation. This representation is useful for subsequent classification tasks to better distinguish between the different classes of traffic images. The output of the VGG16 network channel is also 1000 256-dimensional feature vectors.
$F_{VGG16} = \{ F_1, F_2, \ldots, F_{1000} \}$  (3)
AlexNet and VGG16 are used in parallel and complement each other to extract sufficient spatial traffic features from images for classification. In addition to single packet features, the temporal features between the previous and subsequent packets significantly contribute to the classification of traffic. Due to the lack of memory function in CNN networks, they cannot capture the temporal features contained in the contextual relationships between packets and flows in traffic data. RNNs perform exceptionally well in handling these features. Long short-term memory (LSTM) [47] is a variation of RNN that has a time-sequential structure. It retains the advantage of RNN in capturing contextually correlated information while effectively overcoming the gradient vanishing and exploding problems. LSTM can retain the information of time sequences through internal memory units and gating mechanisms; its output at a certain time is related not only to the input of the current time but also to the input and state of the previous time. In this study, we used LSTM structures to extract these time-dimensional features and combined them with the space-dimensional features extracted by AlexNet and VGG16 as metric features to better process traffic data. The temporal features extracted by the LSTM are also 1000 256-dimensional vectors, which are defined as follows:
$F_{LSTM} = \{ F_1, F_2, \ldots, F_{1000} \}$  (4)
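A condensed Keras sketch of the three-channel extractor is shown below. The layer counts and sizes are abbreviated relative to the full AlexNet, VGG16, and LSTM configurations described above, each channel here ends in a single 256-dimensional feature vector per image, and all names are illustrative rather than taken from the authors' code:

```python
from tensorflow.keras import layers, models

def cnn_channel(x, filters, kernel):
    """One simplified convolutional channel ending in a 256-dimensional feature."""
    for f in filters:
        x = layers.Conv2D(f, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    return layers.Dense(256, activation="relu")(x)

img_in = layers.Input(shape=(32, 32, 1))          # preprocessed traffic image
seq_in = layers.Reshape((32, 32))(img_in)         # same bytes viewed as a 32-step sequence

feat_alex = cnn_channel(img_in, filters=[32, 64], kernel=5)        # AlexNet-like, 5 x 5 kernels
feat_vgg = cnn_channel(img_in, filters=[32, 64, 128], kernel=3)    # VGG-like, 3 x 3 kernels
feat_lstm = layers.Dense(256, activation="relu")(layers.LSTM(128)(seq_in))  # temporal channel

extractor = models.Model(img_in, [feat_alex, feat_vgg, feat_lstm])
extractor.summary()
```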

3.2.2. Feature Selection and Fusion

The deep neural networks extract a large number of features, which may contain many redundant elements. Given that our research focuses mainly on encrypted traffic, which contains more redundant information than normal traffic, this redundancy affects the accuracy of subsequent classifiers, so it is necessary to select among the extracted deep features.
We adopted the minimum redundancy maximum relevance (mRMR) algorithm [48] to select the features that have the least redundancy and the most impact on the target from a large number of features. It evaluates the importance of each feature by calculating its correlation with the target variable and its independence from the other features. Finally, based on the evaluation results, the features are selected from high to low until the required number of features is reached. In other words, this method can effectively balance relevance and redundancy to sort the feature set. The specific calculation procedure for feature selection using the mRMR algorithm is as follows:
In this study, we used mutual information as a criterion to measure the redundancy between features and the correlation between features and class variables, which was calculated using Equation (5), where X and Y are two different variables; p_1(x) and p_2(y) are their respective marginal probability distribution functions; and p(x, y) is their joint probability distribution function.
$I(x, y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p_1(x) p_2(y)}$  (5)
Our goal is to find a feature subset S of the deep feature set F. First, we initialize a set S. For each feature F_i in F, we calculate its correlation W with the target category c using the mutual information I(F_i, c), as shown in Equation (6). The redundancy is calculated using the average of the mutual information between features, as shown in Equation (7).
$W = \frac{1}{|F|} \sum_{F_i \in F} I(F_i, c)$  (6)
$R = \frac{1}{|F|^2} \sum_{F_i, F_j \in F} I(F_i, F_j)$  (7)
We want the correlation W between features and classes to be as large as possible and the redundancy R between features to be as small as possible. Therefore, the features can be sorted by a simple combination of the two conditions, as indicated in Equation (8). The features that satisfy the following formula are selected and stored in the set S.
$\max \{ W - R \}$  (8)
We keep repeating the above steps until the set S has the required number of elements, or the loop ends when S contains all the elements of F.
In our proposed architecture, 1000 features of each deep neural network channel are extracted as the input to the mRMR algorithm, and the number is reduced to 100 after filtering. The selected feature representations are shown in Equations (9)–(11).
$F_{AlexNet} = \{ F_1, F_2, \ldots, F_{100} \}$  (9)
$F_{VGG16} = \{ F_1, F_2, \ldots, F_{100} \}$  (10)
$F_{LSTM} = \{ F_1, F_2, \ldots, F_{100} \}$  (11)
These selected features are next merged and stored in the feature database. The fused feature vector F_Fusion is a simple concatenation of the selected vectors of the three parallel channels AlexNet, VGG16, and LSTM and is represented as follows:
$F_{Fusion} = F_{AlexNet} \oplus F_{VGG16} \oplus F_{LSTM}$  (12)
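The greedy selection of Equations (6)–(8) and the concatenation of Equation (12) can be sketched as follows, using sklearn's mutual information estimators as stand-ins for the terms I(F_i, c) and I(F_i, F_j); feature counts are reduced for speed, and all names are illustrative:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k=100):
    """Greedy mRMR: repeatedly pick the feature maximizing relevance minus redundancy."""
    relevance = mutual_info_classif(X, y)                 # I(F_i, c) for every feature
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        scores = []
        for j in remaining:
            red = (np.mean([mutual_info_regression(X[:, [s]], X[:, j])[0]
                            for s in selected]) if selected else 0.0)
            scores.append(relevance[j] - red)             # max{W - R}, Equation (8)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X_alex, y = rng.normal(size=(200, 50)), rng.integers(0, 6, 200)   # dummy channel features
idx = mrmr_select(X_alex, y, k=5)                         # 100 per channel in the paper
F_alex = X_alex[:, idx]
# F_fusion = np.hstack([F_alex, F_vgg, F_lstm])           # Equation (12): simple concatenation
```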

3.3. Unknown Traffic Recognition

A trained SVM multi-classifier is used to recognize the extracted and fused features from the previous step. We first construct the SVM classifier with N + 1 classes. If a feature belongs to one of the N known traffic classes in the training set, it is classified into that specific traffic class; otherwise, it is classified as unknown traffic and enters the unknown traffic identification process. The purpose of this process is to separate known traffic from unknown traffic to improve the accuracy of known traffic identification.
Unlike the feature distribution pattern of known traffic, unknown traffic has certain similarities among similar features, but its spatial distribution is relatively random, and the shape of clusters is irregular, so it is obviously not feasible to classify unknown features using a fixed distance alone. In order to better adapt to the characteristics of the distribution of unknown traffic features, we designed a density-ratio-based clustering algorithm that can classify unknown traffic in real time. The execution process of the algorithm is shown in Algorithm 1.
In our scenario, clustering methods that require setting the number of clusters in advance are not applicable because the unknown features do not have labels, and the number of categories of the unknown traffic is unknown. Density-based clustering algorithms can identify clusters of any shape and size in datasets with noise, but due to the use of a global density threshold, it is often difficult to identify all clusters in datasets with large density variations. We modified density-based clustering using the density ratio, which is the ratio of the density of a core point to the density of its neighborhood. In the algorithm proposed in this section, we group unknown features with a sufficient density ratio into an unknown class. The feature points located in a very small neighborhood ϵ of a feature f_Ui are defined as N_ϵ(f_Ui); those in a larger neighborhood μ are denoted N_μ(f_Ui) in the same way; and the ratio of the two counts is the density ratio |N_ϵ(f_Ui)| / |N_μ(f_Ui)|.
Algorithm 1: Unknown traffic classification
When the density ratio of a feature is greater than the set density threshold τ, a new unknown class C is created for it. Feature points that meet the requirement are selected with the same density-ratio screening method within the neighborhood of the feature f_Ui and added to the newly created unknown class. That is to say, an unknown class is the largest set of features connected by the density ratio.
If the neighborhood density ratio of the unknown feature does not reach the specified threshold, its distance to the average feature of each known traffic category is compared. If it is less than the distance threshold T, the feature is judged to be an unknown feature of a known class, and the corresponding class label Class_l is output; otherwise, it is output as a noise point N.
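Since Algorithm 1 appears only as a figure in the original layout, the following rough sketch restates the per-feature decision rule described above in Python; the neighborhood radii, thresholds, and helper names are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def density_ratio_assign(f, unknown_feats, known_means, eps, mu, tau, T):
    """Assign one unknown feature f: create/extend an unknown class when the
    density ratio |N_eps(f)| / |N_mu(f)| reaches tau; otherwise fall back to the
    nearest known class mean within distance T, or mark the feature as noise."""
    d = np.linalg.norm(unknown_feats - f, axis=1)
    n_eps = np.sum(d <= eps)                  # neighbors in the small neighborhood
    n_mu = max(np.sum(d <= mu), 1)            # neighbors in the larger neighborhood
    if n_eps / n_mu >= tau:
        return "new_unknown_class"
    dists = {label: np.linalg.norm(f - mean) for label, mean in known_means.items()}
    label, dist = min(dists.items(), key=lambda kv: kv[1])
    return label if dist <= T else "noise"
```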

3.4. Incremental SVM Classifier

The features of the unknown traffic detected by the algorithm are also stored in the feature database, which poses higher demands on our classifier. It needs to gradually learn from the updated feature library, becoming a new multi-classifier capable of recognizing a wider range of application types.
The essence of SVM is the estimation of the optimal decision function. In an N-class classification problem, the goal is to find the optimal N − 1 hyperplanes. We first consider the classification of known traffic using SVM. The mathematical description of the known feature database F is shown in Equation (13), which contains n features, each with a label belonging to one of the N traffic classes. Equation (14) gives the optimization objective function of a multi-class SVM based on the Lagrange equation, where w^l is a one-versus-rest hyperplane vector, b is the bias, e_i^l is a slack variable, Φ(·) denotes the kernel function, and ψ denotes a penalty factor.
$F = f_{Known} = \{ (f_{K_i}, y_{K_i}) \mid i \in [1, n],\ y_{K_i} \in \{ Class_1, Class_2, \ldots, Class_N \} \}$  (13)
$\min_{w, b, e} W(w, b, e) = \sum_{l=1}^{N} \frac{\| w^{l} \|^{2}}{2} + \psi \sum_{i=1}^{n} \sum_{l=1}^{N} ( e_{i}^{l} )^{2} \quad \text{s.t.} \quad y_i \left[ ( w^{l} )^{T} \Phi(x_i) + b^{l} \right] = 1 - e_{i}^{l}$  (14)
In the current state s, the weight vector w s and bias vector b s of the hyperplane are described in Equations (15) and (16), respectively.
$w_s = [ w_s^1, w_s^2, \ldots, w_s^N ]$  (15)
$b_s = [ b_s^1, b_s^2, \ldots, b_s^N ]$  (16)
When the state comes to s + 1, it indicates that the features of a new traffic category have been added to the feature database; at this time, the feature database is updated to F′, and the number of features changes to N′. After the feature library is updated, the features of the existing categories in the previous state may increase, so the hyperplanes of the classifier need to be adjusted. Moreover, a new classification plane should be added in order to learn the features of the new category. The weight vectors and bias parameters of the hyperplanes in state s + 1 are updated as given in Equations (17)–(20).
$w_{s+1}^{l} = w_{s}^{l} + \Delta w$  (17)
$b_{s+1}^{l} = b_{s}^{l} + \Delta b$  (18)
$w_{s+1} = [ w_{s+1}^{1}, w_{s+1}^{2}, \ldots, w_{s+1}^{N} ] \oplus w_{s+1}^{N+1}$  (19)
$b_{s+1} = [ b_{s+1}^{1}, b_{s+1}^{2}, \ldots, b_{s+1}^{N} ] \oplus b_{s+1}^{N+1}$  (20)
In summary, the optimization function of the incremental multi-class SVM is improved as shown in Equation (21), where ξ_{s+1} is the weight of the old classification plane, and λ_{s+1} is the knowledge weight of the old feature base. The classifier is able to learn new knowledge based on the classification of known types of traffic using the existing model.
$\min_{w, b, e} W(w_{s+1}, b_{s+1}, e_{s+1}) = \sum_{l=1}^{N} \frac{\| w_{s+1}^{l} - w_{s}^{l} \|^{2}}{2} + \sum_{l=1}^{N} \frac{\| w_{s+1}^{N+1} - \xi_{s+1} w_{s}^{l} \|^{2}}{2} + \psi_{s+1} \Big( \sum_{r=1}^{N'} e_{s+1}^{r,l} \Big)^{2} + \lambda_{s+1} \sum_{i=1}^{N} ( e_{s+1}^{i,l} )^{2}$  (21)
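Equation (21) itself is specific to the authors' incremental SVM; as a rough stand-in for its warm-start behavior, the sketch below uses a linear SVM trained with hinge loss and `partial_fit`, so that existing hyperplanes are adjusted and a newly discovered class is learned without retraining from scratch. This is an approximation under the stated assumptions, not the paper's exact formulation (one practical difference is that `partial_fit` needs all possible labels declared up front):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
known_classes = [0, 1, 2]
new_class = 3                                    # label produced by the clustering step

# Hinge loss makes SGDClassifier behave like a linear SVM and allows incremental updates.
clf = SGDClassifier(loss="hinge")
X0, y0 = rng.normal(size=(300, 300)), rng.integers(0, 3, 300)   # dummy fused features
clf.partial_fit(X0, y0, classes=known_classes + [new_class])

# Later, features of the newly detected class arrive from the feature database;
# the existing model is updated in place instead of being retrained from scratch.
X1, y1 = rng.normal(loc=2.0, size=(50, 300)), np.full(50, new_class)
clf.partial_fit(X1, y1)
```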

4. Evaluation

4.1. Datasets for Evaluation

In real networks, new applications frequently produce unknown flows, particularly when encryption is involved. We used two publicly accessible datasets, ISCXVPN2016 and ISCXTor2016, and extracted some of the data to construct the dataset ISCX-VPN-Tor as experimental datasets to evaluate the performance of the proposed approach under different encrypted traffic scenario tasks. In the ISCXVPN2016 dataset, each class of traffic is encapsulated by regular encryption and VPN protocols. The traffic in the ISCXTor2016 dataset is encrypted by Tor technology. Both datasets were published by the University of New Brunswick, who created accounts to run a number of representative applications (such as facebook, skype, spotify, and gmail) on virtual machine workstations that were connected to the Internet through a gateway virtual machine, which in turn routed all traffic through the Tor network or VPN tunnel. Flows were generated from PCAP files captured at the gateway and labeled according to the applications executing on the workstations. The traffic in each dataset includes six application categories, namely Email, Chat, Streaming, File Transfer, VoIP, and P2P. Therefore, a total of 12 application categories of traffic are included in the experimental dataset, as detailed in Table 1. Dataset ISCX-VPN-Tor represents real-world VPN tunnel and Tor network traffic to some extent.
Another intrusion detection dataset, NSL-KDD, was used to validate the detection performance of the proposed method for unknown attacks. NSL-KDD contains over 96,000 flows, including normal traffic and 22 attack patterns falling into four categories, namely DoS attacks, probe attacks, U2R attacks, and R2L attacks, as detailed in Table 2. This dataset was synthesized by a traffic generator, which is not fully representative of the traffic in a real environment, but it has become one of the relatively authoritative intrusion detection datasets in the field of network security due to its suitability for studying the characteristics of various attack classes. From the flow counts of the various traffic categories in NSL-KDD, it can be seen that the dataset suffers from a traffic imbalance problem: normal samples account for the majority of the flows, DoS dominates the attack traffic, and U2R attacks account for only a small percentage. This imbalance is a problem we needed to consider.
In addition, we constructed a self-collected dataset, SelfDataset, by capturing traffic generated by commonly used scientific software as normal traffic and malicious traffic from malicious samples running in a sandbox. These malicious samples were also captured in real-world scenarios. A demo of the SelfDataset was used for conducting experiments, with a size of 4.18 GB, as shown in Table 3.
Some applications in these datasets were chosen as the unknown classes, and the rest of the datasets formed the known classes. We used 80% of the known samples to train the model and 10% of the known samples as the validation set to adjust the parameters of the model through the deviation from ground truth and improve its prediction performance. Furthermore, the remaining 10% of the known samples and all unknown samples were used as the test set to evaluate the proposed method.
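The 80/10/10 split of the known samples can be reproduced with a standard utility; a minimal sketch with dummy stand-ins for the fused features (the unknown samples are simply appended to the test set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_known, y_known = rng.normal(size=(1000, 300)), rng.integers(0, 10, 1000)

X_train, X_rest, y_train, y_rest = train_test_split(
    X_known, y_known, train_size=0.8, stratify=y_known, random_state=42)
X_val, X_test_known, y_val, y_test_known = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
# The final test set is X_test_known plus all samples of the unknown classes.
```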

4.2. Experiment Settings

The testing and comparative experiments of the model were compiled with the support of an Nvidia Tesla T4 GPU. They were run on an Ubuntu 20.04 LTS operating system and implemented using Python 3.6.5. The main modules used in this experiment were TensorFlow 1.14.0 and sklearn 1.0.2.
The workflow of the model is as follows: Python captures and parses the network traffic. The parsed traffic is collected by Kafka and sent to the pre-trained feature extraction and fusion module for processing. The results are then stored in a MySQL server as a feature database. The SVM classifier retrieves the traffic features from the MySQL database for classification. Unknown traffic features are clustered using the density-ratio-based clustering algorithm to form new categories, and labeled data are stored in the MySQL database. During this process, the classifier dynamically adjusts and updates. The classification results of the traffic can be visualized through a frontend interface. The pretraining of the model is completed on a GPU server, and the online model is deployed in Docker using TensorFlow Serving.
Pretrained deep models are often adapted to a new task by means of transfer learning. The parameter values of the pretrained neural networks used in the experiments are given in Table 4. Each neural network had an input image size of 32 × 32, and the models were optimized using stochastic gradient descent (SGD) with a momentum of 0.9, a decay of 10^−4, a mini-batch size of 32, and a learning rate of 0.001. The pretraining process of the model was completed with high accuracy. However, the validation process of the pretrained model showed an oscillating loss, indicating that the model could not fully converge; that is, the pretrained model could not fully learn the local features of the traffic. Figure 3 shows how the loss function changes during pretraining. During the training process, the loss function gradually decreased, reaching a slower convergence speed after 80 steps, and it tended to converge after 200 steps. This demonstrates that the model has a good training speed.
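The optimizer settings listed above translate directly into Keras; a minimal sketch matching the stated hyperparameters (the commented lines assume an `extractor` model like the one sketched in Section 3.2 and are illustrative only):

```python
from tensorflow.keras.optimizers import SGD

# SGD with momentum 0.9, learning-rate decay 1e-4, and learning rate 0.001;
# the mini-batch size of 32 is passed to model.fit separately.
optimizer = SGD(lr=0.001, momentum=0.9, decay=1e-4)   # TF 1.x-style keyword, matching the paper's setup
# extractor.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# extractor.fit(x_train, y_train, batch_size=32, validation_data=(x_val, y_val))
```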
In order to measure the performance of the models, accuracy (Acc), true positive rate (TPR), false positive rate (FPR), FTF, and F1 score (F1) metrics derived from the confusion matrix were used, and the formulations of the metrics are described in Equations (22)–(26). In the experiment, the unknown traffic flows were considered positive samples. Thus, TP (true positive) and FN (false negative) in the equations are the numbers of unknown flows that were correctly determined to be unknown and incorrectly determined to be known, respectively; and TN (true negative) and FP (false positive) are the numbers of known flows that were correctly identified as known and incorrectly identified as unknown, respectively. As can be seen from the equations, Acc characterizes the proportion of all flows that are correctly classified, TPR indicates the proportion of all unknown flows that are correctly predicted as unknown, and FPR indicates the proportion of all known flows that are incorrectly predicted as unknown. FTF takes both TPR and FPR into account, and F1 is a metric that combines the precision and recall rate of a model. The kappa coefficient is also a measure of model consistency based on a confusion matrix; it is calculated as shown in Equation (27), where p_o represents the overall classification accuracy, and p_e refers to the expected agreement rate. To evaluate the performance of different classification algorithms, we also plotted ROC curves and compared their AUC values.
$Acc = \frac{TP + TN}{TP + FP + FN + TN}$  (22)
$TPR = \frac{TP}{TP + FN}$  (23)
$FPR = \frac{FP}{FP + TN}$  (24)
$FTF = \frac{TPR}{1 + FPR}$  (25)
$F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$  (26)
$kappa = \frac{p_o - p_e}{1 - p_e}$  (27)
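Equations (22)–(27) can be evaluated directly from the binary known-versus-unknown confusion matrix; a minimal sketch (the counts used in the example are arbitrary):

```python
def open_set_metrics(tp, fp, fn, tn):
    """Acc, TPR, FPR, FTF, and F1 from Equations (22)-(26); unknown flows are the positive class."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    ftf = tpr / (1 + fpr)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, tpr, fpr, ftf, f1

def kappa(tp, fp, fn, tn):
    """Cohen's kappa (Equation (27)) for the same binary confusion matrix."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(open_set_metrics(tp=80, fp=10, fn=20, tn=90), kappa(80, 10, 20, 90))
```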

4.3. Selection of Feature Number and Threshold

The number of features extracted per neural network channel in the feature extraction and fusion module and the threshold value in the unknown traffic classification algorithm are significant parameters that affect the performance of our model. In this section, we discuss how to select the values of the two parameters.
We first considered the effect of the number of features extracted by the mRMR algorithm on the recognition performance, and the results are shown in Figure 4. We set the number of features extracted from each neural network channel to be the same, so the number of features after fusion is a multiple of 3. As can be seen from Figure 4, the accuracy results are best when each neural network extracts and selects the optimal first 100 features with the mRMR algorithm, that is, when the number of fused features is 300. When the number of fused features exceeds 360, the detection accuracy of the model gradually decreases; we attribute this to feature redundancy in encrypted traffic, as an excessive number of features can interfere with traffic classification.
The threshold value in the unknown traffic detection algorithm also has an important impact on model performance. As for the threshold selection problem, we conducted the following experiments.
We calculated the distribution of the distance between the features extracted from all classes of traffic and the average known features, as shown in Figure 5, where the red bars represent known features, the blue bars represent unknown features, the vertical coordinates show the number of features, and the horizontal coordinates indicate the Euclidean distance from the average known features. It can be seen that the known features are close, whereas most of the unknown features are farther away, and the two features can be better distinguished when the distance is between 1.0 and 1.5. Therefore, we chose a threshold value of 1.3.
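The statistic behind Figure 5 is the Euclidean distance between each feature vector and the mean of the known features; a short sketch of how the chosen threshold of 1.3 would be applied (array names are illustrative):

```python
import numpy as np

def distance_to_known_mean(features, known_features):
    """Euclidean distance from each feature vector to the average known feature."""
    mean_known = known_features.mean(axis=0)
    return np.linalg.norm(features - mean_known, axis=1)

THRESHOLD = 1.3   # chosen from the distance distribution in Figure 5
# is_unknown = distance_to_known_mean(test_feats, known_feats) > THRESHOLD
```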

4.4. Necessity of Multiple-Channel and Feature Fusion

In order to demonstrate the effectiveness of the multiple-channel architecture and the necessity of the feature fusion method, we conducted 10 experiments on the extracted test set using different architectures. The compared architectures included single-channel AlexNet, single-channel VGG16, single-channel LSTM, and three-channel parallel network structures, where each single-channel network is directly connected to the classifier without feature selection and fusion. The multiple-channel structure is further divided into two types: with and without the feature fusion algorithm. The average test accuracy and average test time results for the five different approaches are shown in Figure 6.
The experimental results showed that, compared with the multi-channel structure, the three single-channel architectures consume less time, but their test accuracy is also very low, below 0.75. This is because the single-channel structures are simpler, and the effective features they extract are limited. The single-channel structures produced varying results: AlexNet consumes less time; VGG16, with a deeper network, can extract more features, thereby improving accuracy; and LSTM does not have the efficient feature extraction capability of CNNs, so its accuracy is low when used alone. The multiple-channel structure can combine the features extracted by different deep neural networks, including spatial and temporal features, significantly improving accuracy, but it takes more time than the single-channel structures. Since the features extracted by different channels may have different degrees of relevance to the classification task, and there may be redundant information between them, a large number of features can affect the time performance and accuracy of the overall model. Therefore, we selected and fused the features from both correlation and redundancy perspectives. After adding the feature selection and fusion algorithms, the average test accuracy of the multi-channel structure improved by more than 10%, reaching 0.942, while the time consumption only increased by 0.87 seconds. Considering both accuracy and time consumption, we think that using a multiple-channel structure and a feature fusion algorithm is necessary.

4.5. Comparison Experiments

To demonstrate the effectiveness and robustness of the proposed method, experiments were conducted under the five scenarios detailed in Table 5.
  • Scenario A: All unknown traffic in the test dataset has a completely different type of application service than the known traffic in the training dataset.
  • Scenario B: The unknown traffic in the test dataset has similar application service types as the known traffic in the training dataset.
  • Scenario C: All unknown attacks in the test dataset have completely different attack mechanisms than the known attacks in the training dataset.
  • Scenario D: The unknown attacks in the test dataset have similar attack patterns to the known attacks in the training dataset.
  • Scenario E: All unknown normal and malicious traffic in the test dataset has a completely different type of application service than the known traffic in the training dataset.
Figure 7 and Figure 8 show the normalized confusion matrix of the proposed MI-UTR model for the classification results of known and unknown traffic for each of the five scenarios. The vertical coordinate of the confusion matrix is the true label of the traffic, and the horizontal coordinate is the predicted label from the classifier, where the darker the color in the matrix, the higher the probability of correct classification. We calculated the accuracy, misclassification rate, and k a p p a coefficient for each scenario based on the confusion matrix, and these values are indicated below the matrix.
In addition to our proposed MI-UTR model, the other methods used for comparison were as follows:
  • CNN-GR [49] employs a standard CNN architecture consisting of three convolutional layers and one fully connected layer to identify unknown traffic through the gradient of the first backpropagation.
  • Open-CNN [30] model adds an OpenMax layer to the standard CNN model with two convolutional layers and two fully connected layers. The OpenMax layer improves upon the SoftMax layer by adding an extra class “unknown” to the activation vectors of SoftMax to detect the unknown attack.
  • HMCD [50] uses the WGAN-GP generation algorithm to enhance the data and a hybrid neural network based on CNN and LSTM to extract the hierarchical spatial–temporal features of the traffic, which can effectively detect unknown HTTP-based malicious communication traffic.
We employed a simple CNN architecture as a benchmark for comparison with the aforementioned unknown traffic classification algorithms. The comparison results are shown in Table 6, where the optimal results are in bold font.
As a whole, our proposed MI-UTR achieved the highest accuracy and outperformed all the other methods under four different scenarios, reflecting excellent accuracy both in classifying the known classes and in rejecting the unknown classes. This result demonstrates that MI-UTR classifies network traffic in the open world better than the other three methods. As for the trade-off between TPR and FPR, MI-UTR achieved better FTFs than the other methods. Although the FPRs of HMCD and CNN-GR were lower than those of the other two methods, their TPRs were lower as well; that is, HMCD and CNN-GR both performed poorly in recognizing the unknown samples. Additionally, the TPRs of Open-CNN were extremely high, but its FPR was also high, resulting in low FTFs, which indicates that Open-CNN misclassifies numerous known samples when rejecting unknown samples. Together, the results demonstrate that MI-UTR performs best at identifying unknown samples while misclassifying as few known samples as possible.
To comprehensively compare the precision performance of these classifiers in different scenarios, we plotted their respective ROC curves and calculated the AUC values, as shown in Figure 9.
In Section 4.4, we discussed the accuracy and time consumption of the multi-channel and single-channel architectures. To demonstrate the superiority of our proposed method over other unknown traffic detection algorithms, it is important to compare not only the accuracy but also the time consumption of the different models. From Figure 10, it can be observed that although the three datasets used have different sizes, the processing rate of each algorithm remains relatively constant. CNN-GR and Open-CNN are both optimized versions of the basic CNN architecture that enable the recognition of unknown traffic; as the overall structural changes are not significant, there is little difference in time consumption among CNN, CNN-GR, and Open-CNN. Because it uses generative algorithms to augment the data, HMCD has more complex parameters and structures, resulting in significantly higher time consumption than the other algorithms. Our proposed MI-UTR algorithm exhibits slightly higher time consumption than the CNN-based architectures in Scenario A and Scenario B; in the other scenarios, their time consumptions are similar. Overall, our method achieves higher accuracy at a relatively small time cost, demonstrating its superiority in terms of both accuracy and efficiency.

4.6. The Percentage of Unknown Traffic Classes

In order to further demonstrate that the proposed method performs well with different numbers of unknown classes, the following experiments were performed. We introduced the concept of openness [51] to characterize the ratio of unknown traffic to known traffic in the testing dataset. Openness is mathematically defined as:
$$ \mathit{openness} = 1 - \sqrt{\frac{2 \times C_T}{C_R + C_E}} $$
where C_T is the number of the known classes used in the training dataset, C_E is the number of all traffic classes in the testing dataset, and C_R is the number of traffic classes to be recognized. In this experiment, C_R was equal to C_E because of the need to classify all traffic classes in the testing dataset.
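As a quick numerical illustration of this definition (assuming the standard square-root form of openness from the open-set recognition literature [51]; the helper below is ours, not code from the classifier):

```python
# Hedged sketch: openness as defined above, with C_R defaulting to C_E.
import math

def openness(c_t, c_e, c_r=None):
    """c_t: known classes in training; c_e: all classes in testing;
    c_r: classes to be recognized (equal to c_e in our experiments)."""
    c_r = c_e if c_r is None else c_r
    return 1.0 - math.sqrt(2.0 * c_t / (c_r + c_e))

# e.g., training on 4 known classes while 6 classes appear at test time:
# openness(4, 6) ≈ 0.18
```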
As can be seen from the above equation, openness reflects the amount of knowledge available for training the detection model. An openness of 0 indicates a completely known classification problem, while a high openness means that less knowledge of the known classes is available for training, making it more difficult to distinguish known traffic from unknown traffic. To investigate the effect of openness on model performance, we conducted experiments under different openness settings. Figure 11, Figure 12 and Figure 13 show the performance of several unknown traffic classification methods mentioned in the previous section under different openness values, measured by Acc, FTF, and F1.
From these figures, it can be seen that our proposed method significantly outperformed the other three methods at high openness. Specifically, the accuracy tended to decrease as the openness increased in all four scenarios, but the decreasing trend of our proposed model was slower, and its accuracy even increased slightly in Scenario C. In other words, while the accuracies of the four methods did not differ much at low openness, the accuracy of our proposed model at higher openness was much higher than that of the other methods, especially for the detection of unknown attacks in Scenario C and Scenario D. In the test of the FTF metric, our method achieved the best performance at all openness values in the four scenarios, which also means that the method is able to balance the true positive rate for unknown traffic against the false positive rate for known traffic. As for the comparison of F1 scores, the scores of OPEN-CNN and of our proposed MI-UTR in Scenario A were relatively close across openness values; OPEN-CNN performed better at low openness in Scenario B, but its performance decreased rapidly with increasing openness, and the CNN-GR model in Scenario C exhibited a similar problem. The MI-UTR model in Scenario D maintained an advantage at all openness values. In summary, our proposed MI-UTR model is robust across the various scenarios and openness values.

5. Conclusions and Further Work

A novel strategy for detecting unknown traffic using incremental learning was developed in this study. The approach implements an mRMR-based multiple-channel parallel neural network to select the best features from both the temporal and spatial dimensions, which are then fed to the classifier to address the redundancy problem of encrypted traffic features. By learning from the feature database, the classifier can immediately classify the features of the known traffic into specific application categories.
Through the density-ratio-based clustering algorithm, unknown traffic features are either assigned to a newly created unknown traffic class, classified as new features of a known traffic category, or treated as noise. The clustered unknown traffic features are then incrementally updated into the feature database. The results of experiments on several publicly available datasets demonstrated that our model significantly outperforms existing approaches in a number of application scenarios, including the classification of encrypted traffic in VPN tunnels and Tor applications, as well as intrusion detection for unknown traffic classification tasks.
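One plausible reading of the density-ratio rule summarized above is sketched below: a candidate feature whose small-radius neighborhood is sparse relative to its larger-radius neighborhood is treated as noise, and the remaining features become candidates for new unknown classes. The radii, threshold, and labels are illustrative assumptions, not the exact procedure of our clustering module.

```python
# Hedged sketch of a density-ratio test: compare neighbor counts inside a small
# radius eps and a larger radius mu; low ratios indicate isolated (noise) points.
# Parameters and labels are illustrative assumptions, not the paper's settings.
import numpy as np

def density_ratio_labels(features, eps=0.5, mu=2.0, tau=0.3):
    F = np.asarray(features, dtype=float)
    d = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)   # pairwise distances
    n_eps = (d <= eps).sum(axis=1)                               # |N_eps(f)|
    n_mu = (d <= mu).sum(axis=1)                                 # |N_mu(f)|
    ratio = n_eps / np.maximum(n_mu, 1)
    return np.where(ratio >= tau, "candidate_unknown", "noise")
```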
In this section, we also want to highlight some of the limitations of our existing work as well as future work. In the experimental environment, our model was capable of processing traffic at a speed of several hundred megabits per second, indicating its good traffic analysis capabilities in medium-sized networks. Furthermore, due to its incremental learning feature, it exhibits strong online processing capabilities and scalability. However, in larger-scale networks, the processing speed and memory requirements of the model may become a bottleneck. To address this, one approach is to incorporate a sliding window mechanism to control the traffic rate. Additionally, further methods to improve speed include deploying a distributed database, employing higher-performance hardware such as high-speed parsers, and implementing the model in high-performance languages such as C++.
We employed a class-incremental learning (CIL) SVM classifier for active learning to update the existing traffic classifiers in this study. A key criterion for CIL is to strike a balance between stability and plasticity: the model should be stable enough to retain previous class knowledge while remaining plastic enough to learn the concepts within new classes. The experimental results showed that the misclassification rate and false positive rate for known traffic are both very low, indicating that the classifier sufficiently learned the known traffic categories and exhibits strong stability on known classes. Regarding the learning of unknown traffic, the misclassification rate was also relatively low in the experiments, which may be attributable to the effectiveness of the traffic representation. However, it is important to consider that the traffic categories in these datasets are limited to fewer than 20 classes, whereas in real-world scenarios the number of unknown traffic categories is likely to be much larger. When dealing with a larger number of traffic categories, the risk of errors by the CIL SVM classifier may increase. To mitigate this risk, strategies such as rehearsal techniques, regularization methods, and episodic memory can be employed to preserve and consolidate knowledge of previously learned classes. Further validation will be conducted when deploying the model in a real production environment, and balancing stability and plasticity will remain a key optimization direction for the incremental classifier.
Another important limitation is the generalizability of our method. Due to the variability of content and behavior of Internet services, encrypted traffic exhibits significant changes over time, and known application categories may present new patterns and features. Our model may classify the unknown features of these known traffic categories as unknown traffic, leading to a high false negative rate. Therefore, there is a need for more adaptive and dynamic methods to keep up with the evolving nature of encrypted traffic.
Furthermore, the robustness of our approach to adversarial attacks and evasion techniques is an area that requires further investigation. Attackers are constantly evolving their tactics to evade detection, and it is essential to develop defences that can effectively detect and mitigate these advanced evasion techniques. We plan to further investigate the model in the future in an effort to strengthen defences against attacks that use poisoned samples.
In conclusion, although our method has demonstrated promising results, there are significant limitations that need to be addressed. Future research efforts should focus on developing adaptive and dynamic methods, tackling scalability challenges, and improving the robustness of the classification process against adversarial attacks. By addressing these limitations, we can further advance the field of traffic classification and contribute to the development of more effective and efficient solutions for network security.

Author Contributions

J.L. designed and developed the model and wrote the original draft; J.W. conceived the experimental ideas and reviewed the original draft; T.Y., F.Q. and G.C. helped with data analysis and constructive discussions. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by the Xiejialin Project of Institute of High Energy Physics under grant no. E25467U2 and the specialized project for cybersecurity and informatization in the 14th Five-Year Plan of CAS under grant no. WX145XQ12.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The public dataset used to support the findings of this study can be found at: ISCXVPN2016: https://www.unb.ca/cic/datasets/vpn.html (accessed on 1 May 2023), ISCXTor2016: https://www.unb.ca/cic/datasets/tor.html (accessed on 1 May 2023), NSL-KDD: https://www.unb.ca/cic/datasets/nsl.html (accessed on 1 May 2023). The SelfDataset supporting the results of this study is not publicly available because it contains content that may relate to subsequent unpublished studies.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Variable | Description
F_AlexNet, F_VGG16, F_LSTM | Feature vectors extracted from the neural networks AlexNet, VGG16, and LSTM, respectively.
F_1, F_2, …, F_1000 | The 1000 specific feature vectors extracted by each neural network.
f_1, f_2, …, f_256 | Different dimensions of each specific feature vector, for a total of 256 dimensions.
I(x, y) | Mutual information of x and y.
x, y | Two different feature vectors.
X, Y | Two different feature vector categories.
p_1(x), p_2(y) | Respective marginal probability distribution functions of x and y.
p(x, y) | Joint probability distribution function of x and y.
W | Correlation coefficient of features and categories.
F | A feature set.
S | A feature subset of F.
c | Target category.
R | Redundancy coefficient between two features.
F′_AlexNet, F′_VGG16, F′_LSTM | Feature vectors selected from F_AlexNet, F_VGG16, and F_LSTM by the mRMR algorithm, respectively.
F′_1, F′_2, …, F′_100 | The 100 specific feature vectors selected from each neural network.
F_Fusion | Fused vector after the filtering of the three channels.
f_Known | Set of known features in the feature database.
f_K1, f_K2, …, f_Kn | Known features.
y_K1, y_K2, …, y_Kn | Labels of known features.
n | Number of known features in the feature database.
f_Unknown | Set of unknown features to be recognized.
f_U1, f_U2, …, f_Um | Unknown features.
m | Number of unknown features to be recognized.
ε | Small neighborhood radius.
μ | Larger neighborhood radius.
N_ε(f) | ε neighborhood of feature f.
N_μ(f) | μ neighborhood of feature f.
τ | Density ratio threshold.
T | Distance threshold.
C | Set of known class labels.
Class_1, Class_2, …, Class_N | Known class labels.
N | Number of known traffic categories.
C′ | New unknown class labels.
Noi | Noise.
F | Feature database.
W | Objective function of the multi-class SVM.
w | One-versus-rest hyperplane vector.
b | Bias vector.
e | Slack variable.
Φ(·) | Kernel function.
ψ | Penalty factor.
s | Current state variable.
F′ | Updated feature database.
N′ | Number of updated traffic categories.
ξ_(s+1) | Weight of the old classification plane.
λ_(s+1) | Knowledge weight of the old feature database.
Acc | Accuracy.
TPR | True positive rate.
FPR | False positive rate.
FTF | A metric that combines TPR and FPR.
F1 | A metric that combines precision and recall.
TP | True positive.
FP | False positive.
TN | True negative.
FN | False negative.
kappa | A metric for consistency.
p_o | Overall classification accuracy.
p_e | Expected agreement rate.
AUC | Area under the ROC curve.
Openness | Ratio of unknown traffic to known traffic in the testing dataset.
C_T | Number of known classes used in the training dataset.
C_E | Number of all traffic classes in the testing dataset.
C_R | Number of traffic classes to be recognized.

References

  1. Cvitić, I.; Peraković, D.; Periša, M.; Gupta, B. Ensemble machine learning approach for classification of IoT devices in smart home. Int. J. Mach. Learn. Cybern. 2021, 12, 3179–3202. [Google Scholar] [CrossRef]
  2. Reddy, D.K.; Behera, H.S.; Nayak, J.; Vijayakumar, P.; Naik, B.; Singh, P.K. Deep neural network based anomaly detection in Internet of Things network traffic tracking for the applications of future smart cities. Trans. Emerg. Telecommun. Technol. 2021, 32, e4121. [Google Scholar] [CrossRef]
  3. D’Alconzo, A.; Drago, I.; Morichetta, A.; Mellia, M.; Casas, P. A survey on big data for network traffic monitoring and analysis. IEEE Trans. Netw. Serv. Manag. 2019, 16, 800–813. [Google Scholar] [CrossRef] [Green Version]
  4. Bhargavan, K.; Cheval, V.; Wood, C. A Symbolic Analysis of Privacy for TLS 1.3 with Encrypted Client Hello. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 7–11 November 2022; pp. 365–379. [Google Scholar]
  5. Kigerl, A. Routine activity theory and malware, fraud, and spam at the national level. Crime Law Soc. Chang. 2021, 76, 109–130. [Google Scholar] [CrossRef]
  6. Holland, J.; Schmitt, P.; Feamster, N.; Mittal, P. New directions in automated traffic analysis. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 15–19 November 2021; pp. 3366–3383. [Google Scholar]
  7. Hasan, K.; Ahmed, K.; Biswas, K.; Islam, M.S.; Sianaki, O.A. Software-defined application-specific traffic management for wireless body area networks. Future Gener. Comput. Syst. 2020, 107, 274–285. [Google Scholar] [CrossRef]
  8. Hussain, F.; Abbas, S.G.; Shah, G.A.; Pires, I.M.; Fayyaz, U.U.; Shahzad, F.; Garcia, N.M.; Zdravevski, E. A framework for malicious traffic detection in IoT healthcare environment. Sensors 2021, 21, 3025. [Google Scholar] [CrossRef] [PubMed]
  9. Shafiq, M.; Tian, Z.; Bashir, A.K.; Du, X.; Guizani, M. IoT malicious traffic identification using wrapper-based feature selection mechanisms. Comput. Secur. 2020, 94, 101863. [Google Scholar] [CrossRef]
  10. Wei, D.; Shi, F.; Dhelim, S. A Self-Supervised Learning Model for Unknown Internet Traffic Identification Based on Surge Period. Future Internet 2022, 14, 289. [Google Scholar] [CrossRef]
  11. Wang, Z.; Fok, K.W.; Thing, V.L. Machine learning for encrypted malicious traffic detection: Approaches, datasets and comparative study. Comput. Secur. 2022, 113, 102542. [Google Scholar] [CrossRef]
  12. Yang, B.; Liu, D. Research on network traffic identification based on machine learning and deep packet inspection. In Proceedings of the 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chengdu, China, 15–17 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1887–1891. [Google Scholar]
  13. Zeng, X.; Chen, X.; Shao, G.; He, T.; Han, Z.; Wen, Y.; Wang, Q. Flow context and host behavior based shadowsocks’s traffic identification. IEEE Access 2019, 7, 41017–41032. [Google Scholar] [CrossRef]
  14. Majeed, U.; Khan, L.U.; Hong, C.S. Cross-silo horizontal federated learning for flow-based time-related-features oriented traffic classification. In Proceedings of the 2020 21st Asia-Pacific Network Operations and Management Symposium (APNOMS), Daegu, Republish of Korea, 22–25 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 389–392. [Google Scholar]
  15. Shafiq, M.; Tian, Z.; Bashir, A.K.; Jolfaei, A.; Yu, X. Data mining and machine learning methods for sustainable smart cities traffic classification: A survey. Sustain. Cities Soc. 2020, 60, 102177. [Google Scholar] [CrossRef]
  16. Shen, M.; Liu, Y.; Zhu, L.; Xu, K.; Du, X.; Guizani, N. Optimizing feature selection for efficient encrypted traffic classification: A systematic approach. IEEE Netw. 2020, 34, 20–27. [Google Scholar] [CrossRef]
  17. Abbasi, M.; Shahraki, A.; Taherkordi, A. Deep learning for network traffic monitoring and analysis (NTMA): A survey. Comput. Commun. 2021, 170, 19–41. [Google Scholar] [CrossRef]
  18. Dong, S.; Xia, Y.; Peng, T. Traffic identification model based on generative adversarial deep convolutional network. Ann. Telecommun. 2022, 77, 573–587. [Google Scholar] [CrossRef]
  19. Liu, Z.; Cai, L.; Zhao, L.; Yu, A.; Meng, D. Towards open world traffic classification. In Proceedings of the Information and Communications Security: 23rd International Conference, ICICS 2021, Chongqing, China, 19–21 November 2021; Proceedings, Part I 23. Springer: Berlin/Heidelberg, Germany, 2021; pp. 331–347. [Google Scholar]
  20. Velan, P.; Čermák, M.; Čeleda, P.; Drašar, M. A survey of methods for encrypted traffic classification and analysis. Int. J. Netw. Manag. 2015, 25, 355–374. [Google Scholar] [CrossRef]
  21. Wang, W.; Zhu, M.; Wang, J.; Zeng, X.; Yang, Z. End-to-end encrypted traffic classification with one-dimensional convolution neural networks. In Proceedings of the 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), Beijing, China, 22–24 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 43–48. [Google Scholar]
  22. Ren, X.; Gu, H.; Wei, W. Tree-RNN: Tree structural recurrent neural network for network traffic classification. Expert Syst. Appl. 2021, 167, 114363. [Google Scholar] [CrossRef]
  23. Wang, X.; Chen, S.; Su, J. App-net: A hybrid neural network for encrypted mobile traffic classification. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 424–429. [Google Scholar]
  24. Wang, P.; Wang, Z.; Ye, F.; Chen, X. Bytesgan: A semi-supervised generative adversarial network for encrypted traffic classification in SDN edge gateway. Comput. Netw. 2021, 200, 108535. [Google Scholar] [CrossRef]
  25. Zhao, R.; Deng, X.; Yan, Z.; Ma, J.; Xue, Z.; Wang, Y. MT-FlowFormer: A Semi-Supervised Flow Transformer for Encrypted Traffic Classification. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 2576–2584. [Google Scholar]
  26. Dong, S.; Zhou, D.; Ding, W.; Gong, J. Flow cluster algorithm based on improved K-means method. IETE J. Res. 2013, 59, 326–333. [Google Scholar] [CrossRef]
  27. Wang, Y.; Xiong, G.; Liu, C.; Li, Z.; Cui, M.; Gou, G. CQNet: A clustering-based quadruplet network for decentralized application classification via encrypted traffic. In Proceedings of the Machine Learning and Knowledge Discovery in Databases, Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021; Proceedings, Part IV 21. Springer: Berlin/Heidelberg, Germany, 2021; pp. 518–534. [Google Scholar]
  28. Han, S.; Wu, Q.; Zhang, H.; Qin, B. Light-weight Unsupervised Anomaly Detection for Encrypted Malware Traffic. In Proceedings of the 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC), Guilin, China, 11–13 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 206–213. [Google Scholar]
  29. Leo, J.; Kalita, J. Incremental deep neural network learning using classification confidence thresholding. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 7706–7716. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Niu, J.; Guo, D.; Teng, Y.; Bao, X. Unknown network attack detection based on open set recognition. Procedia Comput. Sci. 2020, 174, 387–392. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Zhang, Y.; Niu, J.; Guo, D. Unknown network attack detection based on open-set recognition and active learning in drone network. Trans. Emerg. Telecommun. Technol. 2022, 33, e4212. [Google Scholar] [CrossRef]
  32. Xia, Y.; Xiong, G.; Li, Z.; Gou, G.; Liu, C. GMAF: A Novel Gradient-Based Model with ArcFace for Network Traffic Classification. In Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Haikou, China, 20–22 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 291–300. [Google Scholar]
  33. Paramasivam, S.; Velusamy, R.L. Cor-ENTC: Correlation with ensembled approach for network traffic classification using SDN technology for future networks. J. Supercomput. 2023, 79, 8513–8537. [Google Scholar] [CrossRef]
  34. Liang, Y.; Wang, F.; Chen, S. DACS: A Double-layer Application Classification Scheme for Hybrid Zero-day Traffic. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1380–1387. [Google Scholar]
  35. Zhao, S.; Zhang, Y.; Sang, Y. Towards unknown traffic identification via embeddings and deep autoencoders. In Proceedings of the 2019 26th International Conference on Telecommunications (ICT), Hanoi, Vietnam, 8–10 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 85–89. [Google Scholar]
  36. Zhang, Y.; Zhao, S.; Sang, Y. Towards unknown traffic identification using deep auto-encoder and constrained clustering. In Proceedings of the Computational Science–ICCS 2019: 19th International Conference, Faro, Portugal, 12–14 June 2019; Proceedings, Part I 19. Springer: Berlin/Heidelberg, Germany, 2019; pp. 309–322. [Google Scholar]
  37. Pathmaperuma, M.H.; Rahulamathavan, Y.; Dogan, S.; Kondoz, A.M. Deep Learning for Encrypted Traffic Classification and Unknown Data Detection. Sensors 2022, 22, 7643. [Google Scholar] [CrossRef] [PubMed]
  38. Hu, X.; Gu, C.; Chen, Y.; Chen, X.; Wei, F. OpenCBD: A Network-Encrypted Unknown Traffic Identification Scheme Based on Open-Set Recognition. Wirel. Commun. Mob. Comput. 2022, 2022, 1746373. [Google Scholar] [CrossRef]
  39. Zhang, J.; Li, F.; Ye, F.; Wu, H. Autonomous unknown-application filtering and labeling for dl-based traffic classifier update. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Toronto, ON, Canada, 6–9 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 397–405. [Google Scholar]
  40. Liu, Q.; Li, M.; Cao, N.; Zhang, Z.; Yang, G. Improved Harris Combined With Clustering Algorithm for Data Traffic Classification. IEEE Access 2022, 10, 72815–72824. [Google Scholar] [CrossRef]
  41. Fu, Y.; Li, X.; Li, X.; Zhao, S.; Wang, F. Clustering unknown network traffic with dual-path autoencoder. Neural Comput. Appl. 2023, 35, 8955–8966. [Google Scholar] [CrossRef]
  42. Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural network for representation learning. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 712–717. [Google Scholar]
  43. Du, J. Understanding of object detection based on CNN family and YOLO. In Proceedings of the 2nd International Conference on Machine Vision and Information Technology (CMVIT 2018), Journal of Physics: Conference Series, Hong Kong, China, 23–25 February 2018; IOP Publishing: Bristol, UK, 2018; Volume 1004, p. 012029. [Google Scholar]
  44. Tokunaga, H.; Teramoto, Y.; Yoshizawa, A.; Bise, R. Adaptive weighting multi-field-of-view CNN for semantic segmentation in pathology. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12597–12606. [Google Scholar]
  45. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  47. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  48. Ding, C.; Peng, H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005, 3, 185–205. [Google Scholar] [CrossRef]
  49. Yang, L.; Finamore, A.; Jun, F.; Rossi, D. Deep learning and zero-day traffic classification: Lessons learned from a commercial-grade dataset. IEEE Trans. Netw. Serv. Manag. 2021, 18, 4103–4118. [Google Scholar] [CrossRef]
  50. Yun, X.; Xie, J.; Li, S.; Zhang, Y.; Sun, P. Detecting unknown HTTP-based malicious communication behavior via generated adversarial flows and hierarchical traffic features. Comput. Secur. 2022, 121, 102834. [Google Scholar] [CrossRef]
  51. Geng, C.; Huang, S.J.; Chen, S. Recent advances in open set recognition: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3614–3631. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Incremental unknown traffic recognition system framework.
Figure 2. Two CNN network structures. The upper part is the network structure of AlexNet, and the lower part is the network structure of VGG16.
Figure 3. Loss function of the model during pretraining.
Figure 4. The relationship between the number of fused features and the accuracy of model detection.
Figure 5. Histogram of the distance distribution of all features from the average feature.
Figure 6. Test accuracy and time of single-channel and multiple-channel structures.
Figure 7. Confusion matrix of known traffic classification under different scenarios. (a) Scenario A. (b) Scenario B. (c) Scenario C. (d) Scenario D. (e) Scenario E.
Figure 8. Confusion matrix of unknown traffic classification under different scenarios. (a) Scenario A. (b) Scenario B. (c) Scenario C. (d) Scenario D. (e) Scenario E.
Figure 9. ROC curves of classifiers under different scenarios.
Figure 10. Comparison of time consumption of different algorithms.
Figure 11. Accuracy of unknown traffic classification under different openness.
Figure 12. FTF of unknown traffic classification under different openness values.
Figure 13. F1 score of unknown traffic classification under different openness.
Table 1. The traffic categories of the ISCX-VPN-Tor dataset.
Dataset: ISCX-VPN-Tor
Traffic Class | Number of Flows
VPN-Email | 2054
Tor-Email | 1866
VPN-Chat | 1365
Tor-Chat | 1574
VPN-Streaming | 13,682
Tor-Streaming | 6839
VPN-File Transfer | 4732
Tor-File Transfer | 8874
VPN-VoIP | 3639
Tor-VoIP | 7680
VPN-P2P | 6635
Tor-P2P | 3453
Total | 62,392
Table 2. The traffic categories of the NSL-KDD dataset.
Dataset: NSL-KDD
Traffic Class | Subclass | Number of Flows
DoS | back | 3143
DoS | land | 14,396
DoS | neptune | 851
DoS | pod | 1328
DoS | smurf | 9743
DoS | teardrop | 6261
Probe | ipsweep | 3285
Probe | portsweep | 3572
Probe | nmap | 2097
Probe | satan | 638
U2R | rootkit | 193
U2R | buffer_overflow | 97
U2R | loadmodule | 265
U2R | perl | 121
R2L | ftp_write | 626
R2L | guess_password | 305
R2L | imap | 593
R2L | multihop | 122
R2L | phf | 31
R2L | spy | 54
R2L | warezclient | 89
R2L | warezmaster | 106
Normal | - | 48,382
Total | - | 96,298
Table 3. The traffic categories of SelfDataset.
Dataset: SelfDataset
Traffic Class | Number of Flows
Altium | 309
Vivado | 2492
Cadence | 130
AutoCAD | 1743
Mathematica | 164
exploitnou | 47
ssh_bruteforce | 83
miner | 71
Total | 5039
Table 4. The parameter values of the pretrained neural networks.
Parameter | Value
Neural network architecture | AlexNet/VGG16/LSTM
Input size | 32 × 32
Optimization | SGD
Momentum | 0.9
Decay | 10^(-4)
Mini-batch | 32
Learning rate | 0.001
Table 5. Known traffic and unknown traffic settings for different scenarios.

ISCX-VPN-Tor
Scenario A: Known classes: Email, Streaming, File Transfer, P2P. Unknown classes: Chat, VoIP.
Scenario B: Known classes: vpn_spotify_audio, vpn_vimeo_video, vpn_aim_chat, vpn_ftps_filetransfer, tor_mail_imap, p2p_multiplespeed, icq_chat, tor_skype_chat, tor_aim_chat, tor_video_vimeo, tor_filetransfer_skype, tor_voip_facebook, tor_voip_hangouts. Unknown classes: vpn_skype_audio, vpn_hangouts_audio, facebook_video, vpn_facebook_chat, vpn_skype_filetransfer, tor_mail_pop, tor_p2p_vuze, tor_facebook_chat, tor_video_youtube, tor_filrtransfer_ftp, tor_filrtransfer_sftp, ssl.

NSL-KDD
Scenario C: Known classes: DoS, R2L. Unknown classes: Probe, U2R.
Scenario D: Known classes: back, land, pod, portsweep, satan, rootkit, ftp_write, guess_password, multihop, warezclient. Unknown classes: neptune, smurf, ipsweep, nmap, loadmodule, imap, warezmaster.

SelfDataset
Scenario E: Known classes: Altium, Vivado, AutoCAD, Mathematica, exploitnou. Unknown classes: Cadence, ssh_bruteforce, miner.
Table 6. Experimental results of comparison of MI-UTR with other methods.

Scenario A
Method | Acc | TPR | FPR | FTF | F1
CNN | 46.02% | 55.87% | 64.29% | 53.74% | 60.20%
CNN-GR | 72.49% | 98.16% | 33.41% | 74.28% | 70.31%
OPEN-CNN | 82.66% | 58.36% | 11.34% | 67.04% | 85.79%
HMCD | 68.83% | 86.95% | 20.34% | 77.34% | 75.28%
MI-UTR | 86.15% | 90.88% | 16.38% | 80.48% | 83.78%

Scenario B
Method | Acc | TPR | FPR | FTF | F1
CNN | 55.35% | 71.73% | 68.89% | 56.26% | 64.17%
CNN-GR | 77.04% | 96.04% | 37.47% | 70.21% | 75.92%
OPEN-CNN | 84.29% | 98.37% | 70.30% | 58.32% | 94.67%
HMCD | 73.36% | 85.37% | 18.33% | 80.16% | 75.89%
MI-UTR | 88.60% | 93.27% | 12.89% | 87.93% | 81.38%

Scenario C
Method | Acc | TPR | FPR | FTF | F1
CNN | 57.98% | 50.24% | 66.22% | 40.86% | 53.88%
CNN-GR | 60.72% | 98.02% | 43.60% | 78.94% | 93.70%
OPEN-CNN | 71.78% | 62.82% | 60.69% | 74.56% | 82.66%
HMCD | 66.05% | 72.27% | 64.28% | 59.65% | 74.02%
MI-UTR | 90.36% | 95.30% | 13.28% | 91.05% | 88.45%

Scenario D
Method | Acc | TPR | FPR | FTF | F1
CNN | 55.32% | 48.33% | 67.28% | 55.90% | 58.17%
CNN-GR | 64.38% | 97.48% | 65.42% | 55.06% | 64.78%
OPEN-CNN | 76.85% | 98.31% | 51.47% | 61.25% | 82.64%
HMCD | 70.91% | 82.31% | 27.94% | 73.59% | 86.35%
MI-UTR | 91.88% | 93.39% | 12.66% | 91.42% | 93.66%

Scenario E
Method | Acc | TPR | FPR | FTF | F1
CNN | 75.05% | 85.52% | 35.28% | 74.55% | 81.67%
CNN-GR | 92.22% | 90.78% | 24.96% | 87.27% | 92.90%
OPEN-CNN | 84.65% | 93.52% | 10.45% | 91.58% | 86.15%
HMCD | 89.27% | 91.33% | 28.49% | 87.34% | 90.88%
MI-UTR | 96.25% | 94.66% | 3.75% | 93.96% | 97.43%
The optimal metrics for each scenario are marked in bold font.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
