4.1. Dataset Description
The NSL-KDD dataset (http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) [35] is a new version of the KDD Cup’99 dataset. About 78% and 75% of the records in the KDD Cup’99 train and test sets, respectively, are duplicates. Thus, the NSL-KDD dataset removes the redundant records from the train set and contains no duplicate records in the new test sets. This dataset retains 41 input features and 1 output feature. The input features are [duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate]. The output feature is [Attack], and its value is one of [’Normal’, ’R2L’, ’DoS’, ’U2R’, ’Probe’]. As a result, the numbers of records in the train and test sets are reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small partition. The experiments use the KDDTest+ file to evaluate the learned models.
The ISCX dataset (http://www.unb.ca/cic/research/datasets/ids.html) is a real-life dataset created by collecting network traffic data over several consecutive days. ISCX 2012 builds on the concept of profiles, which include details of intrusions. The dataset was created by Shiravi et al. [36] and was designed specifically for developing, testing, and evaluating network intrusion and anomaly detection algorithms. Two profiles, α and β, were used during the generation of the dataset. The α profiles were constructed using the knowledge of specific traces, and real packet traces were analyzed to create α and β profiles for agents that generated real-time traffic for the HTTP, SMTP, SSH, IMAP, POP3, and FTP protocols. Various multi-stage attack scenarios were explored to generate the malicious traffic. The dataset consists of seven days of network activities, both normal and malicious, provided in two file formats: packet capture files (*.pcap) and XML files (*.xml).
In data preparation, ready-made training and testing datasets are not available in the original dataset, and it is difficult to perform experiments on the huge raw data (*.pcap files). Hence, the “labelled_flows_xml” files, which contain flow information in XML format for each day, are used. Furthermore, the labeled flow files support the use of supervised machine learning algorithms. The flows were generated using the IBM QRadar appliance and labeled as “Normal” or “Attack”: the [Tag] feature indicates whether a flow is normal or part of an attack scenario. However, all flows from day 1, Friday (11 June 2010), were normal; therefore, no flow XML file was included for that day. The XML files contain 19 input attributes and one output attribute. The attributes in each day’s data file are [appName, totalSourceBytes, totalDestinationBytes, totalDestinationPackets, totalSourcePackets, sourcePayloadAsBase64, sourcePayloadAsUTF, destinationPayloadAsBase64, destinationPayloadAsUTF, direction, sourceTCPFlagsDescription, destinationTCPFlagsDescription, source, protocolName, sourcePort, destination, destinationPort, startDateTime, stopDateTime, Tag].
In preprocessing, each *.xml file is read and converted to a *.csv file; each attribute in the XML file becomes a column in the CSV file. The [Tag] feature is the output, containing the target values, and the remaining features provide the input values. The preprocessed dataset is randomly split into training and testing sets at a ratio of 75% to 25%, respectively.
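The random 75%/25% split described above can be sketched with the standard library alone; this is a minimal illustration, not the authors' exact code, and the flow records here are placeholder tuples:

```python
import random

def split_dataset(rows, train_ratio=0.75, seed=42):
    """Randomly shuffle the flow records and split them into
    training and testing sets at the given ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed for reproducibility
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

# Hypothetical flow records: (appName, totalSourceBytes, Tag)
flows = [("HTTPWeb", 1200, "Normal"), ("SSH", 300, "Attack"),
         ("SMTP", 800, "Normal"), ("HTTPWeb", 150, "Normal")] * 25  # 100 rows

train, test = split_dataset(flows)
print(len(train), len(test))  # 75 25
```

Shuffling before the cut ensures both partitions mix flows from across the capture period rather than splitting by time.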
Table 1 presents a description of the number of training and testing data of ISCX dataset.
Further adjustments were made to make the data fit for use. The number of attributes had to be reduced from the full attribute set. The following attributes were chosen for the experiment: [appName, totalSourceBytes, totalDestinationBytes, totalDestinationPackets, totalSourcePackets, direction, sourceTCPFlagsDescription, destinationTCPFlagsDescription, source, protocolName, sourcePort, destination, destinationPort, startDateTime, stopDateTime, Tag]. Bulky or redundant payload attributes, namely [sourcePayloadAsBase64, sourcePayloadAsUTF, destinationPayloadAsBase64, destinationPayloadAsUTF], were removed.
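The XML-to-CSV conversion with this reduced attribute list can be sketched with the standard library's `xml.etree.ElementTree` and `csv` modules. The element layout below (a `TestbedSatJun12Flows` record with one child element per attribute) is an assumed approximation of the labelled flow files, not their verified schema:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Attributes kept for the experiment (payload attributes are dropped).
KEPT = ["appName", "totalSourceBytes", "totalDestinationBytes",
        "totalDestinationPackets", "totalSourcePackets", "direction",
        "sourceTCPFlagsDescription", "destinationTCPFlagsDescription",
        "source", "protocolName", "sourcePort", "destination",
        "destinationPort", "startDateTime", "stopDateTime", "Tag"]

def xml_flows_to_csv(xml_text, csv_file):
    """Read labelled flow records from an XML string and write one CSV
    row per flow, keeping only the KEPT attributes."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(csv_file)
    writer.writerow(KEPT)                  # header row: one column per attribute
    for flow in root:                      # one child element per flow record
        writer.writerow([flow.findtext(name, default="") for name in KEPT])

# Tiny hypothetical record in the same general shape as the flow files.
sample = """<dataroot>
  <TestbedSatJun12Flows>
    <appName>HTTPWeb</appName><totalSourceBytes>1200</totalSourceBytes>
    <protocolName>tcp_ip</protocolName><Tag>Normal</Tag>
  </TestbedSatJun12Flows>
</dataroot>"""

buf = io.StringIO()
xml_flows_to_csv(sample, buf)
print(buf.getvalue().splitlines()[1])
```

Attributes absent from a record are emitted as empty CSV cells, so every row has the same 16 columns.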
4.2. Experiment Results
Three experiments are performed on the NSL-KDD and ISCX datasets. They are implemented on Windows 10 in the Python language.
The first experiment builds the SFSDT model to generate the best feature subset for each IDS dataset. The proposed model generates a list of candidate feature subsets, and the best subset is selected based on the best accuracy and error scores.
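The sequential-forward-selection idea behind SFSDT can be sketched generically: start from an empty subset, greedily add the feature that most improves a score, and keep the best-scoring subset seen overall. In this sketch `score_fn` stands in for training a decision tree and returning its validation accuracy; the toy score used below is only for illustration:

```python
def sequential_forward_selection(features, score_fn, max_size=None):
    """Greedy SFS: start empty, repeatedly add the feature that gives the
    best score, and remember the best-scoring subset seen overall."""
    selected, remaining = [], list(features)
    best_subset, best_score = [], float("-inf")
    max_size = max_size or len(features)
    while remaining and len(selected) < max_size:
        # Try extending the current subset with each remaining feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, winner = max(scored)
        selected.append(winner)
        remaining.remove(winner)
        if score > best_score:           # track the global best, not the last step
            best_score, best_subset = score, list(selected)
    return best_subset, best_score

# Toy score: subsets containing both "flag" and "src_bytes" score highest;
# a small penalty per feature favors compact subsets.
def toy_score(subset):
    return len({"flag", "src_bytes"} & set(subset)) - 0.01 * len(subset)

subset, score = sequential_forward_selection(
    ["duration", "flag", "src_bytes", "service"], toy_score)
print(subset)
```

With a decision-tree accuracy as `score_fn`, the returned subset corresponds to the combination of features with the best score, as in Experiment 1.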
The second experiment detects the types of attack in both datasets. This work builds three classifiers based on variant RNNs: the conventional RNN, LSTM, and GRU. These models are trained on the best feature subset selected by the proposed model. In the NSL-KDD dataset, this task reports the classification results for four attack classes, [’R2L’, ’DoS’, ’U2R’, ’Probe’], and the non-attack class [’Normal’]. In the ISCX dataset, the detection results for the two classes [’Normal’] and [’Attack’] are presented. The classification results are evaluated using the confusion matrix and ROC curves.
The final experiment measures the memory profiles of the learning models, including memory used and execution time, in two cases: the first on the original feature set and the second on the selected feature subset.
4.2.1. Experiment 1
The proposed SFSDT model is applied to both datasets, NSL-KDD and ISCX, whose original feature counts are 41 and 15, respectively. The model visualizes the accuracy and error scores obtained for each number of combined features in each dataset.
In the NSL-KDD dataset, the results of the SFSDT model are plotted in Figure 9a,b. The proposed model achieved its highest accuracy of 0.969017, and the minimum error score of 0.00336 was obtained at 12 combined features. Therefore, the best feature subset, comprising 12 features, is selected: [protocol_type, service, flag, src_bytes, logged_in, num_file_creations, is_guest_login, count, srv_count, dst_host_srv_diff_host_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate].
Similarly, for the ISCX dataset, Figure 10 and Figure 11 show the accuracy and error scores of the proposed model for each day’s data. For example, in the Saturday data the best feature subset was obtained with 3 combined features: [appName, totalDestinationBytes, source].
In summary, the selected feature subsets for the two datasets are detailed in Table 2, where No.FF is the number of features in the original dataset and No.SF is the number of features in the best subset selected by the SFSDT algorithm.
4.2.2. Experiment 2
This experiment presents the results of the variant RNN models (Simple RNN, LSTM, and GRU) trained on the feature subsets selected in Experiment 1. The output class of the ISCX dataset is [’Normal’, ’Attack’]; hence, the three learning models treat it as a binary classification problem, with [’Normal’] denoted by 0 and [’Attack’] denoted by 1. In the NSL-KDD dataset, the output classes comprise the non-attack class and four attack types, denoted by [’Normal’, ’R2L’, ’DoS’, ’U2R’, ’Probe’]. In the ROC visualizations, [’Normal’, ’R2L’, ’DoS’, ’U2R’, ’Probe’] are denoted by [’class 0’, ’class 1’, ’class 2’, ’class 3’, ’class 4’], respectively. Since there are more than two output classes, this is a multi-class classification problem for the three models. Therefore, the confusion matrix and ROC measurements are used to visualize the detection results for the attack types and non-attack traffic.
In the NSL-KDD dataset, the attack classification results are illustrated in Figure 12. The confusion matrices display the number of correct and incorrect predictions for each output class, compared to the actual class; the numbers of correct predictions lie on the main diagonal of the confusion matrix.
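Such a confusion matrix is computed directly from the actual and predicted labels; a minimal stdlib sketch (the labels below are illustrative, not results from the paper):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows = actual class, columns = predicted class; correct
    predictions fall on the main diagonal."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes   = ["Normal", "R2L", "DoS", "U2R", "Probe"]
actual    = ["Normal", "DoS", "DoS", "R2L", "Probe", "Normal"]
predicted = ["Normal", "DoS", "Normal", "R2L", "Probe", "Normal"]

cm = confusion_matrix(actual, predicted, classes)
correct = sum(cm[i][i] for i in range(len(classes)))
print(correct / len(actual))  # overall accuracy from the diagonal
```

Summing the diagonal and dividing by the total number of samples gives the overall accuracy, which is how the per-dataset accuracies in Experiment 2 are derived from the matrices.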
Besides, the ROC curve is used to measure the performance of detecting the different attacks on the NSL-KDD dataset (see Figure 13). Most attack types were detected more accurately by the LSTM and GRU models.
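For a single class treated one-vs-rest (as in the ’class 0’…’class 4’ curves), the ROC points and area under the curve can be computed from the model's scores by sweeping the decision threshold. This is a simplified sketch that assumes binary 0/1 labels, at least one sample of each label, and no tied scores:

```python
def roc_points(scores, labels):
    """Compute (FPR, TPR) points for a binary ROC curve by sweeping the
    decision threshold down through the sorted scores; labels are 0/1."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in ranked:
        if label == 1:
            tp += 1          # lowering the threshold past a positive raises TPR
        else:
            fp += 1          # past a negative raises FPR
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical attack-vs-rest scores for six flows (label 1 = attack).
points = roc_points([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 0, 1, 0, 0])
print(round(auc(points), 3))
```

An AUC near 1.0 corresponds to the well-separated curves reported for the LSTM and GRU models, while 0.5 is chance level.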
Similarly, for the ISCX dataset, the confusion matrix results of the three models are displayed in Figure 14 (Saturday, Sunday, and Monday data) and Figure 15 (Tuesday, Wednesday, and Thursday data).
Based on the confusion matrix results, the accuracies of attack-type detection in each dataset are calculated and summarized in Table 3. The average accuracies of the RNN, LSTM, and GRU models on the NSL-KDD feature subset data are 89.6%, 92%, and 91.8%, respectively; on the ISCX feature subset data, they are 94.75%, 97.5%, and 97.08%. Therefore, the LSTM model performs slightly better than the other models.
4.2.3. Experiment 3
As mentioned in Section 4.2.2, the LSTM model obtained the best accuracy among the three models. Hence, the LSTM model is selected for measuring memory profiles in two cases: Case 1 measures the memory profile of the LSTM model on the complete feature set, and Case 2 measures it on the best feature subset generated by the SFSDT model. The memory profile reports the memory used (in MiB) and the execution time (in seconds) of the Python scripts. The experiment runs the learning model as an executable, records its memory usage, and plots the recorded memory usage.
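This kind of measurement can be approximated with the standard library alone using `tracemalloc` (peak Python heap allocations) and `time.perf_counter` (elapsed time); this is an analogous stdlib sketch, not the profiler used in the paper, and the workload below is a placeholder rather than LSTM training:

```python
import time
import tracemalloc

def profile(fn, *args):
    """Run fn(*args), returning (result, peak memory in MiB, elapsed seconds)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak / (1024 * 1024), elapsed

def workload(n):
    # Placeholder for training the learning model.
    return sum(i * i for i in range(n))

result, peak_mib, seconds = profile(workload, 100_000)
print(f"peak {peak_mib:.2f} MiB in {seconds:.3f} s")
```

Note that `tracemalloc` tracks only Python-level allocations; process-wide resident memory, as reported in the figures, would need an external profiler.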
Figure 16 shows the memory profiles of the LSTM model on the NSL-KDD dataset. The memory used and the training time of the two cases differ only slightly. In particular, when trained on the selected feature subset, the LSTM model occupies under 250 MiB and takes nearly 50 s, whereas when trained on the complete feature set it uses about 300 MiB and takes approximately 60 s. For the ISCX dataset, the memory profile results of the LSTM model are plotted in Figure 17 and Figure 18.
In summary, the memory used and execution time of the LSTM model in both cases on the ISCX dataset are listed in Table 4; smaller values are better. Case 2 obtained better results for both memory used and execution time in almost all instances.
On the other hand, this work also measures the memory profile of the proposed SFSDT model while it finds the best feature subset on both datasets. Figure 19 shows the memory profile of the proposed model on the NSL-KDD dataset, and Figure 20 presents that of the SFSDT model on the ISCX dataset.
In summary, the memory profile results of the SFSDT model are presented in Table 5. On the NSL-KDD dataset, the SFSDT model spent 63 s of execution time and used 145 MiB of memory to generate and find the best features. On the ISCX dataset, the proposed model spent on average 120.83 s and used 573.33 MiB, respectively.
4.3. Discussion
Presently, many fields contain big data, for example, finance, health care, stock trading, and banking. Analyzing these datasets requires considerable effort and a variety of methods, and it becomes a challenge when huge data have high-dimensional features. A good feature selection technique helps analysts choose the important features in big data without losing important information, and it reduces the time and cost of data analysis. The proposed method can be applied to other fields with big data to generate the best feature subset, supporting a further prediction model or other activities such as statistics and prediction. In other words, this method is a low-cost design that helps data analysts make quick and accurate decisions about which features are important and effective, keeping them to support further purposes.
In the proposed scheme, the goal of SFSDT is to generate and find the best feature subset from the complete feature set. The results of the proposed method on the ISCX data differ from day to day because the feature values differ, even though each day’s dataset has the same 15 original features; hence, the obtained result depends on the values of the features. Therefore, when suspicious traffic with a different feature set appears, the proposed model can recognize it accurately by generating feature subsets that contain the differing features and then evaluating whether those subsets are the best. Besides, SFSDT performs feature selection based on a learning model, which helps address the curse of dimensionality arising from high-dimensional big data. The results of Experiment 3 show that the proposed method can find the best feature subset in a short time and with little memory.
Furthermore, the variant RNNs applied to the best feature subset reduce the amount of required computation and improve the accuracy of each attack classification as well as the average accuracy of the IDS classification model. In particular, this work compares the proposed model with previous models on two criteria: attack-type detection and intrusion detection accuracy on both IDS datasets.
First, Table 6 and Table 7 compare the attack-type detection accuracies of the proposed models with earlier IDS models on the NSL-KDD and ISCX datasets, respectively. Based on the results obtained, the LSTM and GRU models outperform the others; in particular, the detection accuracy for U2R and R2L attacks improves significantly with the LSTM and GRU models.
Second, Table 8 and Table 9 compare the approached models with well-known IDS models on both datasets. The LSTM model achieved higher accuracies than the other IDS models on both datasets.