Feature Level Fusion of Multi-Source Data For Network Intrusion Detection
Corresponding Author:
Harshitha Somashekar
Department of Information Science and Engineering, Adichunchanagiri Institute of Technology
affiliated to Visvesvaraya Technological University
Belagavi 590018, Karnataka, India
Email: sh@mcehassan.ac.in
1. INTRODUCTION
Millions of autonomous systems connect billions of people to the internet globally, and internet
traffic has grown exponentially for many years. This traffic carries information from a wide variety of
sources and may contain anomalies that threaten network security [1]. A variety of technologies are used
to counter these threats, including firewalls, user authentication, and data encryption. These technologies
alone, however, do not analyze traffic in sufficient depth. To get beyond their limitations, network
intrusion detection systems (NIDS) examine network packets in more depth than standard intrusion
detection [1] and intrusion-tolerant [2] systems.
In recent years, a new generation of network security solutions known as NIDS has appeared,
following the rapid advancement of more established security measures such as data encryption and firewalls
[3]. Because it can effectively fend off many attacks and destructive activities, NIDS is known as the
internet's second line of protection. Yet, in the age of big data, NIDS faces significant difficulties due to the
volume of traffic data. First, massive quantities of multi-scale data demand substantial computational and
storage power and make processing more challenging. Second, large amounts of duplicate and unrelated data
may make it difficult to detect network vulnerabilities. Finally, large-scale data processing and analytics make
it challenging to identify some emerging attacks. In addition, there is a pressing need for efficient solutions
because of the inherent flaws of NIDSs, namely their high rates of false positives (FP) and false negatives (FN).
In recent years, data fusion, a promising big data technology, has been used in the field of NIDS to address
these issues. Broadly speaking, depending on where fusion is needed, data fusion may be implemented in three
layers: the data layer, the feature layer, and the decision layer. The data layer, the most basic system layer,
is in charge of integrating and processing raw network data; the feature layer, the next layer up, is in
charge of combining and condensing the features of the preprocessed data; and the decision layer, the top
layer, is in charge of integrating and combining the inferences or decisions made by various processing units.
Most data fusion studies in the field of NIDS pay attention only to the feature layer and decision layer,
because the public datasets they use have already undergone data-layer fusion and contain the network data
that needs to be fused. Using data fusion at the feature level may increase the efficiency of NIDSs by
significantly reducing the volume of data to be processed. The improved data produced by feature fusion may
also increase the robustness and precision of the system and support decision-making. Data fusion is an
interdisciplinary research area with potential applications in domains including target detection,
intrusion detection, image recognition, and autonomous control.
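The difference between the feature layer and the decision layer can be illustrated with a minimal sketch. The function names, feature names, and values below are hypothetical and are not taken from the datasets or tools used in this work; the sketch only contrasts merging features before classification with merging verdicts after classification.

```python
from collections import Counter


def feature_level_fusion(features_a, features_b):
    """Combine two feature dictionaries describing the same event into one record."""
    fused = dict(features_a)
    fused.update(features_b)
    return fused


def decision_level_fusion(decisions):
    """Combine the labels emitted by several detectors by majority vote."""
    return Counter(decisions).most_common(1)[0][0]


# Hypothetical features for one connection, observed by two sources.
flow_view = {"duration": 2, "protocol": "tcp"}
host_view = {"failed_logins": 3}
fused_record = feature_level_fusion(flow_view, host_view)

# Hypothetical verdicts from three independent detectors.
verdict = decision_level_fusion(["attack", "normal", "attack"])
```

In the feature-level case a single classifier sees the fused record, whereas in the decision-level case each detector classifies on its own and only the labels are combined.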
The brief introduction to data fusion applications that follows is based on a survey of selected
relevant literature. Cao et al. [4] presented a data-fusion-based fire automation control system
incorporated into intelligent buildings. A smart home control system based on data fusion was proposed by
Zhang et al. [5]. It combines data from several sources to manage home appliances and create an intelligent
living space. Xiao and Shi [6] proposed a data fusion system based on Dempster-Shafer (D-S) evidence
reasoning in which the characteristics needed to identify a missile target are extracted using two
charge-coupled device cameras and an infrared sensor. Compared with employing just one sensor, the
likelihood of identification achieved by merging the three sensors with D-S evidence is significantly higher.
A wireless sensor network-based fire alarm system was created by Xiangdong and Xue [7] using fuzzy data
fusion theory. This technology increases the intelligence of the monitoring while also providing accurate
detection, and the suggested approach outperforms conventional single-sensor diagnostic approaches.
A deep model for classification and data fusion in remote sensing was presented by Chen et al. [8].
Convolutional neural networks (CNN) are used to extract abstract features from light detection and ranging
(LiDAR) and hyperspectral image data, and deep neural networks (DNN) are then used to combine the features
that the CNN extracted. The suggested deep fusion model offers comparable classification accuracy, and the
suggested deep learning concept also creates new prospects for fusing remote sensing data in the future.
Yan et al. [9] applied data fusion to reputation generation and suggested an opinion fusion and
mining-based reputation generation approach. The opinions were combined and grouped into several primary
opinion sets, each of which contained opinions with related or identical attitudes. The rating is averaged
over the opinion sets to normalize the entity's reputation. The accuracy and adaptability of the strategy
were shown by experimental findings from real data analysis of several well-known commercial websites in
Chinese and English.
Liu et al. [10] gathered four publications in a special issue on the use of data fusion in the internet of
things (IoT). Owing to the large number of wireless sensor devices, the IoT produces massive, multi-sourced,
heterogeneous, dynamic, and sparse data. They stated in the special issue that data fusion is a crucial
instrument for organizing and analyzing this data in order to increase processing effectiveness and offer new
insight. At each level of data processing in the IoT, data fusion can use the synergy between datasets to reduce
the amount of data, filter noisy measurements, and draw conclusions. A cluster-based data fusion model for
intrusion detection has also been described. Before reaching a final analytic result, the model aggregates input
from several analyzers in a centralized way. The key advantages of the suggested technique are its scalability
and its accuracy in fusing data from several detecting modules. Moreover, the data fusion module considers each
analyzer's effectiveness in the fusion process and has the ability to foresee impending network threats.
Previous research has explored the impact of fusion on a limited number of classifiers but did not explicitly
investigate its effect on all classifiers used. The outcomes of these studies were unsatisfactory for the
selected classifiers, and little research has been carried out on multi-source datasets. The following are the
main contributions of the proposed research work: i) to perform data fusion between the NSL-KDD and UNSW-NB15
multi-source datasets and ii) to utilize the merged data with machine learning algorithms to evaluate the
performance.
2. PROPOSED METHOD
The four primary components of our proposed intrusion detection approach are dataset selection, feature
selection, data fusion, and machine learning implementation, as illustrated in Figure 1. We describe
the proposed approach in this section. First, two open datasets are chosen for model building: NSL-KDD
Feature level fusion of multi-source data for network intrusion detection (Harshitha Somashekar)
2958 ISSN: 2252-8938
[11] and UNSW-NB15 [12]. Second, based on a literature review, the pertinent data attributes of the
NSL-KDD and UNSW-NB15 datasets are chosen [13]. Third, the datasets are combined by feature-level data
fusion using an inner join operation, as shown in Figure 2, with the KNIME tool. Finally, the
outcomes of machine learning-based models using the combined dataset are assessed.
Proposed algorithm and stepwise experimental procedure
Algorithm 1 shows the details of the proposed algorithm used for the experiment.
// Function definitions:
‒ InnerJoin(D1, D2): performs inner join operation on datasets D1 and D2
‒ Train(DF): trains machine learning models on dataset DF
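As an illustration of the InnerJoin(D1, D2) function, the following is a minimal sketch in plain Python, not the KNIME implementation used in the experiments. The joining columns (protocol, service) and the sample rows are hypothetical, since the joining columns are not specified here; Train(DF) is not sketched.

```python
from collections import defaultdict


def inner_join(d1, d2, keys):
    """Return one merged row for every pair of d1 and d2 rows whose key values match."""
    # Index the right table by its key values for fast lookup.
    index = defaultdict(list)
    for row in d2:
        index[tuple(row[k] for k in keys)].append(row)

    fused = []
    for left in d1:
        for right in index.get(tuple(left[k] for k in keys), []):
            merged = dict(left)
            # Copy the non-key columns of the right row; on a name clash the right value wins.
            merged.update({c: v for c, v in right.items() if c not in keys})
            fused.append(merged)
    return fused


# Hypothetical rows standing in for the two source datasets.
d1 = [
    {"protocol": "tcp", "service": "http", "src_bytes": 181},
    {"protocol": "udp", "service": "dns", "src_bytes": 45},
    {"protocol": "icmp", "service": "eco_i", "src_bytes": 8},
]
d2 = [
    {"protocol": "tcp", "service": "http", "sttl": 31},
    {"protocol": "tcp", "service": "http", "sttl": 254},
    {"protocol": "udp", "service": "dns", "sttl": 62},
    {"protocol": "tcp", "service": "ftp", "sttl": 62},
]
fused = inner_join(d1, d2, keys=("protocol", "service"))
```

In this toy input the tcp/http row of d1 matches two rows of d2 and therefore appears twice in the output, while the icmp row of d1 and the tcp/ftp row of d2 have no partner and are dropped.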
The steps in Algorithm 1 can be applied to other datasets that share one or more joining columns. A join
operation combines two separate tables row by row: every row from the left table is merged with every row
from the right table that has identical values in the joining columns. Depending on the join type, the output
can also contain unmatched rows. The inner join keeps only matched rows, so its output table contains the data
present in both tables. After the datasets are fused using the inner join operation, new data samples are
obtained for both training and testing. The new datasets are given as input to three machine learning
algorithms, namely gradient boosted tree, tree ensemble, and random forest, and the final results are obtained
as shown in Figure 3. The simulation model setup shown in Figure 3 is built using the KNIME tool. The steps of
the simulation procedure are:
Step 1: Create a new environment.
Step 2: Drag and drop the required icons from the tool box.
Step 3: Connect the nodes as shown in Figure 3.
Step 4: Load the training and testing CSV files into the CSV reader.
Step 5: Click the run button in the menu.
Step 6: Find the results in the scorer icon.
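The accuracy reported in the final step is the fraction of test samples whose predicted label matches the actual label. The sketch below shows that computation in plain Python under the assumption that the scorer compares two label columns; the function name and the sample labels are hypothetical, and this is not KNIME code.

```python
from collections import Counter


def score(actual, predicted):
    """Return the confusion counts and overall accuracy for two label sequences."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    # Count each (actual, predicted) pair, i.e. the cells of the confusion matrix.
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return confusion, correct / len(actual)


# Hypothetical labels for five test samples.
actual = ["attack", "attack", "normal", "normal", "normal"]
predicted = ["attack", "normal", "normal", "normal", "attack"]
confusion, accuracy = score(actual, predicted)
```

The confusion counts also expose the false positives (normal traffic predicted as attack) and false negatives (attacks predicted as normal) separately, which a single accuracy figure hides.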
Table 1. Classification accuracy for standard datasets
Sl. no.  Classifier             Accuracy (%)
1        Tree ensemble          93.0
2        Gradient boosted tree  93.8
3        Random forest          92.8

Table 2. Classification accuracy for fused datasets
Sl. no.  Classifier             Accuracy (%)
1        Tree ensemble          96.78
2        Gradient boosted tree  95.90
3        Random forest          97.30
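The per-classifier gain from fusion follows directly from Tables 1 and 2. The short sketch below only subtracts the reported accuracies; the variable names are illustrative.

```python
# Accuracies (%) as reported in Table 1 (standard datasets) and Table 2 (fused datasets).
standard_acc = {"Tree ensemble": 93.0, "Gradient boosted tree": 93.8, "Random forest": 92.8}
fused_acc = {"Tree ensemble": 96.78, "Gradient boosted tree": 95.90, "Random forest": 97.30}

# Gain in percentage points for each classifier, rounded to two decimals.
gains = {name: round(fused_acc[name] - standard_acc[name], 2) for name in standard_acc}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The gains are 3.78 points for tree ensemble, 2.10 for gradient boosted tree, and 4.50 for random forest, so random forest benefits most from the fused data.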
The findings are compared with related studies; Table 3 lists the results of various methods, which use
a range of dataset sizes and consider various types of attacks [20]. The proposed feature-level fusion models
showed increased accuracy when compared with the state-of-the-art research work. DNN models [21], [22] could
be used to improve the results further. This research examined how the inner join data fusion operation
affects various classifiers. While prior studies have examined the impact of fusion using only a few
classifiers, they did not specifically address its influence on every classifier used, and they reported
subpar results for the chosen classifiers. In this study, however, all classifiers considered for
experimentation yielded improved accuracy (Tables 1 and 2). The proposed model did not focus on execution
time; instead, it concentrated on finding anomalies efficiently.
Table 3. Comparing the results of the proposed model with related studies
Reference Algorithms Accuracy
[23] Hidden naïve Bayes 88.2 - 94.6
[24] C4.5, DT 79.5
[25] J48, SVM, CFS 70-99.8
[26] Naïve Bayes 79
[27] RF algorithm 70-86
[28] K-means 81.6
[29] K-NN 94
[29] Naïve Bayes 89
[30] EM 78
Proposed feature-level fusion model Tree ensemble 96.78
Proposed feature-level fusion model Gradient boosted tree 95.9
Proposed feature-level fusion model Random forest 97.3
5. CONCLUSION
As the number of internet users grows, new attacks are also launched. These attacks greatly affect the
effectiveness and security of the network as a whole. NIDS are employed to prevent these attacks. However,
false alerts are a major difficulty because of the volume and unreliability of the data. This research
proposes a feature-level data fusion approach for intrusion detection as a solution. The method relies on a
data fusion process that combines data from several sources in order to give more accurate and valuable data.
The data fusion is carried out with the relational-algebra inner join operation using the KNIME analytics tool.
Machine learning models are then built on this reliable and consistent data. For classification, the gradient
boosted tree, tree ensemble, and random forest methods are used. The simulation results indicate that the
feature-level data fusion approach increases the overall effectiveness of the IDS while reducing the number of
false alarms. These results were obtained by the proposed mapping of datasets using inner join data fusion.
The resource efficiency and time complexity of the proposed method can be improved in future work.
REFERENCES
[1] I. F. Kilincer, F. Ertam, and A. Sengur, “Machine learning methods for cyber security intrusion detection: datasets and
comparative study,” Computer Networks, vol. 188, 2021, doi: 10.1016/j.comnet.2021.107840.
[2] H. Kwon, Y. Kim, H. Yoon, and D. Choi, “Optimal cluster expansion-based intrusion tolerant system to prevent denial of service
attacks,” Applied Sciences, vol. 7, no. 11, pp. 1–14, 2017, doi: 10.3390/app7111186.
[3] J. Tian, W. Zhao, R. Du, and Z. Zhang, “A new data fusion model of intrusion detection-IDSFP,” in Third international conference on
Parallel and Distributed Processing and Applications, Berlin, Heidelberg: Springer, 2005, pp. 371–382, doi: 10.1007/11576235_40.
[4] L. Cao, J. Tian, and W. Jiang, “Information fusion technology and its application to fire automatic control system of intelligent
building,” in 2007 International Conference on Information Acquisition (ICIA), 2007, pp. 445–450, doi: 10.1109/ICIA.2007.4295775.
[5] L. Zhang, H. Leung, and K. Chan, “Information fusion based smart home control system and its application,” IEEE Transactions
on Consumer Electronics, vol. 54, no. 3, pp. 1157–1165, 2008, doi: 10.1109/TCE.2008.4637601.
[6] Y. Xiao and Z. Shi, “Application of multi-sensor data fusion technology in target recognition,” in 2011 3rd International
Conference on Advanced Computer Control, ICACC 2011, 2011, pp. 441–444, doi: 10.1109/ICACC.2011.6016449.
[7] H. Xiangdong and W. Xue, “Application of fuzzy data fusion in multi-sensor fire monitoring,” in 2012 International Symposium on
Instrumentation & Measurement, Sensor Network and Automation (IMSNA), 2012, pp. 157–159, doi: 10.1109/MSNA.2012.6324537.
[8] Y. Chen, C. Li, P. Ghamisi, X. Jia, and Y. Gu, “Deep fusion of remote sensing data for accurate classification,” IEEE Geoscience
and Remote Sensing Letters, vol. 14, no. 8, pp. 1253–1257, 2017, doi: 10.1109/LGRS.2017.2704625.
[9] Z. Yan, X. Jing, and W. Pedrycz, “Fusing and mining opinions for reputation generation,” Information Fusion, vol. 36,
pp. 172–184, 2017, doi: 10.1016/j.inffus.2016.11.011.
[10] J. Liu, Z. Yan, and L. T. Yang, “Fusion–an aide to data mining in internet of things,” Information Fusion, vol. 23, no. 2015, pp.
1–2, May 2015, doi: 10.1016/j.inffus.2014.08.001.
[11] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” in IEEE Symposium on
Computational Intelligence for Security and Defense Applications, CISDA, 2009, pp. 1–6, doi: 10.1109/CISDA.2009.5356528.
[12] N. Moustafa and J. Slay, “UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data
set),” in Military Communications and Information Systems Conference (MilCIS), 2015, pp. 1–6, doi: 10.1109/MilCIS.2015.7348942.
[13] A. Binbusayyis and T. Vaiyapuri, “Identifying and benchmarking key features for cyber intrusion detection: An ensemble
BIOGRAPHIES OF AUTHORS