A Multi-View Feature Fusion Approach For Effective Malware Classification Using Deep Learning
Keywords: Cybersecurity; Cybercrime; Malware analysis; Portable Executable; Multi-view; Feature fusion; Machine learning; Deep Learning; Convolutional Neural Network

The number of malware-infected machines across the world has been growing day by day. New malware variants appear in the wild to evade malware detection and classification systems and may infect machines with ransomware or crypto miners for the adversary's financial gain. The recent Colonial Pipeline ransomware attack is an example of these attacks that impacted daily human activities, and the victim had to pay a ransom to restore their operations. Windows-based systems are the most widely adopted systems across different industries for running applications, and they are therefore a prime target for malware installation. In this paper, we propose a Deep Learning (DL)-based Convolutional Neural Network (CNN) model to perform malware classification on Portable Executable (PE) binary files using a fusion feature set approach. We present an extensive performance evaluation of various DL model architectures and a Machine Learning (ML) classifier, i.e., Support Vector Machine (SVM), on multi-aspect feature sets covering static, dynamic, and image features to select the proposed CNN model. We further leverage the CNN-based architecture for effective classification of the malware using different combinations of feature sets and compare the results with the best-performing individual feature set. Our performance evaluation of the proposed model shows that it classifies malware and benign files with an accuracy of 97% when using fusion feature sets. The proposed model is robust and generalizable, and showed similar performance on two completely unseen malware datasets. In addition, the embedding features of the CNN model are visualized, and various visualization methods are employed to understand the characteristics of the datasets. Further, large-scale learning and stacked classifiers were employed after the penultimate layer to enhance the CNN classification performance.
∗ Corresponding author.
E-mail addresses: Raj.chaganti2@gmail.com (R. Chaganti), vravi@pmu.edu.sa (V. Ravi), tpham@pmu.edu.sa (T.D. Pham).
https://doi.org/10.1016/j.jisa.2022.103402

1. Introduction

The advancement in Information and Communication Technology (ICT) has made everyone connected virtually and reliant on computer systems. On the other side, an adversary may take advantage of a machine's weaknesses and compromise the computer with malware. Novel malware creation and hiding approaches are emerging to constantly evade malware detection and successfully install malware on victim machines. Recent trends in malware threats show that more than 8 billion malware attacks are performed per year [1]; 560,000 instances of new malware are created and detected every day, and 1 in every 4 machines is likely infected with malware in the US [2]. The work-from-home situation due to COVID-19 may pose an even greater risk of malware infection, as devices are exposed to the public internet with few security controls in a home network environment.

The well-known malware detection techniques in the literature include signature-based, heuristic-based, and sandboxing. Signature-based techniques are able to detect known malware in the wild with defined rules, but they have the disadvantage of difficulty in identifying new malware variants and require constant updates of the signature rules. Heuristic-based detection systems monitor anomalies in system, network, and user behavior to detect malware. They may detect new malware attacks and malware variants possessing at least some behavioral characteristics. However, these systems generate more false-positive alerts and require manpower to tune the alerts. Additionally, they may not be able to detect Advanced Persistent Threat (APT) and well-crafted malware. The sandboxing technique is applied to monitor the real-time behavior of malware for detection. Although we can determine the impact of the malware on a machine using sandboxing, an adversary may use sophisticated techniques to not execute the malicious executable files in a virtualized sandboxing environment and avoid detection. In recent years, next-generation anti-malware detection systems have leveraged ML and DL techniques to
identify advanced malware attacks. ML solutions require feature selection and large-scale data. In the context of malware detection, features that can be extracted from malware binaries include, but are not limited to, the PE header, PE section details, byte histogram, opcode n-grams, and API call sequences [3]. There are mainly two ways to extract malware features: static and dynamic malware analysis. Static features extract meaningful information with regard to the composition details of the file. PE sections, PE import functions, the PE header, and byte and opcode histograms are commonly used in static malware analysis [4]. But these features may overlook essential information related to sophisticated malware techniques like obfuscation, metamorphism, polymorphism, and oligomorphic code used to avoid detection. Dynamic malware analysis may capture behavioral features of the executable file for malware detection and classification. The most commonly used feature set in dynamic malware analysis is the API call sequence [5], because it captures the interaction of the binary with various system resources and also reveals the intention behind the malware's creation. Researchers have also utilized the combination of static and dynamic features, called hybrid analysis, to accurately detect threats and improve performance [6–8]. In recent trends, DL models have gained popularity in cybersecurity, and particularly in malware detection applications [9–11], because they have shown good performance, feature engineering steps can be avoided, and domain knowledge is not needed for data analysts. DL models can take the whole binary's raw bytes as input features to obtain the best results [12]. We have also witnessed that DL models may achieve better performance than ML techniques for malware classification [13,14]. In order to best utilize the DL capabilities seen in computer vision applications, image-based feature extraction is widely used as a third way of retrieving features from executable files. In image-based feature extraction, the malware executable files are transformed into a color or grayscale image, the image features are represented in input form, and DL models like CNN are then applied to achieve performance results [10,15,16]. These image-based malware detection and classification methods are platform independent and may detect packed malware. It is also evident that image-based features are used in conjunction with static features [17,18] to improve performance.

Most of the existing literature shows malware detection using a single feature set, either API calls [19,20] or other PE features, and a few works use the combination of two feature sets, as in hybrid malware analysis. However, there is no detailed prior-art investigation and analysis of malware detection with the various feature sets and with fusion feature sets utilizing multiple aspects of a malware file. In this work, we leverage the static features PE section details and PE import functions, the dynamic feature API call sequences, and binary image features as feature sets to perform malware detection and classification using various DL approaches, and we present the best performer as the proposed model for our approach after extensive performance analysis and validation. The major contributions of the proposed work are as follows:

• Propose a multi-view feature fusion-based feature selection approach for effective and robust malware detection and classification.
• Present a detailed analysis and investigation of various DL model architectures and ML techniques for malware classification.
• Performance evaluation of all the models on static, dynamic, and image feature sets, and the combinations of the fusion feature sets, are discussed.
• Describe the layer-wise performance of the proposed DL CNN model and feature visualization using t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction.
• Performance evaluation of penultimate layer features of the proposed model is shown with large-scale learning and meta-classifiers.
• Various experiments are included on two unseen malware datasets to show that the proposed method for malware classification is robust and generalizable across unseen malware data samples.

2. Literature survey

Malware detection has been a vivid area of research, and various approaches have been proposed for it. Detailed studies and analyses of malware detection, particularly Windows executable malware detection, are described in [3,21,22]. Gibert et al. in [3] performed a comprehensive survey of malware detection and classification using ML techniques and also discussed recent trends leveraging DL approaches to defend against malware attacks. The survey is categorized based on the PE feature types extracted from static or dynamic analysis and the various ML/DL techniques used to detect malware by utilizing those feature types. Abusitta et al. in [21] designed a framework for analyzing the existing malware classification and composition analysis and also presented a review of the articles describing the features and algorithms used in those articles. In [22], the authors classified malware detection approaches into signature, heuristics, behavior, model checking, DL, Internet of Things (IoT)-based, cloud computing-based, and mobile-based. Overall, all these survey papers emphasize that malware detection using the various techniques mentioned in the prior art still has challenges in accurately detecting malware, in particular in production environments, as the sophistication of malware creation by the adversary is ever-changing and adversaries always find new ways to evade the existing detection models.

2.1. Static features

Malware feature data collection is one of the primary and important tasks in the process of applying ML or DL models for malware detection and classification. The quality of the feature data helps to best train the models and obtain better performance in distinguishing benign and malicious executable files. There are a number of freely available tools to perform PE static file analysis and extract features like PE imported functions, PE headers, PE section details, byte n-grams, opcode n-grams, and strings [23]. Raff et al. in [12] presented a neural network solution that feeds the whole PE executable's raw bytes as an input sequence. The paper focused on addressing the challenges that arise when models learn from long sequences of raw bytes with over two million time steps to produce the malware classification output. Their proposed architecture "MalConv" used a CNN applied to long sequences. The results obtained showed that neural network architectures have the potential to be used for accurate malware classification, even though handling long sequences of data like executable static raw data remains challenging. Vinayakumar et al. in [13] proposed "DeepMalNet", a 10-hidden-layer Deep Neural Network (DNN) architecture to classify a given file as malware or benign using the PE static file information: header information, imported and exported functions, section information, and format-agnostic features such as the byte histogram, byte entropy histogram, and string information. The paper showed that a DNN with a considerable number of hidden layers performed well compared to the classical ML algorithms.

In [24], the authors described a low-resource and highly accurate DNN-based malware classification method. The static features byte histogram, entropy histogram, PE metadata features, and PE import features are extracted as 256-element feature vectors each from the executable file. Then, the four 256-element feature vectors are combined to form a 1,024-element feature vector as an input set. The proposed two-hidden-layer DNN model achieved low false positive rates and can be scaled to deploy in a cloud analytics platform. Azeez et al. [25] presented an ensemble learning based malware detection using PE static features. The first stage classifier consists of a stacked ensemble of fully connected CNNs. The final stage is tested with multiple machine learning algorithms for choosing the best-performing model. The obtained results showed that an ensemble of seven neural networks in the first stage and an ExtraTree classifier in the final stage achieved the best results for malware. Overall, these papers conclude that carefully selected static features along with applying DL models may result in fruitful malware classification performance results.
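As a concrete illustration of the kind of static features discussed above, the following is a minimal sketch of extracting a 256-bin byte histogram and PE section details from an executable. It assumes the third-party pefile and numpy packages and is not the exact feature pipeline of any of the surveyed papers.

```python
# Minimal sketch of static PE feature extraction (assumes: pip install pefile numpy).
# Illustrative only; not the exact feature pipeline of the surveyed papers.
import numpy as np
import pefile

def byte_histogram(path: str) -> np.ndarray:
    """256-bin histogram of raw byte values, normalized to sum to 1."""
    data = np.fromfile(path, dtype=np.uint8)
    hist = np.bincount(data, minlength=256).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

def section_features(path: str) -> list:
    """Name, virtual size, and raw size for each PE section header."""
    pe = pefile.PE(path, fast_load=True)
    return [(s.Name.rstrip(b"\x00").decode(errors="replace"),
             s.Misc_VirtualSize, s.SizeOfRawData) for s in pe.sections]

features = byte_histogram("sample.exe")    # 256-dimensional static feature vector
sections = section_features("sample.exe")  # e.g. [('.text', 61440, 61440), ...]
```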
Table 1
Prior art work comparison of feature selection and the proposed methods for malware classification. (Columns: Authors, PE Section, PE Import, PE API, PE Image, Feature info, Feature fusion, Method.)
But static features standalone may not detect sophisticated malware. For example, malware that contains encrypted code, and code that can only be decrypted during execution, is hard to detect using PE static features.

2.2. Dynamic features

The behavioral characteristics of an executable file can be useful for the accurate detection of malware executables, in particular when well-crafted malware functionality is hidden in files. So, it is crucial to capture the behavioral features while applying ML or DL models for malware classification. But the dynamic analysis of an executable requires it to be run in an isolated or virtual environment to extract the behavioral features. Li et al. in [26] presented an API call sequence-based malware classification using Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models. The API call sequences for the malware samples are generated using a Cuckoo sandbox environment. The performance results presented in the paper show that precision and recall for the malware family classification are 75%, which is not impressive.

In [9], the authors proposed a DL model using dynamic features as part of an intelligent malware detection approach. The dynamic features packets sent/received, number of processes running, bytes sent/received, and CPU and memory usage [28] were considered to perform the malware classification using DNN and CNN. The results concluded that the DL models performed better than ML models and that CNN can more accurately classify the malware than DNN. Zhang et al. in [27] implemented a feature engineering method to extract meaningful information from the API call sequences, and then applied two Gated CNNs and a BiLSTM model for malware classification. Hashing tricks were used on API name categories and arguments to create the dataset features. The model was evaluated using API call sequences as data features and compared with the key results. Our survey shows that the API call sequence [8,29,30] is widely used as a key dynamic feature for identifying malware using ML or DL solutions.

2.3. Image representation

In general, neural networks have been extensively used for Computer Vision and Natural Language Processing (NLP) applications. Researchers also explore effective malware detection solutions by converting the executable files into an image and then applying neural network models. Sun et al. in [11] presented a feature image generation approach and applied a CNN to classify the feature images. The static code is fed to a recurrent neural network (RNN) to generate predictive code, and a fusion of the predictive code from the RNN and the static code is then performed to generate feature images using minhash. These feature images are trained using a CNN to predict the malware feature images. The authors reported that the described model achieved 99.5% with a training-to-validation sample size proportion of 3:1. In [10], the authors applied Extreme Learning Machine (ELM) and CNN techniques to classify malware using a grayscale-represented image malware dataset. The reported results mention that ELM achieved accuracies comparable with CNN, despite ELM randomly initializing the weights between the input and hidden layers. Vasan et al. [15] proposed an ensemble CNN architecture for image-based malware classification. The authors employed the VGG16 and ResNet-50 network architectures and applied softmax as well as one-vs-all multiclass SVMs to ensemble the output and predict the malware classification. Overall, these image-based malware classification results show that DL models, in particular CNN architectures, are extensively used for malware classification and achieved good results. But a careful selection of architecture is required to design a less time-consuming, low-resource, and high-performing data model when selecting the image features of malware.

Table 1 presents the state-of-the-art papers focused on using static, dynamic, image, and hybrid features and their proposed methods, along with a comparison with the proposed feature fusion approach for Windows malware classification. These prior arts show that most of the works considered single-view feature sets, either static, dynamic, or image-based, to solve malware classification problems, and none of the existing state-of-the-art approaches have considered all four feature sets, i.e., PE Section, PE Import, PE API, and PE Image, for capturing the multiple facets of the malware for classification. The importance of considering all these four feature sets is evaluated and discussed in detail in this work.

The existing feature extraction techniques, static, dynamic, and image representation of an executable file, can each represent only certain characteristics of an executable file. It is very likely that malware detection will be evaded if one tries to detect advanced persistent threats using a single specific feature set.

2.4. Multiview feature fusion approach

Multiview feature based learning approaches have received attention for use in cybersecurity applications because they improve the model performance compared to single-view feature learning approaches [31,32]. A single-view feature learning may cover a specific view of the data samples and may not effectively learn all the essential data pattern information within the training datasets. In multi-view feature learning, each feature dataset view presents a unique semantic perspective of the data sample. The combination of multi-view features can learn the different semantic perspectives of the data sample and will improve the model accuracy. Some works explored the application of multi-view based feature learning for malware detection and classification [31,32]. Table 2 shows the comparison of the state-of-the-art multi-view feature based works for malware classification and advanced persistent threat family attribution [31–35]. The articles [31–33] applied the multi-view feature datasets to detect Android malware and for malware family classification. The opcodes, bytecode, header, API calls, and permissions feature datasets are commonly used to capture the multiple views of the Android malware and perform feature fusion to combine all the different view features. Then, machine learning or deep learning models are applied to detect and classify the Android malware. The articles [34,35] applied a multi-view feature learning approach to attribute APT malware to threat groups. The opcodes, bytecode, API calls, and header information are used to build the multi-view feature dataset, and machine learning or deep learning is applied for APT threat attribution.
3
R. Chaganti et al. Journal of Information Security and Applications 72 (2023) 103402
Table 2
Multiview malware detection or threat attribution prior art comparison.
Authors | Number of views | Features | Operating system | Feature fusion | ML/DL technique | Application
Appice et al. [31] | 10 | Static features | Android | Yes | Clustering, Random Forest | Malware detection
Millar et al. [32] | 4 | Opcodes, permissions, arbitrary API packages, proprietary Android API packages | Android | Yes | CNN | Malware detection
Darabian et al. [33] | 2 or 3 | OpCodes, ByteCodes, header information, permissions, attacker's intent and API calls | Windows, IoT and Android | Yes | KNN, RF, Adaboost, DT, MLP, SVM | Malware detection
Haddadpajouh et al. [34] | 12 | Opcode, Bytecode, SystemCall and Header | Generic | Yes | Clustering and Decision Tree | Cyber threat attribution
Sahoo [35] | 11 | Opcode, Bytecode, and Header features | Generic | Yes | SVM, Decision Tree, KNN, MLP, and Fair Clustering | Cyber threat attribution
Proposed | 4 | PE section, PE import, PE API, PE image | Windows | Yes | CNN | Malware detection
Table 3
1D-CNN and 2D-CNN based malware classification comparison.
Author | 1D CNN | 2D CNN | Technique | Application | Performance
Sun et al. [11] | – | Yes | Malware image features | Image-based malware family classification | 92–99.5% accuracy
Huang et al. [6] | – | Yes | Malware static and dynamic features as an image | Image-based malware binary classification | 94.7% accuracy
Millar et al. [32] | Yes | – | Multi-view feature fusion | Zero-day malware classification | 91% accuracy
Zhang et al. [27] | Yes | – | API features with Gated CNN | Malware binary classification | 98.71±0.17 AUC
Azeez et al. [25] | Yes | – | CNN as first stage classifier | Malware binary classification | 97.7% accuracy
Nisa et al. [7] | – | Yes | CNN based AlexNet and Inception-V3 | Malware family classification | 99.3% accuracy
Cui et al. [16] | – | Yes | Malware image features | Image-based malware family classification | 94.5% accuracy
Proposed | Yes | – | Multi-view feature fusion | Malware binary classification | 97.7% accuracy
However, the multi-view feature approach for Windows executable file malware detection has not been extensively used in the prior art. Furthermore, a multi-view study of the malware feature sets that includes static, dynamic, and image feature fusion is not seen in the state-of-the-art. We address Windows malware detection using multi-view feature datasets, where feature fusion and machine learning or deep learning methods are used to detect the malware accurately.

2.5. 1D CNN vs 2D CNN in cybersecurity applications

Convolutional Neural Networks (CNNs) are commonly used for image classification problems. The image is represented in two-dimensional form, and convolutional layers followed by pooling layers are applied to the images for image classification. In the context of malware classification, the malware file's sequence of bytes can be represented in image form. Some prior art works converted malware files to an image and applied a CNN to classify whether the given file is malware or not [6,10,11,15,36]. Table 3 compares the one-dimensional and two-dimensional CNN techniques used to perform malware binary classification and malware family classification. Some works leverage pretrained image classification models like AlexNet and Inception-V3 to perform the malware classification [7]. The static and dynamic features are usually represented as a one-dimensional feature vector, and a one-dimensional CNN can be used to capture the essential information from such a vector. One-dimensional CNNs were applied to extract the features and classify the malware in the prior art [25,27,32]. The state-of-the-art indicates that both one-dimensional and two-dimensional CNNs are used to classify malware; the selection of the CNN depends on the feature type and the dimensions of the malware file representation. In our work, the image representing a malware or benignware file sample is further processed into a one-dimensional image feature vector, so we use a one-dimensional CNN to process the one-dimensional feature fusion input and classify the malware (a minimal input-shape sketch follows at the end of this subsection).

Our work is motivated to some extent by the prior art papers [37–39]. But we focus on combining multiple aspects of an executable file for robust malware detection, and we also perform an extensive evaluation of the fusion feature sets and the individual feature sets using various DL models and SVM to unravel the advantages of the proposed multi-view feature fusion set approach.
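To make the 1D-vs-2D distinction concrete, the following minimal Keras sketch (an illustration, not this paper's code) shows how the same 1,024 values can be presented to a Conv1D layer as a (1024, 1) sequence versus to a Conv2D layer as a (32, 32, 1) grayscale image.

```python
# Minimal sketch contrasting 1D and 2D convolution inputs (illustrative only).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x = np.random.rand(8, 1024).astype("float32")  # 8 hypothetical 1,024-value samples

# 1D view: the 1,024 values as a length-1024 sequence with 1 channel.
conv1d = keras.Sequential([
    layers.Input(shape=(1024, 1)),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
])
print(conv1d(x.reshape(8, 1024, 1)).shape)  # (8, 511, 64)

# 2D view: the same values reshaped as a 32 x 32 grayscale image with 1 channel.
conv2d = keras.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
])
print(conv2d(x.reshape(8, 32, 32, 1)).shape)  # (8, 15, 15, 64)
```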
3. Proposed approach

The proposed DL-based multi-view feature fusion malware classification approach gathers the static features PE section details and PE imported API function calls, the dynamic feature API call sequences, and binary-transformed image features from malware samples; fuses all the selected features to incorporate multiple feature characteristics to distinguish malware executable files from benign executables; adapts the best performing DL-based model, CNN, on the individual feature datasets; and implements the proposed best performing CNN model to evaluate the performance on the feature fusion dataset as well as to perform a detailed comparative analysis alongside prominent ML and DL models used in malware detection and classification.

3.1. Proposed deep learning architecture

The optimal algorithm selection is pertinent for the accurate detection of malware executables. In order to estimate a feature engineering based supervised learning model, we have considered the best performing model of the past, SVM. As feature engineering tends to be time-consuming and tedious, DL models have been used extensively in malware categorization, and performance results were achieved as per expectations in the state-of-the-art. Hence, the well-known DL models used in malware detection and classification, taking different types of features as input, are selected for our evaluation. Those models are DNN, CNN, LSTM, and the combination of CNN and LSTM. Furthermore, to select the optimal hyperparameters for the DL models, different versions of the models are tested by changing the hidden layers between the input and output layers. As one of our objectives is to show the performance comparison of the single views of feature sets and the combined multi-view fusion feature set, we have selected the best performing model, which showed optimal performance on the majority of the individual feature sets and also showed comparable performance on the non-optimal result feature sets. On the basis of our extensive performance evaluation with a different set of DL models on each individual feature set, we opted for a suitable CNN model for our feature sets drawn from static analysis, dynamic analysis, and binary-to-image conversion methods. Consequently, the CNN model has been used for our feature fusion combination dataset evaluation to perform the malware classification.

The pseudocode for the CNN model used in our feature fusion approach is shown in Algorithm 1. Here, we assume that the four feature sets are already preprocessed to contain the same malware sample features in each feature set; this preprocessing is not presented in the pseudocode. In each epoch 1 to e, the fusion feature samples are passed through the CNN model to classify whether each sample is malware or not.

Algorithm 1: CNN based feature fusion algorithm for malware detection
Input: PE section {s1, ..., sm}; PE import {i1, ..., in}; PE API {a1, ..., ap}; PE image {im1, ..., imq};
       fusion features FF1, ..., FFt with m + n + p + q = t; kernel filters k; dataset samples d;
       epochs e; dense units x; filter length l; pool size b
Output: Labels {Malware, Benign}
FF1, ..., FFt ← FeatureFusion({s1, ..., sm, i1, ..., in, a1, ..., ap, im1, ..., imq});
FS1, ..., FSt ← Preprocessing(FF1, ..., FFt);
for epoch ← 1 to e do
    for j ← 1 to d do
        C1 ← Convolution1D(k, l, 'relu', FS1j, ..., FStj);
        M2 ← MaxPooling1D(b, C1);
        F3 ← Flatten(M2);
        D4 ← Dense(x, F3);
        D5 ← Dropout(D4);
        O1 ← Sigmoid(D5);
    end
end

3.2. Description of the deep learning model for our approach

The optimal CNN architecture, obtained after careful evaluation of the performance results, is illustrated in Fig. 1 and processes the combined fusion features of the datasets. As represented in Fig. 1, the input layer comprises the two static feature datasets (PE section details and PE imported API functions), the dynamic feature dataset (PE API call sequences), and the PE binary transformed to an image feature dataset, taken together to cover the multiple views of executable files. These input datasets were clubbed together to obtain the fusion feature dataset having 2,128 unique features per sample. The order of the feature fusion also matters for training the CNN model and ensuring that the CNN model learns the features when training on the datasets, so a consistent feature fusion order should be used to train and test the models. Our motivation for this work is that the combined features can complement each other's advantages and achieve better performance results, which is described in more detail in the results discussion in Section 5. We have fed the fusion feature dataset as input to the CNN model that achieved the best results in order to test our hypothesis. The selection of the CNN model is based on the best performance achieved among all the models evaluated when the four individual feature datasets are applied as input separately. The input data pass through multiple layers of the CNN model to extract the meaningful information, and the detailed description is as follows.

Convolution Layer: The essential component of a CNN, the convolution filter, is applied to the input feature fusion dataset. As the final executable binary is represented as a 1-dimensional vector in our datasets, a 1-dimensional convolution layer is chosen in the model. 64 convolution filters were applied to the input with kernel size 3 to map the features. The ReLU activation function is opted for to add nonlinearity.

Pooling Layer: A pooling map of size two is applied to the assembled feature maps obtained from the convolution layer. The pooling layer helps to reduce the dimensionality, the amount of time needed to perform computations, and the number of learning parameters. Furthermore, we have applied a flatten function to reduce the output data of the pooling layer into 1-dimensional data.
Dense layer: After the pooling layer and flattening to 1-dimensional data, the data passes through the fully connected layer to perform malware detection. The fully connected layer contains 128 units with ReLU as the activation function. It combines all the features from the input to obtain optimal features. Then, the dropout layer randomly sets some values to 0 based on a chosen rate between 0 and 1; we have considered the value 0.5 to drop half of the fully connected layer's output data for redundancy. Subsequently, the sigmoid function is applied to obtain normalized probability values between 0 and 1 to classify whether the feature dataset sample is benign or malware.

Table 4
The four dataset malicious and benign file sample statistics.
Dataset | Raw data: Malware | Raw data: Benign | Random selection: Malware | Random selection: Benign
PE Section | 38,442 | 875 | 1,500 | 875
PE Import | 38,442 | 876 | 1,500 | 875
PE API | 42,797 | 1,079 | 1,500 | 875
PE Images | 38,843 | 875 | 1,500 | 875
4. Datasets

In this section, we describe the details of the datasets considered for our malware classification experiments and performance analysis, and also describe the process used to collect the datasets.

Four different feature datasets were selected for this malware classification study, and all these datasets were generated using PE malware and benign file samples. The malicious applications were downloaded from virustotal.com and the benign applications were collected from Windows 7 x86 directories and portableapps.com [40]. These datasets follow a naming convention that starts with "PE" and ends with a feature name extracted from the image, static, or dynamic analysis of the executable. For instance, the dataset "PE API" represents the API calls performed by the PE application when executed in a virtual environment. A detailed description of these four datasets is given below. We follow the same convention for representing datasets throughout this paper.

Dataset 1, "PE section", consists of the PE file section information. These static data features are collected by executing the malware samples in a Cuckoo sandbox environment and saving the PE sections report [40]. Four PE section header features are extracted from each binary sample. The collected data contain 38,442 unique PE malware samples and 875 benign samples.

Dataset 2, "PE import", comprises the top 1,000 imported functions extracted by running the PE in the Cuckoo sandbox environment and generating the static pe_import report [41]. The dataset contains 38,442 unique malicious file samples and 876 benign files.

Dataset 3, "PE API", was generated by running the malicious and benign applications in the Cuckoo sandbox virtual environment and saving the first 100 non-repeated consecutive API calls from the parent process using the "calls" feature in the Cuckoo sandbox [42]. This dataset comprises 42,797 malware API call sequences and 1,079 benign API call sequences.

Dataset 4, "PE image", contains the PE malware or benign files represented in image form. The nearest neighbor interpolation algorithm is applied to the PE application's raw byte stream to convert it into a 32 x 32 grayscale image [43]. The image is then transformed into a 1,024-byte vector. This dataset comprises 38,443 malicious file samples and 875 benign file samples.
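As an illustration of the "PE image" transformation just described, the following sketch (our interpretation of the described steps, not the dataset authors' code) reads a PE file's raw bytes, resizes them to a 32 x 32 grayscale image with nearest neighbor interpolation, and flattens the result into a 1,024-element vector. The intermediate row width of 256 is an assumption.

```python
# Sketch of the "PE image" transformation described above (our interpretation,
# not the dataset authors' exact code). Assumes: pip install numpy pillow
import numpy as np
from PIL import Image

def pe_to_image_vector(path: str, side: int = 32) -> np.ndarray:
    raw = np.fromfile(path, dtype=np.uint8)
    width = 256                                   # assumed fixed row width
    if len(raw) < width:                          # pad very small files
        raw = np.pad(raw, (0, width - len(raw)))
    height = len(raw) // width
    img = Image.fromarray(raw[: width * height].reshape(height, width))
    img = img.resize((side, side), resample=Image.NEAREST)  # nearest neighbor
    return np.asarray(img, dtype=np.uint8).reshape(-1)      # 1,024-element vector

vec = pe_to_image_vector("sample.exe")
print(vec.shape)  # (1024,)
```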
All four datasets can be compared and correlated using the file hash column: each malware file has a unique hash value. For instance, a malware sample's feature datasets PE section, PE import, PE API, and PE image are combined using the unique hash value, so that the same malware's static, dynamic, and image features are combined as the fusion features for that malware. The process is repeated for the benign PE files as well. So, each malware or benign file is represented by 2,128 fusion features. The feature fusion dataset also includes a column "Label", which is 0 for benign samples and 1 for malware samples. In this way, we make sure the four dataset features of each malware or benign sample file are merged correctly and the sample is labeled correctly as a malware or benign file.
Then, we randomly selected 1,500 malware samples from the pool of more than 38,000 malware samples to balance the malware and benign file proportions. The benign PE executable files were taken from portableapps.com and the 32-bit Windows 7 Ultimate directory; the number of publicly available benign PE files is limited. The malware samples were extracted from virustotal.com. So, the raw dataset is highly imbalanced, with around 38,000 malware and 875 benign PE files. We chose 1,500 malware samples for our experiments because the malware versus benign file proportions should be balanced and align with the computing resources available for our experiments, and the model should not overfit for malware classification. Our selection of 1,500 samples ensures that the training model is neither overfitting nor underfitting and will not have a drastic impact on the test malware classification performance. We also tested our model on unknown malware samples to validate our training model and evaluate the performance.

The detailed malware and benign file sample statistics of the four datasets are shown in Table 4.

The final datasets, having 2,375 unique file samples each, are randomly split into training and test dataset samples with 73% and 27% proportions to perform the ML and DL performance evaluation and comparative analysis; the total numbers of training and testing data samples are described in Table 5. The table also contains the total number of malware and benign samples in the training and test data for all four datasets.

5. Results and discussion

The experiments were performed using a software application stack of the ML library Scikit-learn (https://scikit-learn.org/stable/) and the DL API Keras (https://keras.io/) with TensorFlow (https://www.tensorflow.org/) as the backend in Python, and the software programs were run in a virtual machine with a 64-bit Ubuntu 20.04 LTS operating system, 4 GB of RAM, and an Intel Core i5-4210U CPU @ 1.70 GHz.

5.1. Evaluation metrics

The performance evaluation of the data analytics models is represented in the form of metrics. One of the columns in the datasets is labeled as "benign" or "malware" for malware classification. Table 6 shows the malware classification metrics in the form of a confusion matrix.

A true positive (TP) is the correct detection of malware when malware is present. A false positive (FP) is the incorrect classification of a sample as malware when no malware is present. A false negative (FN) is the incorrect classification of a sample as legitimate when malware is present. A true negative (TN) is the correct classification of a sample as legitimate when no malware is present.

Our work uses the following evaluation metrics for the comparative analysis of the ML and DL models.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision = TP / (TP + FP)    (2)
Table 5
The train and test dataset malware and benign sample statistics.
Dataset | Train: Malware | Train: Benign | Train: Total | Test: Malware | Test: Benign | Test: Total
PE Section | 1,100 | 634 | 1,734 | 400 | 242 | 642
PE Import | 1,100 | 634 | 1,734 | 400 | 242 | 642
PE API | 1,100 | 634 | 1,734 | 400 | 242 | 642
PE Images | 1,100 | 634 | 1,734 | 400 | 242 | 642
Total | 4,400 | 2,536 | 6,936 | 1,600 | 968 | 2,568
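The 73%/27% random split behind these counts can be reproduced with scikit-learn; a minimal sketch (the random seed is an assumption, and X, y come from the fusion sketch above):

```python
# Sketch of the 73%/27% random train/test split described in Section 4
# (random seed is an assumption; X, y come from the fusion dataset).
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.27, random_state=42)
print(len(X_train), len(X_test))   # roughly 1,734 and 642 per feature dataset
```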
Table 7
Hyperparameters used in the DL models for performance evaluation.
Parameters | DNN1 | DNN2 | DNN3 | DNN4 | CNN | CNN-LSTM | LSTM1 | LSTM2 | LSTM3 | LSTM4
Dense | 1024,1 | 1024,768,1 | 1024,768,512,1 | 1024,768,512,1 | 128,1 | 1 | 1 | 1 | 1 | 1
Dropout | 0.01 | 0.01,0.01 | 0.01,0.01,0.01,0.01 | 0.01,0.01,0.01,0.01 | 0.5 | 0.1 | 0.1 | 0.1,0.1 | 0.1,0.1,0.1 | 0.1,0.1,0.1,0.1
Activation | ReLU | ReLU,ReLU | ReLU,ReLU,ReLU | ReLU,ReLU,ReLU,ReLU | ReLU,ReLU | ReLU | – | – | – | –
Batch size | 64 | 64 | 64 | 64 | – | – | 32 | 32 | 32 | 32
Epochs | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200 | 200
Pool size | – | – | – | – | 2 | 2 | – | – | – | –
Kernel size | – | – | – | – | 3 | 3 | – | – | – | –
Number of filters | – | – | – | – | 64 | 64 | – | – | – | –
Fig. 2. Comparison of Accuracy on all models and four datasets.
Fig. 3. Comparison of the Macro Precision for all the models.
Fig. 4. Comparison of the Weighted Precision for all the models.
Fig. 5. Comparison of the Macro Recall for all the models.
feature datasets evaluation. We can see that the "PE import" dataset can identify more actual malware from the datasets than the other three sets for all the models considered in our evaluation. This could indicate that extracting the static importing function call features from the binary samples does not overlook the static data differences between malware and benign samples. Further, the "PE API" call sequence extracted from the datasets using dynamic analysis performed slightly closer to the "PE import" results. Based on these two feature sets' performance, we can construe that the interaction with the system core modules using API function calls in Windows is clearly an important aspect to be considered for distinguishing malware and benign files. It is also evident that the accurate malware sample detection on the "PE API" dataset improved as the number of hidden layers increased in the LSTM model. Additionally, the LSTM4 model showed closely comparable performance with the CNN model. Fig. 5 also shows that the least recall performance among the four feature sets is achieved by "PE image" and "PE section" in almost all the models except LSTM1.

In general, there may be a trade-off between the precision and recall metrics when we train the model. Depending on the business requirement of the application, either false positives or false negatives of the trained model can be neglected. For instance, in malware detection, even a single missed malware detection may have a catastrophic impact on the organization. So, we always make sure the false negative rate is as low as possible. On the other hand, false positives can be overlooked to a certain extent, in particular when the organization has the
workforce to analyze these alerts. However, too many false positives also cannot be tolerated and need further retraining of the models. F1-

Fig. 6. Comparison of the Weighted Recall for all the models.
Fig. 7. Comparison of the Macro F1-Score for all the models.
Fig. 8. Comparison of the Weighted F1-Score for all the models.

Table 8
Best performed model metrics (0: Benign and 1: Malware).
Dataset: Best model | Class | Precision | Recall | F1-score
PE Images: CNN-LSTM | 0 | 0.76 | 0.78 | 0.77
PE Images: CNN-LSTM | 1 | 0.87 | 0.85 | 0.86
PE API: CNN | 0 | 0.91 | 0.87 | 0.89
PE API: CNN | 1 | 0.92 | 0.94 | 0.93
PE Features: CNN | 0 | 0.91 | 0.92 | 0.92
PE Features: CNN | 1 | 0.95 | 0.94 | 0.95
PE Section: DNN4 | 0 | 0.78 | 0.73 | 0.75
PE Section: DNN4 | 1 | 0.84 | 0.87 | 0.86

Table 9
Classification report for the ImageApi (0: Benign and 1: Attack).
Class | Precision | Recall | F1-Score
0 | 0.85 | 0.86 | 0.85
1 | 0.91 | 0.91 | 0.91
Macro avg | 0.88 | 0.88 | 0.88
Weighted avg | 0.89 | 0.89 | 0.89
Table 10
Classification report for the "ImportSection" (0: Benign and 1: Attack).
Class | Precision | Recall | F1-Score
0 | 0.91 | 0.94 | 0.92
1 | 0.96 | 0.95 | 0.95
Macro avg | 0.94 | 0.94 | 0.94
Weighted avg | 0.94 | 0.94 | 0.94

Table 11
Classification report for the ImportSectionApi (0: Benign and 1: Attack).
Class | Precision | Recall | F1-Score
0 | 0.95 | 0.94 | 0.94
1 | 0.96 | 0.97 | 0.97
Macro avg | 0.96 | 0.95 | 0.96
Weighted avg | 0.96 | 0.96 | 0.96

Table 12
Classification report for the ImageImportSection (0: Benign and 1: Malware).
Class | Precision | Recall | F1-Score
0 | 0.89 | 0.95 | 0.92
1 | 0.97 | 0.93 | 0.95
Macro avg | 0.93 | 0.94 | 0.93
Weighted avg | 0.94 | 0.93 | 0.93

Table 13
Classification report for the MergeAll (0: Benign and 1: Malware).
Class | Precision | Recall | F1-Score
0 | 0.93 | 0.93 | 0.93
1 | 0.96 | 0.96 | 0.96
Macro avg | 0.94 | 0.95 | 0.95
Weighted avg | 0.95 | 0.95 | 0.95
individual feature set "PE import", the performance of this fusion set drastically improved in comparison with the least performing individual feature set, "PE image". This performance improvement is due to combining the "PE API" feature set with "PE image". Overall, the macro and weighted average precision and recall for the feature combination "ImageAPI" achieved 89% and 89%, respectively, which is much better than the "PE image" metrics and roughly comparable to the "PE API" features.

Another two feature combinations, i.e., "PE import" and "PE section", are clubbed together, and applying this combination to the CNN model yields the performance results depicted in Table 10. The precision and recall for the dataset's malware and benign classes on the "ImportSection" feature set show that "ImportSection" yields slightly better results in comparison with the individual feature sets, either "PE section" or "PE import". The "ImportAPI" performed the same as the "PE API" dynamic feature in terms of accuracy. Furthermore, the macro average of the "ImageAPI" fusion set shows that "ImageAPI" slightly improved the performance in comparison with the "PE import" or "PE API" set. On the other side, the weighted average is the same for both the fusion set and the individual import and section feature sets. Overall, this result shows that the fusion feature slightly improved some performance metrics in comparison to the "PE import" or "PE section" features.

Table 11 illustrates the performance classification report results for the fusion set combination of the "PE import", "PE section", and "PE API" features. As shown in Table 11, the malware and benign class classification precision, recall, and F1-score for the three-feature fusion set "ImportSectionAPI" outperformed the two-feature combinations "ImportSection" and "ImageAPI" for both the macro and weighted average cases. This clearly supports our hypothesis that a multi-view of the features can enhance the performance and improve the accurate classification ability for malware classification. In addition, we can also see that the "ImportSectionAPI" feature set combination performance significantly improved compared to the best individual feature set "PE import", evaluated on the best performing CNN model among all the models considered for our experiments.

The combination of the "PE image", "PE import", and "PE section" feature sets is also considered as one of the fusion sets for our evaluation. Table 12 shows the performance classification report for the "ImageImportSection" fusion feature set. Even though the overall performance of "ImageImportSection" is not better than "ImportSectionAPI", the precision value for the malware class is slightly higher than for the three-feature fusion set "ImportSectionAPI". This shows that different views of feature combinations may yield unforeseen improved results. So, selecting the combination of the features is key to obtaining superior results. This selection may depend on the number of features, the dataset size, how the features are extracted, the expectations on the performance of the model, and the model used for testing.

We have also considered merging all the feature sets to evaluate the feature fusion approach for malware classification. Table 13 presents the classification report results for the "Mergeall" feature combination to classify benign and malware files using the CNN model. The results show that the "Mergeall" feature set achieved better performance than the best performing individual feature set, "PE import". These results support the previous claims that feature fusion sets can perform better than individual feature sets as input features for DL-based malware classification. It is also evident that "Mergeall" comfortably performed better than our evaluation's two-set feature fusion combinations. However, one of the three-set combinations, "ImportSectionAPI", performed slightly better than "Mergeall". We believe that the addition of the feature set "PE image", which was seen to perform almost the worst in the individual feature set evaluations, to the "ImportSectionAPI" features may incur the misclassification of a few samples, and hence the overall performance of "Mergeall" is not better than the feature fusion set "ImportSectionAPI".

Fig. 9. Accuracy/loss characteristics for the individual four feature sets.

Fig. 9 illustrates the training accuracy and loss performances for the four individual feature sets when the epoch varies from 1 to 200 using the CNN model. As shown in Fig. 9, the "PE API" feature set achieved a training accuracy of almost 99% when the epoch reached 70, followed by swinging between 98% and 100% until the epoch reached 100, and then settled down to 100%. The "PE image" feature set achieved 96% accuracy when the epoch was around 65 and then fluctuated between 97% and 98% until the 200 epochs were completed. The "PE import" feature set accuracy quickly converges to 95% at around epoch 15 and then steadily maintained the same accuracy until all the epochs completed. In contrast to those three feature sets, the "PE section" feature set training accuracy is poor and remains well below 80% by the end of all epochs. The few features in "PE section" have made it difficult to achieve better training accuracy. The loss curves for all these four feature sets follow a downward
trend in accordance with the accuracy achieved by those feature sets. Overall, we can conclude that the "PE API" and "PE import" feature sets are trained with good accuracy, and this can be an indication that these feature sets perform well on the test dataset. However, the performance of the proposed models on "PE image" can be enhanced by including CNN-based pretrained models [15]. In the future, we are planning to create a big malware dataset of "PE image" and employ a CNN-based pretrained model to achieve optimal performance. Since the number of data samples is small, the CNN-based pretrained models are not explored in this study.

Fig. 10. Accuracy/loss characteristics for the four fusion feature sets.

Fig. 10 depicts the training accuracy and loss curves for the two- and three-feature-set fusion combinations when the CNN model is applied to the training datasets. We can see that the three-feature fusion sets "ImageFeatureSection" and "FeatureSectionAPI" converged faster to achieve better training accuracy than the two-feature fusion sets "ImageAPI" and "FeatureSection". For instance, the three-feature fusion sets achieved 100% when the epoch reached 60, and then "ImageFeatureSection" steadily maintained it afterward until the epoch reached 200. At the same time, "FeatureSectionAPI" had some downward spikes until epoch 125 before closing at 100% at epoch 200. But the two-feature sets' accuracy never crosses 96% until epoch 75 and then settles down to maintain steady accuracies. Overall, these results support our earlier discussion of the three-feature fusion sets achieving better performance than the two-feature fusion sets.

Fig. 11. Accuracy/loss characteristics for the best performed feature set.

Fig. 11 shows the "Mergeall" feature fusion set's training accuracy/loss characteristics when processed through the CNN model. The training accuracy reached 100% within the first five epochs and finally settled down around epoch 25 to provide 100% accuracy. Overall, based on Figs. 9–11, the fusion feature sets can achieve better training accuracy, with the best accuracy performance achieved when more of the feature sets, each representing a particular view of the malware or benign sample, are combined. Overall, we can conclude that the "Mergeall" case is the best performing feature fusion set in our performance evaluation.

The complexity of the DL models can be compared using the model parameters generated during the model training. Fig. 12 presents the comparison of the complexity and the performance of the different DL models used for the evaluation of the four individual feature sets and the five feature fusion sets in our approach. The DL models DNN, LSTM, CNN, and CNN-LSTM were considered for our extensive performance evaluation of the feature sets. Only the DNN4 and LSTM4 cases are shown for the DNN and LSTM models to keep the figure clear and to ease the comparison of the different models.

Fig. 12. Comparison of number of model parameters and accuracy of the DL models.

We plotted the ROC characteristics for the selected CNN and classical SVM models by inputting the "Mergeall" fusion feature set. As shown in Fig. 13, the DL CNN model and the classical ML model achieved comparable performance, and interestingly, both models obtained an Area Under the Curve (AUC) value of 0.98. Thus, it emphasizes that the combination of feature sets taken from different aspects of the dataset samples enables the classical SVM model to perform comparably to the proposed CNN model. We also note that the extent to which the fusion feature datasets can impact the performance of other ML models is out of scope and can be considered as future work.

5.2. t-SNE feature visualization

t-SNE is helpful when we want to visualize high dimensional data in two or three dimensions. In general, DL models act as a black box; we have limited visibility into the learning ability of the hidden layers and limited visualization of the results for a better understanding of the DL model performance. We have obtained the penultimate layer features of the proposed CNN model in our feature fusion approach. These features are given as input to t-SNE for the two-dimensional representation, and the obtained t-SNE plot is shown in Fig. 14. Fig. 14 indicates that most of the malware and benign sample data form different clusters, with slight overlapping of the malware data with the benign feature data visible in the plot. Overall, our model's clear distinction between the attack and benign feature data helps to achieve better performance, although the model did not achieve 100% accuracy.
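A minimal sketch of this visualization step, extracting the penultimate dense-layer activations from the (assumed) Keras model sketched earlier and projecting them with scikit-learn's t-SNE:

```python
# Sketch of the t-SNE visualization of penultimate-layer features
# (assumes the Keras `model` and the X_test/y_test split from earlier sketches;
# the layer index is an assumption tied to that sketch's layer order).
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow import keras

# Model that outputs the 128-unit dense layer (the penultimate layer).
feature_extractor = keras.Model(inputs=model.inputs,
                                outputs=model.layers[-3].output)

feats = feature_extractor.predict(X_test.reshape(-1, 2128, 1))
emb = TSNE(n_components=2, random_state=0).fit_transform(feats)

plt.scatter(emb[:, 0], emb[:, 1], c=y_test, cmap="coolwarm", s=8)
plt.title("t-SNE of penultimate-layer features (0 = benign, 1 = malware)")
plt.show()
```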
5.3. Large-scale learning with single and meta-classifiers

Computation time is one of the constraints in processing large-scale datasets, so large-scale learning optimization algorithms are needed to obtain good results on them. We have leveraged the supervised learning algorithms Random Forest (RF) and SVM as the downstream layer with the proposed CNN model. Large-scale CNN with RF achieved slightly better, and comparable, results to CNN with the SVM algorithm, so the CNN with RF algorithm is presented here for comparison with the other cases. Table 14a illustrates the confusion matrix for the input feature fusion set "Mergeall" applied to the large-scale CNN with the RF algorithm. We can observe that the large-scale CNN with RF can classify 222 benign files correctly out of 242, and 384 malicious files out of 400. Even though there is no significant performance improvement compared to the standalone proposed CNN model, the CNN with RF achieved comparable results and did not show noticeable performance degradation.

Table 14b represents the confusion matrix results for the large-scale stacked classifier. We have considered the RF classifier and RBF
kernel SVM as the initial estimators and the linear regression as the
final estimator in our stacking classifier. From Table 14b, the stacked
classifier was able to correctly classify 385 files as malware, which is
slightly more than large-scale CNN with RF, CNN, and SVM results. On
the other side, the benign file classification is slightly lower than the
other cases used for comparison.
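A minimal scikit-learn sketch of this stacked downstream classifier, fed with the CNN's penultimate-layer features (reusing the `feature_extractor` from the t-SNE sketch above); using logistic regression as the final estimator is our assumption for a runnable classification analogue of the linear final stage described here.

```python
# Sketch of the stacked meta-classifier on CNN penultimate features
# (reuses `feature_extractor` and the data splits from earlier sketches;
# LogisticRegression as the final estimator is our assumption for a
# classification-compatible analogue of the paper's linear final stage).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

F_train = feature_extractor.predict(X_train.reshape(-1, 2128, 1))
F_test = feature_extractor.predict(X_test.reshape(-1, 2128, 1))

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(kernel="rbf", probability=True))],
    final_estimator=LogisticRegression(),
)
stack.fit(F_train, y_train)
print(stack.score(F_test, y_test))   # accuracy of the stacked classifier
```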
Tables 14c and 14d show the confusion matrix results for the feature fusion set "Mergeall" applied to the SVM and the proposed CNN model. The CNN model correctly classified 384 malware files as malware, whereas SVM correctly classified 378 files as malware. On the other side, CNN correctly classified 225 benign files, while SVM correctly
classified 222 benign files. It is clearly evident that the CNN model did
perform well compared to the SVM classifier, in particular, considering
the size of the malware dataset used. Overall, based on the results in
Table 14, we can infer that both CNN and the classifiers in large-scale
learning and stacked classifiers are disconnected. These two models can
be connected together by proposing a new loss function, and this type
of learning during training a model can enhance the performance of
the malware detection model [44]. This is one of the future directions
of the proposed work.

Fig. 13. ROC characteristics for the best performed model and classical SVM.
5.4. Generalization

With the aim of showing that the proposed method is generalizable, its performance is evaluated on two other unseen malware data samples, sample1 and sample2. Each dataset contains 3,000 randomly selected unseen malware data samples. Figs. 15 and 16 show the performance results of the proposed CNN model and the classic machine learning model SVM for malware classification on the two dataset samples. In Fig. 15, we can see that the proposed CNN model was able to correctly classify 2,996 malware samples out of the 3,000 samples in dataset sample1. On the other side, the SVM model could correctly classify only 2,214 malware samples as malware. The misclassification of 786 malware samples as benign by the SVM indicates unacceptable performance for malware classification. Overall, we can construe that the proposed model outperformed the SVM model using our feature fusion approach.
Fig. 14. t-SNE visualization of the best performed model for the best feature set.

The second dataset, sample2, is applied to the proposed model and SVM under similar experimental settings. As shown in Fig. 16, the proposed CNN model performed better than the SVM and was able to accurately classify 2,958 of the 3,000 malware samples into the malware category. We can observe that the SVM has generated 153 false negatives by misclassifying malware as benign files, whereas CNN generated 42 false negatives on the set of 3,000 samples. Our performance results on these two malware data sample sets show that the proposed CNN model performed significantly well in accurately classifying the malware, and that SVM performed slightly better when using the input sample2 dataset rather than the sample1 dataset. Furthermore, these malware classification results on the two malware data sample sets show that our proposed model is generalizable, robust, and able to classify unseen malware samples.
Table 14
The confusion matrix for the large-scale, meta-classifier and feature fusion best case (0:Benign, 1:Malware).
Fig. 15. SVM and CNN malware class classification on sample1 dataset. Fig. 16. SVM and CNN malware class classification sample2 dataset.
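An embedding view like Fig. 14 can be reproduced by projecting the CNN's penultimate-layer activations with t-SNE. The sketch below assumes scikit-learn and matplotlib; synthetic embeddings stand in for the real penultimate features, whose extraction is indicated in the comment.

```python
# Sketch of a t-SNE projection of penultimate-layer embeddings (cf. Fig. 14).
# Synthetic 128-d vectors stand in for the CNN's penultimate activations;
# with a trained Keras model they could be obtained via, e.g.,
#   feats = tf.keras.Model(model.input, model.layers[-2].output).predict(X)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
benign = rng.normal(loc=0.0, scale=1.0, size=(300, 128))
malware = rng.normal(loc=2.0, scale=1.0, size=(300, 128))
feats = np.vstack([benign, malware])
labels = np.array([0] * 300 + [1] * 300)  # 0: benign, 1: malware

proj = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(feats)
for cls, name in [(0, "benign"), (1, "malware")]:
    plt.scatter(proj[labels == cls, 0], proj[labels == cls, 1], s=8, label=name)
plt.legend()
plt.title("t-SNE of penultimate-layer embeddings")
plt.show()
```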
The second dataset, sample2, is applied to the proposed model and to the SVM under similar experimental settings. As shown in Fig. 16, the proposed CNN model performed better than the SVM, accurately classifying 2,958 of the 3,000 malware samples into the malware category. We can observe that the SVM generated 153 false negatives by misclassifying malware as benign files, whereas the CNN generated 42 false negatives on the set of 3,000 samples. Our performance results on these two malware sample sets show that the proposed CNN model classified the malware accurately, and that the SVM performed slightly better on the sample2 dataset than on the sample1 dataset. Furthermore, these classification results on the two malware sample sets show that our proposed model is generalizable, robust, and able to classify unseen malware samples.

5.5. Advantages and limitations of the proposed approach

Our proposed DL CNN-based multi-view feature fusion approach for malware classification achieved better performance than the best performing individual feature set applied to the best performing CNN model in our evaluation. The feature fusion set approach achieved an accuracy of 97% on the test dataset samples, whereas the best performing individual feature set obtained 94% accuracy. Although the accuracy difference between the two feature sets is small, it is a considerable improvement in malware classification, because a successful compromise by unidentified, stealthy malware can have a catastrophic effect on the impacted organization's business.

Any particular view of the executable binary file used to extract features can be bypassed and leveraged to evade DL- or ML-based malware detection. An adversary could use simple tricks to modify binary files to achieve detection evasion. For instance, an adversary can unpack the PE file using upx_unpack; move ‘‘.text’’ to ‘‘.xxx’’ with a valid entry point for code instructions; then create a new ‘‘.text’’ section and replace it with the ‘‘.text’’ from the legitimate executable ‘‘calc.exe’’. These simple structural changes in the PE file can evade malware detection without changing the file's malicious behavior. In [45], the authors mention that such small static changes can fool ML models into flipping the prediction from the malware class to the benign class. We believe that our feature fusion approach can defend against evasion attempts that exploit features drawn from a single view of PE files. Hence, we have considered a combination of static, dynamic, and image features from PE files to tackle such attacks.
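To illustrate how little effort such a structural edit takes, here is a hedged sketch using the third-party pefile library to rename the ‘‘.text’’ section. File names are placeholders, and this is not the exact tooling of [45], which uses the gym-malware action space.

```python
# Sketch: rename the ".text" section of a PE file to ".xxx" without touching
# the code bytes, using the third-party pefile library. A static classifier
# keying on section names can be misled by this edit alone, while the
# program's runtime behavior is unchanged.
import pefile

pe = pefile.PE("sample.exe")  # placeholder input path
for section in pe.sections:
    if section.Name.rstrip(b"\x00") == b".text":
        # PE section names occupy exactly 8 bytes, zero-padded.
        section.Name = b".xxx".ljust(8, b"\x00")
pe.write("sample_renamed.exe")  # placeholder output path
```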
However, our feature fusion approach requires a cautious selection of features from static, dynamic, or image analysis, based on a number of factors such as sample size, operation in a production or test environment, available hardware resources, and workforce domain knowledge. Additionally, feature extraction can be a tedious and time-consuming process; in particular, dynamic analysis can require a virtual environment isolated from the internet. It is therefore important to choose the minimal number of features that cover the different aspects of malware characteristics and behavior, to combat adversarial effects.

The proportion of malware and benignware file samples considered in our study does not represent real-world settings. In the real world, the number of available Windows benignware samples is much higher than the number of malware samples. We used benignware samples taken from the site portableapps.com and from the 32-bit Windows 7 Ultimate directory; the total number of benignware samples available from these two resources is minimal, so the proportion of malware samples is higher than that of benignware in our dataset. We plan to collect more Windows PE benignware samples, validate our multi-view feature fusion models, and perform a large-scale dataset validation that resembles a real-world production malware binary classification setup. One of our future works is to perform the experiments on imbalanced large-scale datasets with more benignware sample files and fewer malware sample files.

Our study is focused on malware binaries observed in Windows environments and is limited, precisely, to PE-format malware classification and analysis. The outcomes of our work may or may not be applicable to Unix-based ELF executables or IoT malware samples. Thus, we leave the validation of the feature fusion approach on Unix-based malware as one of our future works.

6. Conclusion and future works

This paper proposed a deep learning CNN model for effective malware classification using our feature fusion set approach. The proposed CNN model was selected by performing a comprehensive performance evaluation of the classic ML classifier SVM and of DNN, CNN, and LSTM model architectures under our feature fusion approach. Our comparative performance analysis showed that the proposed CNN model outperformed the other models on the majority of the individual feature datasets. The multi-view perspective of the executable binary file is considered to select the different fusion feature sets for further investigation. Our experimental evaluation shows that the fusion feature sets performed better than the individual feature sets. In the majority of our evaluation cases, increasing the number of features incorporated in the fusion feature set improved the performance of the same proposed deep learning model.

We are seeing that adversarial attacks on ML or DL models can have a significant impact on malware classification performance [46]. One of our future works is to validate the effectiveness of the combined feature sets in defending against adversarial attacks. We also plan to investigate the best suited multi-view feature sets that require less effort to extract and support effective classification of sophisticated malware.

CRediT authorship contribution statement

Rajasekhar Chaganti: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing, Validation. Vinayakumar Ravi: Conceptualization, Methodology, Software, Writing – original draft, Writing – review & editing, Validation. Tuan D. Pham: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The authors do not have permission to share data.

References

[1] Johnson J. Annual number of malware attacks worldwide from 2015 to 2020. Statista; 2021, https://www.statista.com/statistics/873097/malware-attacks-per-year-worldwide/.
[2] Jovanović B. A not-so-common cold: Malware statistics in 2021. Dataprot; 2021, https://dataprot.net/statistics/malware-statistics/.
[3] Gibert D, Mateu C, Planes J. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. J Netw Comput Appl 2020;153:102526.
[4] Kolosnjaji B, Eraisha G, Webster G, Zarras A, Eckert C. Empowering convolutional networks for malware classification and analysis. In: 2017 International joint conference on neural networks. IEEE; 2017, p. 3838–45.
[5] Amer E, El-Sappagh S, Hu JW. Contextual identification of windows malware through semantic interpretation of API call sequence. Appl Sci 2020;10(21):7673.
[6] Huang X, Ma L, Yang W, Zhong Y. A method for windows malware detection based on deep learning. J Signal Process Syst 2021;93(2):265–73. http://dx.doi.org/10.1007/s11265-020-01588-1.
[7] Nisa M, Shah JH, Kanwal S, Raza M, Khan MA, Damaševičius R, Blažauskas T. Hybrid malware classification method using segmentation-based fractal texture analysis and deep convolution neural network features. Appl Sci 2020;10(14):4966.
[8] Choi S, Bae J, Lee C, Kim Y, Kim J. Attention-based automated feature extraction for malware analysis. Sensors 2020;20(10):2893.
[9] Vinayakumar R, Alazab M, Soman KP, Poornachandran P, Venkatraman S. Robust intelligent malware detection using deep learning. IEEE Access 2019;7:46717–38.
[10] Jain M, Andreopoulos W, Stamp M. Convolutional neural networks and extreme learning machines for malware classification. J Comput Virol Hacking Tech 2020;16(3):229–44.
[11] Sun G, Qian Q. Deep learning and visualization for identifying malware families. IEEE Trans Dependable Secure Comput 2018.
[12] Raff E, Barker J, Sylvester J, Brandon R, Catanzaro B, Nicholas CK. Malware detection by eating a whole exe. In: Workshops at the thirty-second AAAI conference on artificial intelligence. 2018.
[13] Vinayakumar R, Soman KP. DeepMalNet: evaluating shallow and deep networks for static PE malware detection. ICT Express 2018;4(4):255–8.
[14] Venkatraman S, Alazab M, Vinayakumar R. A hybrid deep learning image-based analysis for effective malware detection. J Inf Secur Appl 2019;47:377–89.
[15] Vasan D, Alazab M, Wassan S, Safaei B, Zheng Q. Image-based malware classification using ensemble of CNN architectures (IMCEC). Comput Secur 2020;92:101748.
[16] Cui Z, Xue F, Cai X, Cao Y, Wang GG, Chen J. Detection of malicious code variants based on deep learning. IEEE Trans Ind Inf 2018;14(7):3187–96.
[17] Ahmadi M, Ulyanov D, Semenov S, Trofimov M, Giacinto G. Novel feature extraction, selection and fusion for effective malware family classification. In: Proceedings of the 6th ACM conference on data and application security and privacy. 2016, p. 183–94.
[18] Ni S, Qian Q, Zhang R. Malware identification using visualization images and deep learning. Comput Secur 2018;77:871–85. http://dx.doi.org/10.1016/j.cose.2018.04.005.
[19] Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. Cham: Springer; 2016, p. 137–49.
[20] Catak FO, Yazı AF, Elezaj O, Ahmed J. Deep learning based sequential model for malware analysis using windows exe API calls. PeerJ Comput Sci 2020;6:e285.
[21] Abusitta A, Li MQ, Fung BC. Malware classification and composition analysis: A survey of recent developments. J Inf Secur Appl 2021;59:102828.
[22] Aslan ÖA, Samet R. A comprehensive review on malware detection approaches. IEEE Access 2020;8:6249–71.
[23] Schultz MG. Feature extraction. 2001, https://www.fsl.cs.stonybrook.edu/docs/binaryeval/node4.html. [Accessed 20 June 2021].
[24] Saxe J, Berlin K. Deep neural network based malware detection using two dimensional binary program features. In: 2015 10th International conference on malicious and unwanted software. IEEE; 2015, p. 11–20.
[25] Azeez NA, Odufuwa OE, Misra S, Oluranti J, Damaševičius R. Windows PE malware detection using ensemble learning. In: Informatics, vol. 8, no. 1. Multidisciplinary Digital Publishing Institute; 2021, p. 10.
[26] Li C, Zheng J. API call-based malware classification using recurrent neural networks. J Cyber Secur Mobil 2021;617–40.
[27] Zhang Z, Qi P, Wang W. Dynamic malware analysis with feature engineering and feature learning. In: Proceedings of the AAAI conference on artificial intelligence, vol. 34, (01). 2020, p. 1210–7.
[28] Burnap P, French R, Turner F, Jones K. Malware classification using self organising feature maps and machine activity data. Comput Secur 2018;73:399–410.
[29] Huang W, Stokes JW. MtNet: a multi-task neural network for dynamic malware classification. In: International conference on detection of intrusions and malware, and vulnerability assessment. Cham: Springer; 2016, p. 399–418.
[30] Rhode M, Burnap P, Jones K. Early-stage malware prediction using recurrent neural networks. Comput Secur 2018;77:578–94.
[31] Appice A, Andresini G, Malerba D. Clustering-aided multi-view classification: a case study on android malware detection. J Intell Inf Syst 2020;55(1):1–26.
[32] Millar S, et al. Multi-view deep learning for zero-day android malware detection. J Inf Secur Appl 2021;58:102718.
[33] Darabian H, et al. A multiview learning method for malware threat hunting: windows, IoT and android as case studies. World Wide Web 2020;23(2):1241–60.
[34] Haddadpajouh H, Azmoodeh A, Dehghantanha A, Parizi RM. MVFCC: A multi-view fuzzy consensus clustering model for malware threat attribution. IEEE Access 2020;8:139188–98.
[35] Sahoo D. Cyber threat attribution with multi-view heuristic analysis. In: Handbook of big data analytics and forensics. Cham: Springer; 2022, p. 53–73.
[36] Chaganti R, Ravi V, Pham TD. Deep learning based cross architecture internet of things malware detection and classification. Comput Secur 2022;102779.
[37] Kyadige A, Rudd EM, Berlin K. Learning from context: A multi-view deep learning architecture for malware detection. In: 2020 IEEE security and privacy workshops. IEEE; 2020, p. 1–7.
[38] Shi W, Zhou X, Pang J, Liang G, Gu H. A new multitasking malware classification model based on feature fusion. In: 2018 2nd IEEE advanced information management, communicates, electronic and automation control conference. IEEE; 2018, p. 2376–81.
[39] Bai J, Wang J. Improving malware detection using multi-view ensemble learning. Secur Commun Netw 2016;9(17):4227–41.
[40] Oliveira A. Malware analysis datasets: PE section headers. IEEE Dataport; 2019, http://dx.doi.org/10.21227/2czh-es14.
[41] Oliveira A. Malware analysis datasets: Top-1000 PE imports. IEEE Dataport; 2019, http://dx.doi.org/10.21227/004e-v304.
[42] Oliveira A. Malware analysis datasets: API call sequences. IEEE Dataport; 2019, http://dx.doi.org/10.21227/tqqm-aq14.
[43] Oliveira A. Malware analysis datasets: Raw PE as image. IEEE Dataport; 2019, http://dx.doi.org/10.21227/8brp-j220.
[44] Huang FJ, LeCun Y. Large-scale learning with SVM and convolutional nets for generic object categorization. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition, vol. 1. 2006, p. 284–91. http://dx.doi.org/10.1109/CVPR.2006.164.
[45] Anderson H, Kharkar A, Filar B, Roth P. Bot vs. bot: Evading machine learning malware detection. BlackHat; 2017, https://github.com/EndgameInc/gym-malware.
[46] Kolosnjaji B, Demontis A, Biggio B, Maiorca D, Giacinto G, Eckert C, Roli F. Adversarial malware binaries: Evading deep learning for malware detection in executables. In: 2018 26th European signal processing conference. IEEE; 2018, p. 533–7.