Malware Detection Using Machine Learning and Deep Learning
Abstract. Research shows that over the last decade, malware has grown
exponentially, causing substantial financial losses to various organizations.
Anti-malware companies have been proposing solutions to defend against
attacks from this malware. The velocity, volume, and complexity of malware
pose new challenges to the anti-malware community. Current state-of-the-art
research shows that researchers and anti-virus organizations have recently
started applying machine learning and deep learning methods for malware
analysis and detection. We have used opcode frequency as a feature vector and
applied unsupervised learning in addition to supervised learning for malware
classification. The focus of this tutorial is to present our work on detecting
malware with (1) various machine learning algorithms and (2) deep learning
models. Our results show that the Random Forest outperforms the Deep Neural
Network with opcode frequency as a feature. Also, in feature reduction, Deep
Auto-Encoders are overkill for the dataset, and an elementary function such as
Variance Threshold performs better than the others. In addition to the
proposed methodologies, we will also discuss additional issues and unique
challenges in the domain, open research problems, limitations, and future
directions.
1 Introduction
In the digital age, malware has impacted a large number of computing devices.
The term malware comes from malicious software, which is designed to meet
the harmful intent of a malicious attacker. Malware can compromise
computers/smart devices, steal confidential information, penetrate networks,
and cripple critical infrastructures. These programs include viruses, worms,
trojans, spyware, bots, rootkits, ransomware, etc. According to Computer
Economics1, financial losses due to malware attacks quadrupled from $3.3
billion in 1997 to $13.3 billion in 2006. Every few years the definition of
Year of Mega Breach has to be recalibrated based on attacks performed in that
particular
1 https://www.computereconomics.com/article.cfm?id=1225.
© Springer Nature Switzerland AG 2018
A. Mondal et al. (Eds.): BDA 2018, LNCS 11297, pp. 402–411, 2018.
https://doi.org/10.1007/978-3-030-04780-1_28
2 https://www.cbsnews.com/news/wannacry-ransomware-attacks-wannacry-virus-losses.
3 https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016.
404 H. Rathore et al.
2 Experimental Setup
We formulate malware analysis and detection as a binary classification
problem in which malware and benign are the two classes. Figure 1 shows the
proposed approach, a multi-step process whose phases perform the following
tasks: collection of the dataset, disassembly of executable files, feature
extraction, dimension reduction, building classification models, and empirical
analysis of the results based on different metrics. We discuss each of these
phases in the following subsections.
2.1 Dataset
To conduct our experiments, we gathered malware and benign executables from
different sources. We downloaded malware samples from an open source
repository known as the Malicia Project4. In the Malicia Project, Nappa et
al. [9] collected 11,688 malware samples on the Windows platform belonging to
a total of 55 different malware families. The data collection was performed
over a span of 11 months (07/03/2012 to 25/03/2013) from more than 500
drive-by download servers, also known as exploit servers. Typically these
servers are deployed for a lifetime of 16 h, while some servers operated for
months to spread the malware files. Many malware executables in the dataset
connect to the internet without user consent to perform some cybercrime
operation. Most of the malicious executables also repack themselves on average
5.4 times a day to evade signature-based antivirus detection. Thus opcode
frequency as a feature can be an excellent measure to detect this malware.
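As a minimal sketch of what an opcode-frequency feature vector could look like, assume the disassembler output for one executable is available as a plain list of opcode mnemonics. The vocabulary, the sample mnemonics, and the function name below are illustrative assumptions, not the pipeline used in this work.

```python
from collections import Counter

def opcode_frequency_vector(opcodes, vocabulary):
    """Count how often each vocabulary opcode occurs in one executable."""
    counts = Counter(op.lower() for op in opcodes)
    # Opcodes that never appear in the disassembly get a frequency of zero.
    return [counts.get(op, 0) for op in vocabulary]

# Hypothetical vocabulary and disassembly, for illustration only.
VOCABULARY = ["mov", "push", "pop", "call", "jmp", "xor"]
sample = ["push", "mov", "mov", "call", "xor", "MOV", "pop"]

vector = opcode_frequency_vector(sample, VOCABULARY)
print(dict(zip(VOCABULARY, vector)))
```

Every executable maps to a vector of the same length, so the vectors can be stacked into a feature matrix regardless of file size.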
To collect benign executable samples for our dataset, we gathered default
files installed on different Windows operating systems. VirusTotal5 is an
anti-virus aggregator that can be used to check whether an executable is
malicious or benign. We declare a sample non-malicious (benign) when all the
anti-virus engines on virustotal.com declare it harmless. We combine the
malware and benign executable files obtained from these sources (Malicia and
Windows) and use them as our experimental dataset. Thus the dataset contains
11,688 malware and 2,819 benign executable files.
4 http://malicia-project.com/.
5 https://www.virustotal.com/.
Fig. 1. Flowchart for the classification of malware with different sets of features.
(Source: Sewak et al. [12])
Since there is a significant difference between the number of malware (11,308)
and benign executables (2,819) in our dataset, it leads to a class imbalance
problem. Various methods are available to address class imbalance, such as
random sampling (oversampling/undersampling), cluster-based sampling, and
ADASYN [5]. We used the Adaptive Synthetic sampling approach for imbalanced
learning (ADASYN), which is an oversampling method for minority class tuples.
It synthetically generates data points of the minority class based on the
k-nearest neighbor algorithm.
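The idea behind ADASYN can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the toy two-dimensional points, the function names, and the defaults `k=3` and `beta=1.0` are all hypothetical. Minority points whose neighbourhoods contain more majority points receive a larger share of the synthetic samples, and each synthetic sample is interpolated between a minority point and one of its minority neighbours.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adasyn_sketch(minority, majority, k=3, beta=1.0, seed=0):
    """Return synthetic minority points, denser where the majority class dominates."""
    rng = random.Random(seed)
    # G: how many synthetic points are needed for the requested balance level.
    total = int((len(majority) - len(minority)) * beta)
    labelled = [(x, 1) for x in minority] + [(x, 0) for x in majority]

    # r_i: fraction of majority points among the k nearest neighbours of x_i.
    ratios = []
    for x in minority:
        others = [p for p in labelled if p[0] is not x]
        neighbours = sorted(others, key=lambda p: euclidean(x, p[0]))[:k]
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)
    norm = sum(ratios) or 1.0  # all-zero ratios mean nothing is generated

    synthetic = []
    for x, r in zip(minority, ratios):
        count = round(r / norm * total)  # g_i: this point's share of G
        mates = sorted((m for m in minority if m is not x),
                       key=lambda m: euclidean(x, m))[:k]
        for _ in range(count):
            mate = rng.choice(mates)
            lam = rng.random()  # position along the segment from x to mate
            synthetic.append([a + lam * (b - a) for a, b in zip(x, mate)])
    return synthetic

# Toy data: 4 minority points against 12 majority points.
minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0]]
majority = [[0.9 + 0.1 * i, 0.8 + 0.1 * j] for i in range(4) for j in range(3)]
new_points = adasyn_sketch(minority, majority)
print(len(new_points), "synthetic minority points generated")
```

Because each point's share is rounded independently, the number of generated points can differ slightly from the target `total`.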
As discussed in Sect. 2.3, our dataset contains a large number of features
and executable files; thus we used cross-validation to estimate how well our
model generalizes to an independent dataset. We used 3-fold cross-validation
in all our experiments. In cross-validation (a.k.a. rotation estimation), the
data is split into three equal parts, where two blocks are used to train the
model and the remaining block is used for testing. This exercise is repeated
three times to cover all possible combinations.
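The rotation described above can be illustrated with a short index-splitting sketch. The function name and the sample count are hypothetical; in practice the samples would normally be shuffled (and possibly stratified by class) before splitting, which is omitted here for brevity.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most one.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

for train, test in k_fold_indices(9, k=3):
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test block and in the training set of the other two rotations.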
When we further analyzed our feature set, we found that for a few opcodes the
corresponding frequency is zero, since those opcodes are deprecated. Also, for
a few opcodes the count was relatively low because they are platform specific
and the platform is deprecated. A model created on a dataset suffering from
the curse of dimensionality takes longer to train and is inefficient in
space complexity as well. To choose an optimal number of features, we use
different variants of dimensionality reduction methods.
1. None: In this method all the opcodes are taken into account for building a
classification model without using any feature reduction. We use this as a
baseline for the different feature reduction methods.
2. Variance Threshold: This method removes features with low variance. We
removed the attributes with a variance of less than 0.1, assuming they have
less predictive power.
3. Auto-Encoders: In deep learning, auto-encoders are unsupervised learning
methods which require only the feature vector (opcode frequency), and not
class labels, for dimensionality reduction.
(a) A single-layer auto-encoder (Non-Deep Auto-Encoder), also referred to
as AE-1L, which contains one encoder layer and one decoder layer.
(b) A 3-layer stacked auto-encoder (Deep Auto-Encoder), also referred to as
AE-3L, which contains three encoders followed by three decoders.
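The Variance Threshold step in item 2 can be sketched as follows. This is an illustrative stand-in rather than the code used in the experiments: the toy feature matrix and the function name are assumptions, and whether the opcode frequencies were scaled before thresholding is not stated in the text.

```python
from statistics import pvariance

def variance_threshold(rows, threshold=0.1):
    """Keep only the columns whose variance is at least `threshold`."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) >= threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, keep

# Toy feature matrix: column 0 is constant, column 1 barely varies.
rows = [
    [0.5, 0.00, 0.1],
    [0.5, 0.01, 0.9],
    [0.5, 0.00, 0.2],
    [0.5, 0.02, 0.8],
]
reduced, keep = variance_threshold(rows, threshold=0.1)
print("kept column indices:", keep)
```

Unlike the auto-encoders, this filter needs no training and keeps the surviving features in their original, interpretable form.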
For our experiments, all the auto-encoders use the Exponential Linear Unit
(ELU) activation function at all layers except the last, which uses a linear
activation function. In AE-1L, the input connects directly to the bottleneck
layer, which in turn links to the output layer. In both auto-encoder models
(AE-1L and AE-3L), the bottleneck layer consists of 32 ELU nodes. Thus the
architecture of AE-1L is (Input-32-Output), where the bottleneck layer behaves
as both encoder and decoder. In AE-3L, the encoder has two additional hidden
layers connected in sequential order, containing 128 and 64 nodes
respectively. Similarly, the AE-3L decoder comprises two hidden layers of the
same widths but connected in reverse order. Thus the architecture of AE-3L is
(Input-128-64-32-64-128-Output). For training both auto-encoders (AE-1L and
AE-3L), the mean squared error is used as the loss function with a batch size
of 64 samples. Instead of standard stochastic gradient descent, we used the
Adam optimizer [6] and trained for 120 epochs. Figure 2 shows the training
and validation loss for AE-1L during a complete cycle. The plot shows the mean
squared error loss (y-axis) for training and validation, which converge around
120 epochs (x-axis).
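To make the two layouts concrete, the following sketch builds the layer widths of AE-1L and AE-3L and runs one untrained forward pass with ELU activations and a linear output layer. It shows only the shape of the networks and the reconstruction loss; training with Adam is omitted. The input width of 200, the weight initialisation, and all function names are assumptions made for illustration.

```python
import math
import random

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def layer_sizes(input_dim, hidden):
    """Full list of layer widths: input, hidden layers, reconstructed output."""
    return [input_dim, *hidden, input_dim]

def init_weights(sizes, seed=0):
    rng = random.Random(seed)
    # One (weights, biases) pair per connection between consecutive layers.
    return [([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x, params):
    for depth, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        # ELU everywhere except the final (output) layer, which stays linear.
        x = z if depth == len(params) - 1 else [elu(v) for v in z]
    return x

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

INPUT_DIM = 200  # hypothetical number of opcode features
AE_1L = layer_sizes(INPUT_DIM, [32])
AE_3L = layer_sizes(INPUT_DIM, [128, 64, 32, 64, 128])

rng = random.Random(1)
x = [rng.random() for _ in range(INPUT_DIM)]
reconstruction = forward(x, init_weights(AE_3L))
print("AE-1L:", AE_1L)
print("AE-3L:", AE_3L)
print("untrained reconstruction MSE:", mse(x, reconstruction))
```

After training, the 32 activations of the bottleneck layer would serve as the reduced feature vector passed to the classifier.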
In this paper, we used both machine learning and deep learning based approaches
to build the classification models. Based on learning methods we divided our
work into two case studies: (1) model based on the Random Forest (RF). In
the previous studies [10,14] conducted on the Malicia dataset [9], we found that
Fig. 2. Plot for AE-1L shows mean squared error loss (y-axis) for training
and validation across 120 epochs (x-axis) (Source: Sewak et al. [13])
Fig. 3. Plot for DNN-2L shows cross entropy loss (y-axis) for training and
validation across 120 epochs (x-axis) (Source: Sewak et al. [13])
4 Results
In this section, we discuss the experimental results obtained after feature
reduction (refer to Sect. 2.3) with classification models (refer to Sect. 3)
using various evaluation metrics (accuracy, recall, selectivity, and
precision).
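The four metrics follow directly from the confusion matrix. The sketch below assumes malware is the positive class; the counts and the function name are made up purely for illustration and are not results from this work.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the four metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),        # true positive rate (TPR)
        "selectivity": tn / (tn + fp),   # true negative rate (TNR)
        "precision": tp / (tp + fp),
    }

# Hypothetical counts: 90 malware caught, 10 missed, 95 benign passed, 5 flagged.
metrics = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Recall and selectivity correspond to the TPR and TNR values reported in the results.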
Table 1 shows that, among the feature reduction methods, VT-based attribute
reduction (combined with RF) achieved the highest accuracy of 99.78%, which is
marginally higher than the 99.74% obtained with no reduction of the feature
set (None and RF). AE-1L performed better than the deeper auto-encoder (AE-3L)
and obtained its highest accuracy (99.41%) with RF. AE-3L-based reduction
performed worst among all the methods. The highest True Positive Rate (TPR) of
99.59% was achieved by VT (with RF), followed by None, and the highest True
Negative Rate (TNR) of 100% was achieved with no feature reduction (None and
RF).
Table 1 also shows that, among the classification models, RF outperformed
the deep learning models and achieved the highest accuracy of 99.78% (RF and
VT). RF also produced the second-highest accuracy with no feature reduction.
Among the deep learning models, DNN-3L and DNN-7L, both combined with AE-1L,
attained an accuracy of 98.99%. The highest TPR and TNR were produced by RF
with VT and None as feature reduction, respectively.
5 Conclusion
In the last few years malware has become a significant threat. Classical
defense mechanisms used by anti-virus products (like signature-based malware
detection) fail to cope with new-age malware challenges. In this paper, we
have modeled malware analysis and detection as a machine learning and deep
learning problem. We have used best practices in building these models (like
cross-validation, fixing the class imbalance problem, etc.). We handled the
curse of dimensionality by using various feature reduction methods (None, VT,
AE-1L, and AE-3L). Finally, we compared the models built using RF and DNN
(DNN-2L, DNN-4L, and DNN-7L).
Based on our results, random forest outperforms all three deep neural network
models in malware detection. We achieved the highest accuracy of 99.78% with
random forest and variance threshold, which is an improvement of 1.26% over
the best previously reported accuracy. Also, in feature reduction, variance
threshold outperformed auto-encoders in improving model performance. Another
significant contribution of our investigation is a comparison of different
combinations of auto-encoders (of depth 1 and 3) and deep neural networks (of
depth 2, 4, and 7) for malware detection. To our surprise, the best result did
not come from any of the deep learning models, which indicates that deep
learning may be overkill for the Malicia dataset and that the trained models
are moving towards overfitting.
The same models can be used to detect more complex malware (polymorphic
and metamorphic) in the future. Further, it will be interesting to see the
effectiveness of other deep learning techniques, such as recurrent neural
networks and long short-term memory networks, for malware detection.
References
1. What is zeus? (2011). https://www.sophos.com/en-us/medialibrary/pdfs/
technical%20papers/sophos%20what%20is%20zeus%20tp.pdf
2. David, O.E., Netanyahu, N.S.: Deepsign: deep learning for automatic malware
signature generation and classification. In: 2015 International Joint Conference on
Neural Networks (IJCNN), pp. 1–8. IEEE (2015)
3. Firdausi, I., Erwin, A., Nugroho, A.S., et al.: Analysis of machine learning
techniques used in behavior-based malware detection. In: 2010 Second
International Conference on Advances in Computing, Control and
Telecommunication Technologies (ACT), pp. 201–203. IEEE (2010)
4. Hardy, W., Chen, L., Hou, S., Ye, Y., Li, X.: Dl4md: a deep learning framework
for intelligent malware detection. In: Proceedings of the International Conference
on Data Mining (DMIN), p. 61. The Steering Committee of The World Congress
in Computer Science, Computer Engineering and Applied Computing (2016)
5. He, H., Bai, Y., Garcia, E.A., Li, S.: Adasyn: adaptive synthetic sampling
approach for imbalanced learning. In: IEEE International Joint Conference on
Neural Networks IJCNN 2008. (IEEE World Congress on Computational
Intelligence), pp. 1322–1328. IEEE (2008)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
7. Masud, M.M., et al.: Cloud-based malware detection for evolving data streams.
ACM Trans. Manage. Inf. Syst. (TMIS) 2(3), 16 (2011)
8. Moskovitch, R., et al.: Unknown malcode detection using OPCODE representation.
In: Ortiz-Arroyo, D., Larsen, H.L., Zeng, D.D., Hicks, D., Wagner, G. (eds.) EuroIsI
2008. LNCS, vol. 5376, pp. 204–215. Springer, Heidelberg (2008).
https://doi.org/10.1007/978-3-540-89900-6_21
9. Nappa, A., Rafique, M.Z., Caballero, J.: Driving in the cloud: an analysis of
drive-by download operations and abuse reporting. In: Rieck, K., Stewin, P.,
Seifert, J.-P. (eds.) DIMVA 2013. LNCS, vol. 7967, pp. 1–20. Springer,
Heidelberg (2013). https://doi.org/10.1007/978-3-642-39235-1_1
10. Sahay, S.K., Sharma, A.: Grouping the executables to detect malwares with high
accuracy. Procedia Comput. Sci. 78, 667–674 (2016)
11. Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as
representation of executables for data-mining-based unknown malware detection.
IET Inf. Sci. 231, 64–82 (2013)
12. Sewak, M., Sahay, S.K., Rathore, H.: Comparison of deep learning and the classical
machine learning algorithm for the malware detection. In: 2018 19th IEEE/ACIS
International Conference on Software Engineering, Artificial Intelligence,
Networking and Parallel/Distributed Computing (SNPD), pp. 293–296. IEEE (2018)
13. Sewak, M., Sahay, S.K., Rathore, H.: An investigation of a deep learning based
malware detection system. In: Proceedings of the 13th International Conference
on Availability, Reliability and Security, p. 26. ACM (2018)
14. Sharma, A., Sahay, S.K.: An effective approach for classification of advanced
malware with high accuracy. arXiv preprint arXiv:1606.06897 (2016)
15. Ye, Y., Li, T., Adjeroh, D., Iyengar, S.S.: A survey on malware detection using
data mining techniques. ACM Comput. Surv. (CSUR) 50(3), 41 (2017)
16. Ye, Y., Li, T., Chen, Y., Jiang, Q.: Automatic malware categorization using cluster
ensemble. In: Proceedings of the 16th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 95–104. ACM (2010)
17. Ye, Y., Wang, D., Li, T., Ye, D.: IMDS: intelligent malware detection system. In:
Proceedings of the 13th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 1043–1047. ACM (2007)
18. Ye, Y., Wang, D., Li, T., Ye, D., Jiang, Q.: An intelligent pe-malware detection
system based on association mining. J. Comput. Virol. 4(4), 323–334 (2008)
19. Yousefi-Azar, M., Varadharajan, V., Hamey, L., Tupakula, U.: Autoencoder-based
feature learning for cyber security applications. In: 2017 International Joint
Conference on Neural Networks (IJCNN), pp. 3854–3861. IEEE (2017)
20. Zak, R., Raff, E., Nicholas, C.: What can n-grams learn for malware detection? In:
2017 12th International Conference on Malicious and Unwanted Software
(MALWARE), pp. 109–118. IEEE (2017)