Classification of Malware Detection Using Machine Learning Algorithms A Survey
Abstract: Malware is growing day by day and has become a major threat to Internet security. There are several methods for
classifying new malware based on existing signatures or code. Traditional approaches are not very effective against newly arriving malware
samples. Many antivirus programs provide defense mechanisms against malware, but zero-day attacks still go undetected. To strengthen these
mechanisms, machine learning algorithms are used, and they give good experimental results, whereas traditional signature-based approaches
fail to keep up with new malware. In this paper, we define malware and give an overview of its types, describe how effective and efficient
machine learning algorithms are for malware detection and classification, present existing work on malware detection and classification
using machine learning algorithms, and discuss the main challenges faced in malware detection and classification.
Index Terms: Malware, Malware Analysis, Static Analysis, Dynamic Analysis, Classification, Machine learning, Data mining Techniques, Malicious Code.
1796
IJSTR©2020
www.ijstr.org
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME 9, ISSUE 02, FEBRUARY 2020 ISSN 2277-8616
(unwanted signatures). Their framework uses both remote and local analysis techniques for malware detection. A file is checked against signatures to determine whether it is malicious or benign. In the remote analysis, various antivirus programs are used to analyze malicious executables and APIs. In the local analysis, anti-virtual-machine, anti-debugger, URL, string, and packing analyses are used.
Ronen et al. 2018 [26] describe a standard benchmark dataset released for the Microsoft Malware Classification Challenge, which was hosted as a Kaggle competition. The dataset contains 9 malware families and about 0.5 terabytes of data, with more than 20,000 malware samples in byte code. It is widely used by malware researchers and has been cited by nearly 50 research papers.
Ye et al. 2017 [27] present a survey of malware detection using data mining techniques, focused on intelligent malware detection approaches. They identify feature extraction and classification/clustering as the two important stages of malware analysis and detection. They review research from 2011 to 2016 and discuss the issues and challenges of malware detection using data mining approaches.
Wang et al. 2017 [28] describe the design and implementation of a sandbox, a feature extractor, and a classifier. Their work has three main stages: collector, extractor, and classifier. The collector contains a static analysis program and a dynamic execution module, PinFWSandbox, which records dynamic information and log file information and passes it to the extractor stage. The extractor performs static feature extraction, dynamic instruction feature extraction, and system call feature extraction. Finally, the classifier combines the results of the individual classifier models (the single model classifier, the system call classifier, the dynamic immediate classifier, and the dynamic opcode classifier) to improve the result, giving an F1 score close to 96%.
Pai et al. 2017 [29] present malware classification using clustering algorithms. Static features are extracted from opcode sequences along with their scores. These features are used for malware classification with the clustering algorithms K-means, Expectation-Maximization, and Hidden Markov Models. Among these, Expectation-Maximization gives the best accuracy. This approach differs from the others in that it applies machine learning clustering algorithms to malware classification.
Gupta et al. 2016 [30] present a framework based on Windows API call sequences. Using API call sequences and fuzzy-hashing-based classification, 2000 malware samples are classified into five classes: Worm, Trojan-Downloader, Trojan-Spy, Trojan-Dropper, and Backdoor.
Liu et al. [31] provide an approach to evaluate and classify unknown malware instances by clustering them with respect to their families. A shared nearest neighbor clustering algorithm is used to classify the malware instances. In the experiments, 98.9% accuracy is obtained for known malware instances, and 86.7% of new, unknown malware instances are correctly classified.
Makandar et al. 2017 [32] give an overview of malware analysis and detection techniques for different types of malware families. Different methods are used to detect and analyze malware; one is to visualize malware as an image, such as a grayscale image. They also give an overview of existing work on the visualization of malware families.
Nari et al. 2013 [33] propose automated malware classification based on the network activity, or behavior, of malware. A behavior tree is generated from the network traces (pcap file), features are extracted from it, and classification is performed using different machine learning algorithms. Finally, the J48 classifier is found to give better accuracy in comparisons with different antivirus programs.
Kosmidis et al. 2017 [34] provide an automated framework to classify unknown malware samples using neural networks. Feature engineering is used to extract features from the malimg dataset. Perceptron, decision tree, nearest centroid, stochastic gradient, multilayer perceptron, and random forest algorithms are used to classify the unknown malware. Random Forest gives the best average accuracy; testing time is also considered as a parameter.
Gandotra et al. 2014 [35] provide an integrated framework that classifies unknown malware using an integrated feature set of static and dynamic features to obtain a better classification of malware samples. The extracted feature dataset is evaluated and classified using four classifiers: Multilayer Perceptron, IB1, Decision Tree, and Random Forest. In the experimental results, Random Forest gives the highest accuracy, 99.58%, in detecting and classifying unknown malware.
Islam et al. 2013 [36] introduced a similar framework earlier than Gandotra et al. [35]; it also integrates static and dynamic features and uses classifiers to classify unknown malware.
Gandotra et al. 2014 [37] survey malware analysis, focusing on classification and the use of machine learning in malware analysis.
Makandar et al. 2015 [38] propose a classification of malware families using artificial neural networks. The malware binaries are converted to grayscale images and the images are resized. Sub-band filtering is applied to the resized image to extract features, from which a feature vector is formed. The Gabor wavelet and GIST descriptor are used to extract texture features. Feed-forward back-propagation neural networks are applied to classify the extracted features, and the experiments give 96.35% accuracy for unknown malware classification.
Kruczkowski et al. 2014 [39] use the Support Vector Machine (SVM) to classify malware samples with three validation techniques: cross-validation, leave-one-out, and Random Sampling (RS). Among the three, Random Sampling gives the best accuracy, 94.98%.
Nataraj et al. 2011 [40] present a novel approach to classifying malware samples. Malware binaries are converted to a vector of 8-bit values, which is then converted to a grayscale image. Image texture and feature vectors are used for classification. The Gabor wavelet and GIST descriptor are used in feature extraction. In the experimental results, 98% accuracy is obtained.
Tian et al. 2009 [41] present an efficient malware classification method using the printable strings found in malware executables. Five classifiers are applied to the extracted features: Naïve Bayes, Support Vector Machine, IB1, Random Forest, and Decision Tree. The performance of all classifiers is improved using the AdaBoostM1 meta-classifier. In the experiments run in WEKA, Random Forest and IB1 give the best overall accuracy, 97%.
Khammas et al. 2015 [42] classify malware using the n-gram technique. Principal Component Analysis (PCA) is used for feature selection from the dataset to obtain better real-time results. Neural Network (NN), Decision Tree (J48), Support Vector Machine (SVM), and Naive Bayes (NB) classifiers are used for classification. In the experimental results, SVM gives 97% accuracy.
Devesa et al. 2010 [43] present automatic detection
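The binary-to-image step described for Nataraj et al. [40] and Makandar et al. [38] can be illustrated with a minimal sketch. It assumes each byte of the file becomes one pixel intensity (0 to 255) and that the image width is chosen by the caller; the function name `bytes_to_grayscale`, the zero padding of the last row, and the five-byte sample are hypothetical choices for illustration, not details taken from the cited papers.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int) -> List[List[int]]:
    """Reshape raw bytes into rows of 8-bit pixel intensities (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        # Iterating over bytes yields integers in the range 0..255.
        row = list(data[start:start + width])
        # Assumption: pad the final row with zeros (black) so all rows are equal.
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


# Hypothetical 5-byte "binary"; a real sample would be read with open(path, "rb").
image = bytes_to_grayscale(bytes([0, 255, 16, 32, 48]), width=2)
print(image)  # [[0, 255], [16, 32], [48, 0]]
```

Texture descriptors such as Gabor filters or GIST would then be computed on this pixel grid; that stage is not shown here.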
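The n-gram technique mentioned for Khammas et al. [42] can be sketched in the same spirit. The survey does not say whether the n-grams are taken over bytes or opcodes, so this sketch assumes byte n-grams; the helper names `byte_ngrams` and `ngram_vector`, the fixed vocabulary, and the sample bytes are hypothetical. Each file is mapped to a vector of relative n-gram frequencies, which is the kind of input that PCA and the classifiers listed above could then consume.

```python
from collections import Counter
from typing import List, Sequence


def byte_ngrams(data: bytes, n: int) -> Counter:
    """Count every overlapping window of n consecutive bytes."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_vector(data: bytes, n: int, vocabulary: Sequence[bytes]) -> List[float]:
    """Relative frequency of each vocabulary n-gram within data."""
    counts = byte_ngrams(data, n)
    total = sum(counts.values())
    if total == 0:
        # Input shorter than n: no windows, so every frequency is zero.
        return [0.0] * len(vocabulary)
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical sample: four overlapping 2-grams, two of which are 0x55 0x8B.
sample = bytes([0x55, 0x8B, 0xEC, 0x55, 0x8B])
print(ngram_vector(sample, 2, [b"\x55\x8b", b"\x90\x90"]))  # [0.5, 0.0]
```

In practice the vocabulary would be built from the most frequent n-grams in a training set, since the number of possible byte n-grams grows as 256 to the power n.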
Table 1 (continued)
... x Vector Machine, PCA, MG TF-IDF; ...e Information Industry; ... of classifier with feature selection and feature extraction methods gives better performance and accuracy.
[45] Tools: WEKA. Algorithms: LMT, Support Vector Machine, Ridor, KNN, Naive Bayes, K-means. Dataset source: Online sources. Results: The LMT classifier gives a high accuracy of 98.28%; for clustering malware samples, k-means gives better results. Future work: Expand to large datasets and test a combination of more classifiers to get better performance.
[46] Tools: ---. Algorithms: Naive Bayes, RIPPER, MNB. Dataset source: FTP sites at Columbia University. Results: Multi-Naive Bayes gives a high accuracy of 97.76% in classifying malware. Future work: Use byte sequences to extend the work.

4 DISCUSSION
The existing literature on malware detection shows that classification can be done successfully with the help of machine learning techniques, but some issues remain unresolved. Zero-day attacks rely on new malware for which no signature yet exists, and detecting them is the main aim of malware researchers. According to the statistics [1] from AV-TEST, millions of new malware samples are still appearing day by day. One open issue is that real-world or manual verification of classification results becomes harder in practice. Another is to advance techniques for more active learning. Further advances are expected from the fields of machine learning, ensemble learning, deep learning, and others. More advanced techniques are needed to detect zero-day attacks. Dealing with large datasets is also an issue, and advanced techniques are needed for dimensionality reduction.

5 CONCLUSION
This paper presents a survey of the existing literature on malware analysis using different machine learning algorithms. Table 1 lists, for each existing work, the tools used, the machine learning algorithms used, the sources from which the dataset was collected, the parameters considered, the corresponding experimental results, and the proposed future work. The discussion shows that machine learning algorithms are very useful for the classification and clustering of malware samples, for both small datasets and large volumes of data.

6 REFERENCES
[1] AV-TEST. (2018, November 28). The Independent IT-Security Institute, Malware Statistics [Online]. Available: https://www.av-test.org/en/statistics/malware/
[2] IDAPro. (2018, November 28). [Online]. Available: https://www.hex-rays.com/products/ida/support/download_freeware.shtml
[3] OllyDbg. (2018, November 28). [Online]. Available: http://www.ollydbg.de/
[4] LordPE. (2018, November 28). [Online]. Available: http://www.woodmann.com/collaborative/tools/index.php/LordPE
[5] OllyDump. (2018, November 28). [Online]. Available: http://www.woodmann.com/collaborative/tools/index.php/OllyDump
[6] Willems, C., Holz, T. and Freiling, F. (2007) Toward Automated Dynamic Malware Analysis Using Cwsandbox. IEEE Security & Privacy, 5, 32-39. http://dx.doi.org/10.1109/MSP.2007.45
[7] Anubis. (2018, November 28). [Online]. Available: http://anubis.iseclab.org/
[8] Bayer, U., Kruegel, C. and Kirda, E. (2006) TTAnalyze: A Tool for Analyzing Malware. Proceedings of the 15th European Institute for Computer Antivirus Research Annual Conference.
[9] Norman Sandbox. (2018, November 28). [Online]. Available: http://sandbox.norman.no
[10] Dinaburg, A., Royal, P., Sharif, M. and Lee, W. (2008) Ether: Malware Analysis via Hardware Virtualization Extensions. Proceedings of the 15th ACM Conference on Computer and Communications Security, CCS’08, Alexandria, 27-31 October 2008, 51-62.
[11] ThreatExpert. (2018, November 28). [Online]. Available: http://www.threatexpert.com/submit.aspx
[12] Process Explorer. (2014). [Online]. Available: http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
[13] Process Monitor. (2014). [Online]. Available: http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
[14] Capture BAT. (2018, November 28). [Online]. Available: https://www.honeynet.org/node/315
[15] Regshot. (2018, November 28). [Online]. Available: http://sourceforge.net/projects/regshot/
[16] Wireshark. (2018, November 28). [Online]. Available: http://www.wireshark.org/
[17] Process Hacker. (2018, November 28). [Online]. Available: http://processhacker.sourceforge.net/
[18] Gupta, D., & Rani, R. (2018). Big Data Framework for Zero-Day Malware Detection. Cybernetics and Systems, 49(2), 103-121.
[19] Cho, I. K., Kim, T. G., Shim, Y. J., Ryu, M., & Im, E. G. (2016). Malware Analysis and Classification Using Sequence Alignments. Intelligent Automation & Soft Computing, 22(3), 371-377.
[20] Burnap, P., French, R., Turner, F., & Jones, K. (2018). Malware classification using self organising feature maps and machine activity data. Computers & Security, 73, 399-