Malicious Code Invariance Based On Deep Learning
Sameena S et al., International Journal of Information Technology Infrastructure , 10(3), May - June 2021, 1-7
converted the malicious executable file into grayscale images. Then they identified malware according to the texture features of these input images. Compared with the dynamic analysis method, their approach produced equivalent results. In similar work, Han et al. transformed malware binary information into color image matrices, and arranged malware families by using an image processing method.

Image Processing Techniques for Malware Detection [3]: Once malware has been visualized as grayscale images, malware detection can be converted into an image recognition problem. Nataraj et al. used a GIST algorithm for extracting the features of malware images. However, the GIST algorithm was time-consuming. Nowadays, more powerful image processing techniques have been proposed. Daniel et al. developed a bio-inspired parallel implementation for identifying the representative geometrical objects of the homology groups in a binary 2D image. For image fusion, Miao et al. presented an algorithm based on shearlet and genetic algorithm. In their model, a genetic algorithm is employed to optimize the weighted factors in the fusion rule. Experimental results demonstrated that their method could acquire better fusion quality than other methods. These traditional models, however, were challenged by the high time cost required for complex image texture feature extraction. To address this challenge, we employed deep learning to identify and classify images efficiently. In the next section, we present our research on malware detection based on deep learning.

Malware Detection based on Deep Learning [4]: Deep learning is an area of machine learning research that has emerged in recent years from the work on artificial neural networks. Neural networks can approximate complex functions by learning the deep nonlinear network structure to solve complex problems. Deep learning, which is more powerful than shallow back-propagation networks, uses a deep neural network to simulate the human brain's learning processes. Deep learning has the ability to learn the essential characteristics of data sets from a sample set. As a powerful tool of artificial intelligence, deep learning has been applied in many fields, such as recognition of handwritten numerals, speech recognition, and image recognition. Because of its powerful ability to learn features, many scholars have applied deep learning to malware detection. Using deep learning techniques, Yuan et al. created and implemented an online malware detection prototype system, named Droid-Sec.

Their model attained high accuracy by learning the features extracted from both static analysis and dynamic analysis of Android apps. David et al. presented a similar but more compelling method that did not need to know the type of malware behaviour. Their work was based on the deep belief network (DBN) for automatic malware signature generation and classification. Compared with conventional signature methods for malware detection, their model demonstrated increased accuracy for detecting new malware variants. However, these methods remained based on the analysis of features extracted by static analysis and dynamic analysis. Therefore, to a greater or lesser extent, they remained subject to the limitations of feature extraction. To address this problem, we employed a CNN to learn the malware image features and classify them automatically. Features of the images such as frequency, way of attack and others are used to compare the training set and the input image.

3. PROPOSED SYSTEM

In our proposed model, deep learning is used to detect the malware family. The model has 17 classes, of which 15 are malware families: 1. Agent, 2. Adopshel, 3. Allaple, 4. BrowseFox, 5. Dinwod, 6. Elex, 7. Expiro, 8. Fasong, 9. Hlux, 10. Injector, 11. Neshta, 12. Vilsel, 13. Regrun, 14. Stantiko, and 15. VBKrypt. The 16th class is the Non-Malicious class and the 17th is the Unknown class.

We used a CNN to learn the malware image features and classify them automatically. A CNN (convolutional neural network) consists of different layers that perform different functions. The CNN is trained on the different malware families in the dataset. In our approach, we use coloured images instead of grayscale images. The CNN automatically extracts the features of an input image, compares them with what it learned from the training dataset, and then predicts the class of the input image. This is how our model works.

The main advantage of our proposed system is the ability to generate data whenever a data imbalance exists. The data augmentation technique is used to resolve the imbalance.

Image Data Augmentation Technology. In deep learning, to avoid overfitting, we usually need sufficient data to train the model. If the data sample is small, we can use data augmentation to increase the sample, thereby restraining the influence of imbalanced data. An appropriate data augmentation method can avoid overfitting and improve the robustness of the model.

Generally, transformation of the original image data (changing the location of image pixels while ensuring that the features remain unchanged) is used to generate new data. There are many kinds of data augmentation techniques for images, for example, rotation/reflection, flip, zoom, shift, scale, contrast, noise, and color transformation.

A mechanism to convert .exe files to coloured images is used. The input to the system will be malware images. There will be a graphical user interface (GUI) for inputting the image data, and the image will be uploaded through this interface. The output of the system will be the predicted malicious family of the given image. The system predicts the malicious family based on the model file created during training.

The first layer of the CNN is the input layer, which brings the training images into the neural network. Next are the convolution and sub-sampling layers. The former can enhance signal characteristics and reduce noise. The latter can reduce the amount of data processing while retaining useful information. Then there are several fully connected layers that convert a 2-dimensional feature into a 1-dimensional feature that conforms to the classifier criteria. Finally, the classifier sorts the malware images into different families according to their characteristics.
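The data augmentation idea described above (relocating pixels to generate new samples for under-represented classes) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of transforms, and the oversampling strategy in `balance_classes` are assumptions made for illustration, and images are represented as plain nested lists so that the example needs no external library. Whether a given transform preserves family-specific features of a malware image would have to be checked experimentally.

```python
import random


def flip_horizontal(img):
    """Mirror every row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the row order (top-bottom flip)."""
    return [row[:] for row in img[::-1]]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def shift_right(img, k, fill=0):
    """Shift pixels k columns to the right, padding the left edge with `fill`."""
    width = len(img[0])
    k = max(0, min(k, width))
    return [[fill] * k + row[:width - k] for row in img]


# Hypothetical set of transforms; a real pipeline might also use zoom, noise, etc.
AUGMENTATIONS = [flip_horizontal, flip_vertical, rotate90,
                 lambda im: shift_right(im, 1)]


def balance_classes(dataset, target, seed=0):
    """Oversample each class with augmented copies until it holds `target` images.

    `dataset` maps a class name to a list of images (nested lists of pixels).
    Classes that already hold `target` or more images are left unchanged.
    """
    rng = random.Random(seed)
    balanced = {}
    for label, images in dataset.items():
        out = list(images)
        while images and len(out) < target:
            source = rng.choice(images)
            out.append(rng.choice(AUGMENTATIONS)(source))
        balanced[label] = out
    return balanced
```

Because the transforms only move list entries, the same functions work when each pixel is an (R, G, B) tuple, which matches the coloured images used in this work.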
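The 17-class labelling scheme of Section 3 can be made concrete with a small sketch that maps a vector of classifier scores to a class name. The use of a softmax over raw scores and the function names are assumptions for illustration; the paper does not state how the final layer of the trained model is defined.

```python
import math

# The 15 malware families listed in Section 3, in the order given there.
FAMILIES = [
    "Agent", "Adopshel", "Allaple", "BrowseFox", "Dinwod",
    "Elex", "Expiro", "Fasong", "Hlux", "Injector",
    "Neshta", "Vilsel", "Regrun", "Stantiko", "VBKrypt",
]
# Class 16 is the non-malicious class and class 17 is the unknown class.
CLASSES = FAMILIES + ["NonMalicious", "Unknown"]


def softmax(scores):
    """Convert raw scores to probabilities (numerically stable form)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_family(scores):
    """Return (class name, probability) for one vector of 17 raw scores."""
    if len(scores) != len(CLASSES):
        raise ValueError("expected one score per class")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]
```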
This layer can reduce the number of image parameters while preserving the main features, termed invariance, including translation invariance, rotation invariance, and scale invariance. This process can effectively avoid overfitting and improve the generalization ability of the model. The input is several maps, and the output will be the maps after dimension reduction. Each map will be a combination of convolution values of input maps that belong to the upper layer [1], and can be given by the following equation:

$$x_j^l = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l\Big) \qquad (1)$$

where M_j is the collection of input maps, k_{ij}^l is the convolution kernel used for the connection between the ith input feature map and the jth output feature map, b_j^l is the bias corresponding to the jth feature map, and f is the activation function. The sensitivity is:

$$\delta_j^l = \delta_j^{l+1} W_j^{l+1} \circ f'(u^l) = \beta_j^{l+1}\,\mathrm{up}(\delta_j^{l+1}) \circ f'(u^l) \qquad (2)$$

where layer l+1 is the sampling layer, the weight W represents a convolution kernel whose value is \beta_j^{l+1}, \circ denotes element-wise multiplication, f' is the derivative of the activation function evaluated at the layer's pre-activation input u^l, and up(.) is an up-sampling operation. The partial derivatives of the error cost function with respect to the bias b and the convolution kernel k can be given as:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^l)_{u,v} \qquad (3)$$

$$\frac{\partial E}{\partial k_{ij}^l} = \sum_{u,v} (\delta_j^l)_{u,v}\,(p_i^{l-1})_{u,v} \qquad (4)$$

where (p_i^{l-1})_{u,v} is the patch of x_i^{l-1} that was multiplied by the kernel k_{ij}^l during convolution.

The sub-sampling layer is also called the pooling layer. Generally, it reduces the dimension of the feature map, improves the model's accuracy, and avoids overfitting. In a CNN, for each output of the sampling layer, the feature map is given as follows:

$$x_j^l = f\big(\mathrm{down}(x_j^{l-1}) + b_j^l\big) \qquad (5)$$

where down(.) represents a sub-sampling function, and b is the bias. The sensitivity is calculated as shown:

$$\delta_j^l = \delta_j^{l+1} W_j^{l+1} \circ f'(u^l) \qquad (6)$$

4. ALGORITHM

There are two major algorithms. One is the custom algorithm, and the other is the Convolutional Neural Network.

The steps that we used are as follows. The first algorithm, the custom algorithm, is a set of binding dynamics that are generated on an individual campaign basis and designed to deliver outcomes that are aligned to a specific goal. The second is the CNN, a convolutional neural network that consists of different layers to carry out feature extraction, classification, and image recognition. By using this CNN we achieved higher accuracy. It is a deep neural network, which is a powerful tool for classification.

4.1 The Custom Algorithm Used for Converting exe File

Select the .exe file, then perform the following:
1. Convert the .exe file to binary format.
2. Convert the binary array to blocks.
3. Convert each block to a pixel value.
4. With the pixel values, create the image with the defined height and width.

4.2 CNN Classification Method

CNNs are highly effective for image classification. They rely on convolution, a special kind of linear operation in which two functions are combined to produce a third function that expresses how the shape of one function is modified by the other.

We are using a custom algorithm and the Convolutional Neural Network to train our proposed model, the Malware Classification Model.

Figure 4.1: The stages in CNN

1. First, we pass an input image to the convolutional layer. The output is obtained as an activation map.
2. The filters applied in the convolution layer extract the required features from the input image and pass them on.
3. Each filter gives a different feature to support the correct class prediction. When we need to retain the size of the input image, we use same padding (zero padding); otherwise valid padding is used, since it helps to reduce the number of features.
4. Pooling layers can reduce the number of parameters
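The convolution, padding, and pooling stages listed above can be illustrated with a small pure-Python sketch. It follows the form of Eq. (1) for a single input map (filter response plus bias, then an activation) and of Eq. (5) for down-sampling. Several choices here are assumptions made for illustration rather than details taken from the paper: ReLU as the activation f, max pooling as the down(.) function, a stride of 1, and the function names themselves. As in most deep learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv2d(image, kernel, padding="valid", bias=0.0):
    """Slide `kernel` over `image` with stride 1 and return the feature map.

    padding="valid": no padding, so the output shrinks by (kernel size - 1).
    padding="same":  zero padding so the output keeps the input size
                     (assumes an odd-sized kernel).
    """
    kh, kw = len(kernel), len(kernel[0])
    if padding == "same":
        ph, pw = kh // 2, kw // 2
        padded_width = len(image[0]) + 2 * pw
        image = ([[0.0] * padded_width for _ in range(ph)]
                 + [[0.0] * pw + list(row) + [0.0] * pw for row in image]
                 + [[0.0] * padded_width for _ in range(ph)])
    elif padding != "valid":
        raise ValueError("padding must be 'valid' or 'same'")

    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        output.append(row)
    return output


def relu(feature_map):
    """Activation f (assumed to be ReLU here): negative responses become 0."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keeps the largest value of each block."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]
```

For a 4x4 input and a 3x3 kernel, valid padding yields a 2x2 map while same padding keeps 4x4; a 2x2 max pool then halves each dimension, which is how pooling reduces the number of values passed to later layers.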
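The four conversion steps of Section 4.1 (binary format, blocks, pixel values, image with defined height and width) can be sketched as follows. The paper does not specify the block size or the mapping from blocks to colours, so this sketch assumes that each block of three consecutive bytes becomes one (R, G, B) pixel, that short files are zero-padded, and that long files are truncated to the requested size. The function names and the default 64x64 size are hypothetical. The image is written in the binary PPM (P6) format so that no imaging library is required.

```python
def bytes_to_pixels(data, block_size=3):
    """Steps 2 and 3: split the byte array into blocks, one pixel per block.

    With block_size=3 each block is read as an (R, G, B) triple; the last
    block is zero-padded if the file length is not a multiple of block_size.
    """
    pad = (-len(data)) % block_size
    data = data + bytes(pad)
    return [tuple(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def pixels_to_image(pixels, width, height):
    """Step 4: arrange pixels row by row into a width x height image.

    Extra pixels are dropped and missing pixels are filled with black;
    both choices are assumptions of this sketch.
    """
    needed = width * height
    pixels = pixels[:needed] + [(0, 0, 0)] * (needed - len(pixels))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def exe_to_image(path, width=64, height=64):
    """Step 1 plus the steps above: read the file as raw bytes and convert."""
    with open(path, "rb") as handle:
        data = handle.read()
    return pixels_to_image(bytes_to_pixels(data), width, height)


def save_ppm(image, path):
    """Write the image as a binary PPM (P6) file with 8-bit channels."""
    height, width = len(image), len(image[0])
    with open(path, "wb") as handle:
        handle.write(f"P6\n{width} {height}\n255\n".encode("ascii"))
        for row in image:
            for pixel in row:
                handle.write(bytes(pixel))
```

A call such as `save_ppm(exe_to_image("sample.exe"), "sample.ppm")`, where `sample.exe` is a placeholder path, would produce an image that can then be passed to the classifier.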
The dataset refers to a file that contains one or more records. A record is the basic unit of information used by a program running on z/OS. Any named group of records is called a data set. Our dataset consists of 17 classes. Each class contains 350 images, so we have 5950 images in total. The first 15 classes are the malicious families, the 16th class is the non-malicious class, and the 17th class contains the unknown images.

5.1 17 Classes

The following are the 17 classes in our dataset that we used to train our model.

5.1.1 Agent
Agent is also called Trojan. Trojan:W32/Agent is a very large family of programs, most of which download and install adware or malware on the victim's machine. Agent variants may also change the configuration settings for Windows Explorer and/or for the Windows interface.

5.1.2 Adopshel
Adopshel is classified as a type of Riskware. Riskware is any potentially unwanted application that is not classified as malware, but may utilize system resources in an undesirable or annoying manner, and/or may pose a security risk.

5.1.3 Allaple
Allaple is a multi-threaded, polymorphic network worm capable of spreading to other computers connected to a local area network (LAN).

5.1.4 BrowseFox
BrowseFox is Malwarebytes' detection name for a large family of adware that uses different methods of browser hijacking and monetizing to get its message across.

5.1.5 Dinwod
This Trojan arrives on a system as a file dropped by other malware or as a file downloaded unknowingly by users when visiting malicious sites.

5.1.9 Hlux
Hlux describes software with malicious behavior that aims to gather information about a person or organization and send such information to another entity in a way that harms the users.

5.1.10 Injector
Injector trojans insert malicious code into processes running on a computer in order to perform various actions such as downloading additional malware, interfering with web browsing activities, or monitoring the user's actions.

5.1.11 Neshta
Neshta is an older file infector that is still common in the wild. It was first observed in 2003 and has previously been associated with BlackPOS malware. It prepends malicious code to infected files. This threat is commonly introduced into an environment through unintended downloads or by other malware.

5.1.12 Vilsel
Vilsel is the detection name for a family of Trojans that change the system's proxy settings, bypass the Windows firewall, and download and execute other malware.

5.1.13 Regrun
Regrun is a devious Trojan horse that may be installed onto a PC through a malicious link or even a web browser attack. Once installed, Trojan.Regrun may open up the infected system to a remote hacker. The remote hacker could then obtain personal data and files from the infected PC. To completely eliminate the threat that comes with the installation of Trojan.Regrun, a computer user should use a reputable anti-spyware program.

5.1.14 Stantiko
Stantiko can be used to execute certain operations such as searches, filling out forms, signing up for email lists that you're unaware of, and even allowing other backdoor activities. The backdoor has a loader to execute any executable, allowing the threat operators to execute any
code on the thousands of machines that belong to this botnet. It contains two malicious Windows services that can reinstall

5.1.15 VBKrypt
This malware family is written in the Visual Basic programming language, which is its main distinguishing trait from other malware families. This is a different class of malware

5.1.16 NonMalicious
This class consists of the images that do not belong to the malicious classes; that is, non-malicious images are classified into this class.

5.1.17 Unknown
Images that are malicious but do not belong to any of the above 15 malware families are classified into this class.

6. EVALUATION

models: Gray scale-based model, GIST+KNN, GIST+SVM, GLCM+KNN, and GLCM+SVM.

The graph and the confusion matrix for our approach are shown in Figures 6.1 and 6.2. The confusion matrix shows the accuracy obtained for the different malware families, that is, the seventeen classes. The graph shows the loss and accuracy for the training and validation sets.
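A confusion matrix and the accuracy figures derived from it, as used in this evaluation, can be computed along the following lines. This is a generic sketch, not the authors' evaluation code: the function names are hypothetical, and no results from the paper are reproduced here.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts samples of true class
    labels[i] that were predicted as class labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix, labels):
    """For each class, the fraction of its samples that were predicted correctly."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls
```

With the 17 class names as `labels`, the rows of the matrix correspond to the true classes and the columns to the predicted ones, so off-diagonal entries show which families are confused with each other.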
Figure 7.1: Output of malicious class
Figure 7.2: Output of unknown class
Figure 7.3: Output of Non-Malicious class

REFERENCES
[1] H. Gao, Y. Du, and M. Diao. Quantum-inspired glowworm swarm optimization and its application. International Journal of Computing Science and Mathematics, 8(1):91-100, 2017.
[2] D. Díaz-Pernil, A. Berciano, F. Peña-Cantillana, and M. A. Gutiérrez-Naranjo. Bio-inspired parallel computing of representative geometrical objects of holes of binary 2D-images. International Journal of Bio-Inspired Computation, 9(2):77-92, 2017.
[3] G.-G. Wang, X. Cai, Z. Cui, G. Min, and J. Chen. High performance computing for cyber physical social systems by using evolutionary multi-objective optimization algorithm. IEEE Transactions on Emerging Topics in Computing, 2017.
[4] Y. Ye, T. Li, D. Adjeroh, and S. S. Iyengar. A survey on malware detection using data mining techniques. ACM Computing Surveys (CSUR), 50(3):41, 2017.
[5] J. Bouvrie. Notes on convolutional neural networks. 2006.
[6] X. Cai, X.-Z. Gao, and Y. Xue. Improved bat algorithm with optimal forage strategy and random disturbance strategy. International Journal of Bio-Inspired Computation, 8(4):205-214, 2016.
[7] X. Cai, H. Wang, Z. Cui, J. Cai, Y. Xue, and L. Wang. Bat algorithm with triangle-flipping strategy for numerical optimization. International Journal of Machine Learning and Cybernetics, 9(2):199-215, 2018.
[8] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant. Semantics-aware malware detection. In 2005 IEEE Symposium on Security and Privacy, pages 32-46. IEEE, 2005.
[9] Z. Cui, Y. Cao, X. Cai, J. Cai, and J. Chen. Optimal LEACH protocol with modified bat algorithm for big data sensing systems in internet of things. Journal of Parallel and Distributed Computing, 2018. (doi:10.1016/j.jpdc.2017.12.014).
[10] Z. Cui, B. Sun, G. Wang, Y. Xue, and J. Chen. A novel oriented cuckoo search algorithm to improve dv-hop performance for