Convolutional Layer Implementation To Classify Malware in Banking Financial Services Industry
THESIS
AZIS S PRASETYOTOMO
11060224390
During the last decade, solutions that fight against malicious software have
begun using machine learning approaches. Unfortunately, there are few
open-source datasets available for the academic community. One of the biggest
datasets available was released last year in a competition hosted on Kaggle
with data provided by Microsoft for the Big Data Innovators Gathering
(BIG 2015). This thesis presents two novel and scalable approaches using
Convolutional Neural Networks (CNNs) to assign malware to its corresponding
family. The first approach makes use of CNNs to learn a feature hierarchy to
discriminate among samples of malware represented as gray-scale images. The
second approach uses the CNN architecture introduced by Yoon Kim [12] to
classify malware samples according to their x86 instructions. The proposed
methods achieved an improvement of 93.86% and 98.56% with respect to the
equal probability benchmark.
Acknowledgments
I would first like to thank my family, especially Mom, for the continuous
support she has given me throughout my time in graduate school. Second,
I would like to express my gratitude to my supervisor, Dr. Javier Béjar for
their guidance during the course of this thesis.
Contents
1 Introduction
    1.1 Objective
    1.2 Organization
2 Background
    2.1 Artificial Neural Networks
        2.1.1 Perceptrons
        2.1.2 Sigmoid neuron
        2.1.3 Loss function
        2.1.4 Gradient Descent Algorithm
        2.1.5 Backpropagation
    2.2 Convolutional Neural Networks
        2.2.1 Local connectivity
        2.2.2 Convolutional Layer
        2.2.3 Pooling Layer
    2.3 Overfitting
        2.3.1 Regularization
        2.3.2 Dropout
        2.3.3 Artificially expanding the training data
    2.4 Deep Learning
        2.4.1 ReLU units
        2.4.2 Gradient Descent Optimization Algorithms
7 Conclusions
    7.1 Future Work
Chapter 1
Introduction
• Spyware. It is a type of malware that spies on and tracks user activity
without the user's knowledge. The capabilities of spyware can include
keystroke collection, financial data harvesting or activity monitoring.
• Worm. It is a type of malware that spreads through the computer
network by exploiting operating system vulnerabilities. The major
difference between worms and viruses is that computer worms have the
ability to self-replicate and spread independently while viruses rely on
human activity to spread.
• Command & Control Bot. Bots are software programs created to
automatically perform specific operations. Bots are commonly used for
DDoS attacks, as spambots that render advertisements on websites, as
web spiders or for distributing malware. One way to defend against
bots is to use CAPTCHA tests on websites to verify that users are human.
program capabilities and behavior can be observed either by examining its
code or by executing it in a safe environment.
power has become cheaper, meaning that researchers can fit larger and more
complex models to data, and (3) machine learning as a discipline has evolved
and there are more tools at their disposal. Machine learning approaches hold
the promise that they might achieve high detection rates without the need
for the human signature generation required by traditional approaches. In
consequence, AV companies and researchers have begun to employ machine
learning classifiers such as logistic regression [22], neural networks [8]
and decision trees [14] to help them address this problem.
The two principal tasks that have been carried out within the scope of
malware analysis are (1) malware detection and (2) malware classification.
First, a file is analyzed to detect whether it has any malicious content. If
it does, it is then assigned to the most appropriate malware family
according to its content and behavior through a classification mechanism.
1.1 Objective
This master thesis aims to explore the problem of malware classification. In
particular, this thesis proposes two novel approaches based on Convolutional
Neural Networks (CNNs). On one hand, CNNs were applied for learning
discriminative patterns from malware images based on the work performed by
Nataraj et al.[21]. On the other hand, the CNN architecture proposed in [12]
was used to classify malicious software based on malware’s x86 instructions.
Both approaches have been evaluated on the data provided by Microsoft for
the BIG Cup 2015 (Big Data Innovators Gathering).
1.2 Organization
The thesis is organized into the following chapters. The first and current
chapter is the introduction, which also contains the objectives and the
organization of the thesis. The second chapter introduces the background of
the project, focusing on neural networks and deep learning from their
beginnings until now. The third chapter presents the state of the art review,
with special attention to the machine learning algorithms and features used
to detect and classify malware. The fourth chapter introduces the Kaggle
platform and Microsoft's Malware Classification Challenge. In addition, it
describes two solutions to the competition. The fifth chapter describes the
approach based on the representation of malware as gray-scale images, and the
sixth chapter explains how Convolutional Neural Networks can be used to
extract features from malware's x86 instructions represented via word
embeddings. Finally, the last chapter wraps up the conclusions and the future
work to be done.
Chapter 2
Background
• Neurons are arranged in layers, with the first layer taking in inputs and
the last layer producing the output.
2.1. ARTIFICIAL NEURAL NETWORKS
2.1.1 Perceptrons
The perceptron was one of the earliest supervised learning algorithms and it
is the basic building block of Artificial Neural Networks (ANN). It was first
introduced in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt.
It works by taking several inputs $(x_1, x_2, \dots, x_j)$ and producing a
single output $y$. Rosenblatt introduced weights $(w_1, w_2, \dots, w_j)$ to
express the importance of the respective inputs to the output. The output of
the perceptron is either 0 or 1, and it is determined by whether the weighted
sum $\sum_j w_j \cdot x_j + b$ is less than or greater than 0.
\[
\text{output} =
\begin{cases}
1 & \text{if } w \cdot x + b > 0 \\
0 & \text{if } w \cdot x + b \le 0
\end{cases}
\]
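The decision rule of the perceptron can be illustrated with a minimal Python sketch. The function name and the NAND-like weights and bias are hypothetical values chosen only for illustration; they do not come from the thesis.

```python
def perceptron_output(weights, inputs, bias):
    """Return 1 if the weighted sum plus the bias is greater than 0, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical weights and bias that make the perceptron act like a NAND gate.
nand_weights, nand_bias = [-2.0, -2.0], 3.0
nand_table = [perceptron_output(nand_weights, [a, b], nand_bias)
              for a in (0, 1) for b in (0, 1)]
```

Only the input (1, 1) produces a non-positive weighted sum here, so it is the only input mapped to 0.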
Like the perceptron, a sigmoid neuron has inputs $(x_1, x_2, \dots, x_j)$,
a weight for each input and a bias, but its output can be any real number
between 0 and 1. The sigmoid function is defined as:
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
so the output of a sigmoid neuron with inputs $x_j$, weights $w_j$ and bias
$b$ is:
\[
\frac{1}{1 + \exp\left(-\sum_j w_j \cdot x_j - b\right)}
\]
Hence, the only difference between the perceptron and the sigmoid neuron is
the activation function.
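As a rough illustration, the sigmoid neuron can be sketched in a few lines of Python. The function names and the example weights are assumptions made for this sketch, and the naive `math.exp` call can overflow for very negative `z`.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron_output(weights, inputs, bias):
    """Apply the sigmoid to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical weights: the weighted sum is -2 - 2 + 3 = -1.
out = sigmoid_neuron_output([-2.0, -2.0], [1, 1], 3.0)
```

Where a perceptron with the same weights would output exactly 0 for this input, the sigmoid neuron outputs a value strictly between 0 and 0.5.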
where:
The goal in training neural networks is to find weights and biases that
minimize some cost/loss function C. An algorithm called gradient descent is
used for that purpose.
1. Start with a random initialization of each weight and bias in the NN.
Random initialization is important because, if all parameters start off at
identical values, all the hidden layer units will end up learning the same
function of the input. Random initialization therefore serves the purpose
of symmetry breaking.
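\[
W^{l}_{i,j} = W^{l}_{i,j} - \alpha \frac{\partial}{\partial W^{l}_{i,j}} L(W, b)
\]
\[
b^{l}_{i} = b^{l}_{i} - \alpha \frac{\partial}{\partial b^{l}_{i}} L(W, b)
\]
where $\alpha$ is the learning rate and $W^{l}_{i,j}$ and $b^{l}_{i}$ denote
each weight and bias in a particular layer $l$ in the NN, respectively.
\[
\frac{\partial}{\partial W^{l}_{i,j}} L(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{\partial}{\partial W^{l}_{i,j}} L(W, b : x^{(i)}, y^{(i)}) \right]
\]
\[
\frac{\partial}{\partial b^{l}_{i}} L(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left[ \frac{\partial}{\partial b^{l}_{i}} L(W, b : x^{(i)}, y^{(i)}) \right]
\]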
The learning rate is used to control how big a step is taken downhill with
gradient descent. Selecting the correct learning rate is critical. On one hand,
if $\alpha$ is too small, gradient descent can be slow. On the other hand, if
$\alpha$ is too large, gradient descent can overstep the minimum and even diverge.
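The effect of the learning rate can be seen on a toy problem. The sketch below assumes a one-dimensional loss L(w) = w^2 (gradient 2w, minimum at w = 0), which is not a loss used in the thesis; the three learning rates are illustrative values only.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Repeatedly apply the update w <- w - learning_rate * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


def grad(w):
    """Gradient of the toy loss L(w) = w**2."""
    return 2.0 * w


w_small = gradient_descent(grad, w0=1.0, learning_rate=0.001, steps=50)  # barely moves
w_good = gradient_descent(grad, w0=1.0, learning_rate=0.1, steps=50)     # approaches 0
w_large = gradient_descent(grad, w0=1.0, learning_rate=1.1, steps=50)    # diverges
```

With the smallest rate the parameter has hardly moved after 50 steps, while with the largest rate each step overshoots the minimum by more than the previous distance to it, so the iterates grow without bound.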
2.1.5 Backpropagation
The key step is to compute all the partial derivatives presented before. The
backpropagation algorithm is used to compute these partial derivatives
efficiently.
2. For each output unit $i$ in layer $n_l$, compute the error term $\delta^{n_l}_i$.
\[
\delta^{n_l}_i = \frac{\partial}{\partial z^{n_l}_i} \frac{1}{2} \lVert y - h_{W,b}(x) \rVert^2
\]
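\[
\frac{\partial}{\partial W^{l}_{i,j}} L(W, b : x, y) = a^{l}_{j} \cdot \delta^{l+1}_{i}
\]
\[
\frac{\partial}{\partial b^{l}_{i}} L(W, b : x, y) = \delta^{l+1}_{i}
\]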
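The two gradient formulas can be checked numerically on a tiny network. The NumPy sketch below assumes a network with one hidden layer, sigmoid activations and the squared-error loss; the layer sizes, the random parameters and the rule used to propagate the error term to the hidden layer (the standard backpropagation recursion) are assumptions of this sketch rather than details recovered from the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(params, x):
    """Return the activations of the hidden layer and of the output layer."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2


def loss(params, x, y):
    """Squared-error loss 1/2 * ||y - h(x)||^2 for a single example."""
    _, a2 = forward(params, x)
    return 0.5 * np.sum((y - a2) ** 2)


def backprop(params, x, y):
    """Gradients of the loss with respect to W1, b1, W2 and b2."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    # Error term of the output layer: derivative of the loss w.r.t. z.
    delta_out = -(y - a2) * a2 * (1.0 - a2)
    # Error term propagated back to the hidden layer (standard recursion).
    delta_hidden = (W2.T @ delta_out) * a1 * (1.0 - a1)
    # dL/dW[i, j] = a_j * delta_i (an outer product); dL/db[i] = delta_i.
    return (np.outer(delta_hidden, x), delta_hidden,
            np.outer(delta_out, a1), delta_out)


rng = np.random.default_rng(0)
params = [rng.normal(size=(3, 2)), rng.normal(size=3),
          rng.normal(size=(1, 3)), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([1.0])
grads = backprop(params, x, y)
```

A finite-difference estimate of any single partial derivative should agree with the corresponding entry of `grads`, which is a convenient sanity check when implementing backpropagation by hand.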
2.2. CONVOLUTIONAL NEURAL NETWORKS
CNNs are composed of three types of layers: (1) fully-connected, (2)
convolutional and (3) pooling. The various implementations of CNNs can be
loosely described as involving the following process:
3. Repeat steps 1 and 2 until you are left with enough high level features.
Figure 2.2 corresponds to the architecture used in [16] that was applied to
the ImageNet classification contest. The architecture consists of 8 learnable
layers, the first five are convolutional and the rest are fully-connected layers.
The size of the output after convolving a kernel of size $Z$ over an image
of size $N$ with stride $S$ is defined as:
\[
\text{output} = \frac{N - Z}{S} + 1
\]
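The formula is easy to evaluate in code. The helper below is a hypothetical name introduced for this sketch; it assumes no padding and raises an error when the kernel and stride do not tile the input exactly, which is a design choice of the sketch and not something stated in the text.

```python
def conv_output_size(n, z, s=1):
    """Output width of an unpadded convolution: (n - z) / s + 1."""
    if (n - z) % s != 0:
        raise ValueError("kernel size and stride do not tile the input exactly")
    return (n - z) // s + 1


# A 3-pixel kernel with stride 1 applied to inputs of width 32, 15 and 7.
sizes = [conv_output_size(32, 3), conv_output_size(15, 3), conv_output_size(7, 3)]
```

For example, a 32-pixel-wide input convolved with a 3-pixel kernel at stride 1 yields an output that is 30 pixels wide.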
2.3 Overfitting
Overfitting refers to the condition in which a predictive model describes the
random noise of a particular dataset instead of learning the underlying
relationship. As a result, such models may not yield accurate predictions for
new observations.
This section describes the most common techniques used to avoid overfitting
in large networks.
2.3.1 Regularization
Regularization adds an extra term, named the regularization term, to the loss
function so that the network prefers to learn small weights and large weights
are penalized. Regularization usually doesn't affect biases. That's because
large biases make it easier for neurons to saturate, which is sometimes
desirable. Moreover, having large biases doesn't make a neuron sensitive to
its inputs in the same way as having large weights.
• L2 regularization.
\[
L(W, b) = L_0(W, b) + \frac{\lambda}{2n} \sum_{w} w^2
\]
• L1 regularization.
\[
L(W, b) = L_0(W, b) + \frac{\lambda}{n} \sum_{w} |w|
\]
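Both penalties can be computed directly from the weight arrays. In the NumPy sketch below, `lam` stands for the regularization strength, `n` is assumed to be the size of the training set, and the weights and the unregularized loss value are made-up numbers for illustration. Biases are left out of the penalty, as described above.

```python
import numpy as np


def l2_penalty(weights, lam, n):
    """(lam / (2 n)) times the sum of squared weights over all weight arrays."""
    return lam / (2.0 * n) * sum(np.sum(w ** 2) for w in weights)


def l1_penalty(weights, lam, n):
    """(lam / n) times the sum of absolute weights over all weight arrays."""
    return lam / n * sum(np.sum(np.abs(w)) for w in weights)


# Hypothetical weights of a two-layer network and a made-up base loss.
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0])]
base_loss = 0.25
loss_l2 = base_loss + l2_penalty(weights, lam=0.1, n=10)
loss_l1 = base_loss + l1_penalty(weights, lam=0.1, n=10)
```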
2.3.2 Dropout
In large neural networks, it is difficult to average the predictions of
different networks at test time. To address this problem, dropout was
introduced in [32] by Geoffrey Hinton. The idea is to randomly drop units
(along with their connections) from the neural network during training to
prevent neurons from co-adapting too much.
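A minimal NumPy sketch of the idea follows. It uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time; that variant, the function name and the keep probability are assumptions of this sketch and may differ from the formulation in [32].

```python
import numpy as np


def dropout(activations, keep_prob, rng, training=True):
    """Randomly zero units during training and rescale the survivors."""
    if not training:
        return activations
    # Each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, keep_prob=0.5, rng=rng)
```

Because the kept units are divided by `keep_prob`, the expected value of each activation is unchanged, so the layer's output keeps roughly the same scale between training and testing.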
cropped, flipped and transposed in various ways to expand the training data.
2.4. DEEP LEARNING
Figure 2.5 presents a graphical comparison between the ReLU and the sigmoid
function.
1. Batch Gradient Descent. It computes the gradients for the loss function
L(W,b) for the entire training set. It guarantees the convergence to the
global minimum for convex error surfaces and to a local minimum for
non-convex surfaces.
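1. Momentum [25]
The simplest gradient algorithm, known as steepest descent (section 2.1.4),
modifies the weights at time step $t$ according to:
\[
W^{l}_{i,j} = W^{l}_{i,j} - \alpha \frac{\partial}{\partial W^{l}_{i,j}} L(W, b)
\]
\[
b^{l}_{i} = b^{l}_{i} - \alpha \frac{\partial}{\partial b^{l}_{i}} L(W, b)
\]
However, learning with such a scheme can be very slow. To improve the speed
of convergence of the gradient descent algorithm, a momentum term is
included in the formula:
\[
W^{l}_{i,j,t+1} = W^{l}_{i,j,t} - \alpha \frac{\partial}{\partial W^{l}_{i,j}} L(W, b) + \gamma \Delta W^{l}_{i,j,t-1}
\]
\[
b^{l}_{i,t+1} = b^{l}_{i,t} - \alpha \frac{\partial}{\partial b^{l}_{i}} L(W, b) + \gamma \Delta b^{l}_{i,t-1}
\]
where $\gamma$ is the momentum term and
$\Delta W^{l}_{i,j,t-1} = W^{l}_{i,j,t} - W^{l}_{i,j,t-1}$ (and likewise
$\Delta b^{l}_{i,t-1}$) is the change applied at the previous step. In
consequence, the modification of the weight vector at step $t$ depends on
both the current gradient and the weight change of step $t - 1$.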
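The momentum update can be sketched as follows, reading it as the text describes: each step combines the current gradient with a fraction of the previous weight change. The toy loss L(w) = w^2, the learning rate and the momentum value 0.9 are assumptions chosen for illustration.

```python
def momentum_step(w, delta_prev, grad, learning_rate, gamma):
    """Combine the current gradient step with the previous weight change."""
    delta = -learning_rate * grad + gamma * delta_prev
    return w + delta, delta


def run(gamma, steps=100, learning_rate=0.01):
    """Minimize the toy loss L(w) = w**2 (gradient 2*w) starting from w = 1."""
    w, delta = 1.0, 0.0
    for _ in range(steps):
        w, delta = momentum_step(w, delta, 2.0 * w, learning_rate, gamma)
    return w


w_plain = run(gamma=0.0)     # steepest descent without momentum
w_momentum = run(gamma=0.9)  # same learning rate with a momentum term
```

With the same small learning rate and the same number of steps, the run with momentum ends much closer to the minimum than plain steepest descent on this toy problem.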
2. Adagrad [5]
Adagrad is an algorithm for gradient-based optimization that adapts
the learning rate to the parameters, performing smaller updates for
frequent parameters and larger updates for infrequent parameters.
Adagrad uses a different learning rate for every parameter $W^{l}_{i,j,t}$
at each time step $t$. In its update rule, it modifies the general learning
rate $\alpha$ at each time step $t$ for every parameter $W^{l}_{i,j,t}$
based on the past gradients that have been computed for $W^{l}_{i,j,t}$.
\[
W^{l}_{i,j,t+1} = W^{l}_{i,j,t} - \frac{\alpha}{\sqrt{G^{l}_{t,ij} + \epsilon}} \cdot \frac{\partial}{\partial W^{l}_{i,j}} L(W, b)
\]
where $G^{l}_{t} \in \mathbb{R}^{d \times d}$ is the diagonal matrix in which
each diagonal element $ij$ is the sum of the squares of the gradients with
respect to $W^{l}_{i,j}$ up to time step $t$, and $\epsilon$ is a smoothing
term that avoids division by zero ($\approx 10^{-8}$).
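The per-parameter scaling of Adagrad can be sketched with NumPy. The sketch stores only the diagonal of the accumulated squared gradients as a vector and divides by its square root, as in the usual formulation of Adagrad; the gradients and hyperparameters are made-up values.

```python
import numpy as np


def adagrad_step(w, grad_sq_sum, grad, learning_rate=0.1, eps=1e-8):
    """One Adagrad update; grad_sq_sum accumulates the squared gradients."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    w = w - learning_rate / np.sqrt(grad_sq_sum + eps) * grad
    return w, grad_sq_sum


w = np.array([1.0, 1.0])
g_sum = np.zeros_like(w)
# Hypothetical gradients: the first parameter always receives large
# gradients, the second always receives small ones.
for _ in range(3):
    w, g_sum = adagrad_step(w, g_sum, np.array([10.0, 0.1]))

effective_rates = 0.1 / np.sqrt(g_sum + 1e-8)
```

The parameter with the large accumulated gradient ends up with a much smaller effective learning rate, so both parameters move by nearly the same amount despite gradients that differ by two orders of magnitude.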
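3. Adam [13]
Adam is the acronym for Adaptive Moment Estimation. It is another
method that computes adaptive learning rates for each parameter. It
stores an exponentially decaying average of the past squared gradients,
which we will denote $v_t$, and, similar to momentum, it keeps an
exponentially decaying average of past gradients $m_t$:
\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
\]
\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\]
where $g_t$ is the gradient of the objective function and $\beta_1$ and
$\beta_2$ are the decay rates. $m_t$ and $v_t$ are estimates of the first
moment (the mean) and the second moment (the uncentered variance) of the
gradients, respectively. The bias-corrected estimates and the resulting
update rule are:
\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]
\[
W_{t+1} = W_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t
\]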
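A single Adam update can be sketched as below. The moving-average recursions for `m` and `v` follow the usual formulation of Adam, and the default decay rates 0.9 and 0.999, the toy loss L(w) = w^2 and the learning rate are assumptions of this sketch rather than values taken from the thesis.

```python
import numpy as np


def adam_step(w, m, v, grad, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    w = w - learning_rate / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v


# Minimize the toy loss L(w) = w**2 (gradient 2*w) starting from w = 1.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    w, m, v = adam_step(w, m, v, 2.0 * w, t, learning_rate=0.01)
```

Thanks to the bias correction, the very first step has a magnitude of almost exactly the learning rate, regardless of the scale of the gradient.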
Chapter 3
During the last decade, researchers and anti-virus vendors have begun
employing machine learning algorithms such as association rules, Support
Vector Machines, Random Forests, Naive Bayes and Neural Networks to address
the problem of malicious software detection and classification. An overview
of the methods can be found in [26], [7] and [6]. A few of the approaches
used in the literature are discussed below.
2. Opcodes N-grams [11, 27, 2, 3, 30]
Similar to byte-sequence N-grams, n-gram models have been generated
from opcodes extracted from assembly language code files. An opcode
(abbreviated from operation code) is the portion of a machine language
instruction that specifies the operation to be performed. In particular,
[3] investigated the most frequent opcodes and the rare opcodes present
in both goodware and malware. The following two charts show the 14
most frequent opcodes in goodware and in malware, respectively.
monly used in the Windows operating systems. PE format is a data
format that encapsulates the necessary information for the Windows
OS loader to manage the executable code. It includes information such
as dynamic library references for linking and API import and export
tables.
Figure 3.3: Outline of Invincea's Malware Detection Framework
API call sequences. They used a 3rd order Markov chain, i.e. 4-grams,
to model the API calls. The malicious executables mainly consisted of
backdoors, worms and Trojan horses collected from VXHeavens. Their
detection system achieved an accuracy of 90%.
6. Use of registers
In [19], the authors proposed a method based on similarities in the
behavior of binaries. They assumed that the behavior of each binary can be
represented by the values of its memory contents at run time. In other
words, the values stored in different registers while malicious software is
running can be a distinguishing factor that sets it apart from benign
programs. The register values for each API call are extracted before and
after the API is invoked. After that, they traced the changes in register
values and created a vector for each of the values of the EAX, EBX, EDX,
EDI, ESI and EBP registers. Finally, by comparing old and unseen
malware vectors they achieved an accuracy of 98% on unseen samples.
7. Call Graphs
A call graph is a directed graph that represents the relationships
between subroutines in a computer program. In particular, each node
represents a procedure/function and each edge (f,g) indicates that
procedure f calls procedure g. This kind of analysis has been used for
malware classification with good results. In [15], the authors presented a
framework which builds a function call graph from the information
extracted from disassembled malware programs. For every node (i.e.
function) in the graph, they extracted attributes including library API
calls and how many I/O read operations have been made by the
function. Then, they computed the similarity between any two malware
instances.
8. Malware as an Image
In [21], a completely different approach to characterizing and analyzing
malicious software was presented. The authors represented a malware
executable as a binary string of zeros and ones. The resulting vector was
then reshaped into a matrix so that the malware file could be viewed as a
gray-scale image. Their approach was based on the observation that, for
many malware families, the images belonging to the same family appear to be
very similar in layout and texture.
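The conversion from a binary file to a gray-scale image can be sketched with NumPy. The sketch assumes that each byte (8 bits) becomes one pixel intensity between 0 and 255, that the image width is fixed in advance and that the last row is padded with zeros; these choices and the function name are illustrative and are not taken from [21].

```python
import math

import numpy as np


def bytes_to_grayscale(data, width):
    """Interpret each byte as one pixel and reshape into rows of `width`."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = math.ceil(len(pixels) / width)
    # Pad the last row with zeros (an assumption of this sketch).
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(pixels)] = pixels
    return padded.reshape(height, width)


# Made-up byte string standing in for the content of a malware binary.
sample = bytes(range(256)) * 3 + b"\x90\x90"
image = bytes_to_grayscale(sample, width=64)
```

The resulting array can be saved or displayed with any imaging library, or downsampled before being fed to a classifier.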
Chapter 4
Microsoft Malware
Classification Challenge
4.2. MICROSOFT MALWARE CLASSIFICATION CHALLENGE
For each observation we were provided with a file containing the hexadecimal
representation of the file's binary content and a file containing metadata
extracted from the binary content, such as function calls, strings,
sequences of instructions and registers used, that was generated using the
IDA disassembler tool.
• Byte Count. Two hex digits indicating the number of hex digit pairs
in the data field.
• Record Type. Two hex digits, 00 to 05, defining the meaning of the
data field.
• Checksum. Two hex digits, a computed value that can be used to verify
the record has no errors.
• The rdata section. It holds the debug directory which stores the type,
size and location of various types of debug information stored in the
file.
• The edata section. It contains the list of the functions and data that
the PE file exports for other programs.
where the fields in brackets are optional. A basic instruction has two
parts: (1) the name of the instruction or the mnemonic to be executed
(also known as opcodes); (2) the operands or the parameters of the
command.
INC COUNT       ; Increment the memory variable COUNT
MOV TOTAL, 48   ; Transfer the value 48 in the
                ; memory variable TOTAL
Next you will find the top 10 most used x86 instructions in the training
dataset.
One particularity of the training dataset is that there are some mal-
ware samples that due to code obfuscation techniques do not have any
instruction.
Family            #samples
Ramnit            0
Lollipop          2
Kelihos_ver3      4
Vundo             22
Simda             0
Tracur            0
Kelihos_ver1      6
Obfuscator.ACY    9
Gatak             0
The .data and .bss directives change the current section to .data or
.bss, respectively.
4.3. WINNER’S SOLUTION
2. Segment line count. They counted the number of lines per section in
the asm file. They also counted the number of different sections across all
malware samples, which curiously was 448, a number much greater than 9,
the number of sections into which an asm file is usually divided. That's
because of the application of metamorphic and polymorphic techniques.
3. Asm file pixel intensity features. Instead of representing the bytes file
as pixels they read the asm file as a binary file. They found that the
first 800 pixel intensities were very useful features.
4.4. NOVEL FEATURE EXTRACTION, SELECTION AND FUSION
FOR EFFECTIVE MALWARE FAMILY CLASSIFICATION
                  Ramnit  Lollipop  Kelihos_ver3  Vundo  Simda  Tracur  Kelihos_ver1  Obfuscator.ACY  Gatak
Ramnit              1541         0             0      0      0       0             0               0      0
Lollipop               1      2476             0      0      0       1             0               0      0
Kelihos_ver3           0         0          2942      0      0       0             0               0      0
Vundo                  0         0             0    475      0       0             0               0      0
Simda                  2         0             0      0     39       1             0               0      0
Tracur                 1         0             0      0      0     750             0               0      0
Kelihos_ver1           0         0             0      0      0       0           398               0      0
Obfuscator.ACY         0         0             1      0      0       0             0            1225      2
Gatak                  0         1             0      0      0       0             0               5   1007
3. Entropy (ENT):
Entropy can be defined as a measure of the amount of disorder, and it is
used to detect the presence of obfuscation in malware files. For this
reason, they computed the entropy of all the bytes in a malware file.
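As a rough illustration of this feature, the sketch below computes the Shannon entropy of a byte string in bits per byte. Using Shannon entropy over the whole file, as well as the function name, is an assumption of this sketch; the exact computation used by the authors may differ.

```python
import math
from collections import Counter


def byte_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (between 0 and 8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


low = byte_entropy(b"\x00" * 1024)           # a single repeated byte value
high = byte_entropy(bytes(range(256)) * 4)   # all byte values equally frequent
```

Highly repetitive content has an entropy close to 0, whereas compressed or encrypted content, which is common in packed or obfuscated malware, approaches the maximum of 8 bits per byte.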
8. Register (REG):
They computed the frequency of use of the registers in x86 architecture.
bly files such as the total number of lines in .bss, .txt, .data, etc sections
or the proportion of lines in each section compared to the whole file.
The following table lists the feature categories and their evaluation with
XGBoost.
After the feature extraction process, they combined the features using a
version of the forward stepwise selection algorithm. The original version of
this algorithm starts with a model containing no features and then gradually
increases the feature set by adding one feature at each iteration. Instead
of considering one feature at a time, they added all the features belonging
to a feature category at a time, until adding more features no longer
improved the logloss. By combining the feature categories as described
earlier, they achieved a test logloss of 0.0063, positioning their solution
among the top 10 in the competition.
4.5. DEEP LEARNING FRAMEWORKS
1. Caffe.
It is a deep learning framework with a Python interface developed by the
Berkeley Vision and Learning Center. It lets you easily choose whether to
train on the CPU or the GPU. Caffe benefits from having a huge repository
of pre-trained neural network models suited for many problems. It has a
great implementation of convolutional networks but no implementation of
recurrent networks.
2. Theano.
It is a Python deep learning library which makes use of symbolic graphs
for programming the networks. It also allows you to visualize the
computation graphs with d3viz.
3. TensorFlow.
It provides a Python API over a C/C++ engine that makes it run fast. It is
more than a deep learning framework, and it has tools to support
reinforcement learning and other algorithms. In addition, TensorFlow can
also be deployed on phones because it can be compiled for ARM architectures.
4. Deeplearning4j.
It is a deep learning framework developed in Java. It aims to be the
scikit-learn library in the deep learning space.
5. Torch.
It is a computational framework written in Lua that supports machine
learning algorithms. It has been used by large scale companies such as
TensorFlow has been chosen mainly because it has a Python API, there is
a lot of documentation available and it has a large community that
continuously develops the library. In addition, it is very easy to set up
and to learn and, recently, TensorBoard was released, a tool to visualize
TensorFlow graphs and to plot metrics such as the accuracy or the loss at
each training iteration. Moreover, it has provided support for distributed
computing since version 0.8 (currently 0.11).
Chapter 5
The next sections explain how malware can be visualized as images, followed
by the architectures of the different CNNs tested and their specifications,
as well as the results obtained in the Kaggle competition.
5.1. VISUALIZING MALWARE AS GRAY-SCALE IMAGES
2. Lollipop. This malware shows ads in your browser and redirects your
search engine results. In addition, it tracks what you are doing on your
computer. This type of malware usually is downloaded from the pro-
gram’s website or by some third-party software installation programs.
4. Vundo. This trojan is known to cause popups and advertising for rogue
antispyware programs. In addition, it is sometimes used to perform denial
of service attacks and to deliver malware to other computers.
6. Tracur. This trojan hijacks results from different search engines such
as Google, YouTube, Yahoo, etc., and redirects to a different web page. It
also gives a hacker access to your computer and can be used to download
other types of malware.
8. Obfuscator.ACY. This class comprises all malware that has been
obfuscated to hide its purpose and to avoid detection. The malware
that lies underneath this obfuscation can have almost any purpose.
5.2. CNN ARCHITECTURES
The main problem of their approach is that it doesn't scale well with lots
of data. Accordingly, two ways of improving it are (1) to keep building more
features such as SIFT, HoG, etc. and (2) to use another classifier such as
Random Forests or SVM. Instead, our approach makes use of Convolutional
Neural Networks to learn a feature hierarchy all the way from pixels to the
layers of the classifier.
This section presents the different architectures of the network and their
specifications. The details of the architectures are defined in figures
5.12, 5.13 and 5.14.
All architectures have the input and the output layers in common. On one
hand, the input layer consists of N neurons, N being the size of the training
images. The width and the height of the images vary depending on the file
size and thus, before being fed to the network, all images were downsampled
to 32 by 32 pixels. In consequence, N is equal to $32 \cdot 32 = 1024$. On
the other hand, all architectures have an output layer of 9 neurons because
the architectures are designed to handle a 9-class classification problem. In
addition, dropout was applied after each densely-connected layer to reduce
overfitting.
5.2.1 CNN A: 1C 1D
The architecture consists of:
3. Max-pooling layer.
\[
P = 1024 \cdot (11 \cdot 11 \cdot 64) + 64 + (11 \cdot 11 \cdot 64) \cdot 4096 + 4096 + 4096 \cdot 9 + 9 = 39690313
\]
where $(11 \cdot 11 \cdot 64) + 64$ are the shared weights for every feature
map and 64 is the total number of shared biases.
5.2.2 CNN B: 2C 1D
The architecture consists of:
3. Max-pooling layer.
5. Max-pooling layer.
As in the previous architecture, the input layer consists of 32x32 neurons and
is followed by a convolutional layer composed by 64 filters of size 3x3. The
where $(3 \cdot 3 \cdot 64) + 64$ and $(3 \cdot 3 \cdot 128) + 128$ are the
shared weights for every feature map and 64 and 128 are the numbers of shared
biases in the first and second convolutional layers, respectively.
5.2.3 CNN C: 3C 2D
The architecture consists of:
3. Max-pooling layer.
5. Max-pooling layer.
7. Max-pooling layer.
It starts with an input layer of 32x32 neurons, which is followed by a
convolutional layer with 64 filters of size 3x3. The output of the
convolutional layer is 30x30x64 and is fed to the following max-pooling
layer, which reduces its input to 15x15x64. Next comes the second
convolutional layer, with 128 filters of size 3x3. It is followed by the
second pooling layer, which takes as input the output of the second
convolutional layer (13x13x128) and outputs 128 feature maps of size 7x7. A
third convolutional layer with 256 filters of size 3x3 follows the second
pooling layer and outputs 256 feature maps of size 5x5. A third pooling
layer then reduces these to 256 feature maps of size 3x3. Lastly, two
densely-connected layers of 1024 and 512 neurons, respectively, follow.
\[
P = 1024 \cdot (3 \cdot 3 \cdot 64) + 64 + (15 \cdot 15 \cdot 64) \cdot (3 \cdot 3 \cdot 128) + 128 + (7 \cdot 7 \cdot 128) \cdot (3 \cdot 3 \cdot 256) + \dots
\]
where $(3 \cdot 3 \cdot 64) + 64$, $(3 \cdot 3 \cdot 128) + 128$ and
$(3 \cdot 3 \cdot 256) + 256$ are the shared weights for every feature map
and 64, 128 and 256 are the numbers of shared biases of the first, second and
third convolutional layers, respectively.
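The feature-map sizes quoted for this architecture can be reproduced with a short shape trace. The sketch assumes unpadded 3x3 convolutions with stride 1 and 2x2 max-pooling with stride 2 that rounds odd sizes up; these settings are inferred from the quoted sizes (for example 13 to 7), not stated explicitly, and the helper names are hypothetical.

```python
import math


def conv_valid(size, kernel):
    """Width after an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_ceil(size, window=2):
    """Width after pooling with stride equal to the window, rounding up."""
    return math.ceil(size / window)


def trace_shapes(input_size, filters, kernel=3):
    """List (layer type, height, width, depth) after each conv/pool pair."""
    shapes, size = [], input_size
    for depth in filters:
        size = conv_valid(size, kernel)
        shapes.append(("conv", size, size, depth))
        size = pool_ceil(size)
        shapes.append(("pool", size, size, depth))
    return shapes


shapes = trace_shapes(32, filters=[64, 128, 256])
```

The last entry of `shapes` gives the 3x3x256 volume that is flattened and passed to the densely-connected layers.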
5.3 Results
The content of this section is structured as follows. First, the results
obtained by the CNNs during training and validation are presented, followed
by the scores achieved in the competition.
5.3.1 Evaluation
The dataset provided by Kaggle for training was divided into two:
where N is the total size of the dataset, N = 10868 and M = 1086. The
validation set was used to search the parameters of the networks and to know
when to stop training. In particular, we stopped training the network if the
validation loss increased in 10 iterations.
The next figure shows the accuracy and the cross-entropy achieved by the
models presented in 5.2 until they reached the 100th training iteration.
(a) Training & Validation accuracy (b) Training & Validation Cross-Entropy
It can be observed that the CNN with only one convolutional layer performs
worse than the other networks. The performance of the networks on the
training set at the 500th iteration is shown next.
It can be observed that the convolutional neural networks with one and three
convolutional layers had problems mainly while labeling samples from Ramnit,
Lollipop, Tracur, Kelihos_ver1 and Obfuscator.ACY, and they ended up
misclassifying some samples as belonging to the Gatak malware family. In
particular, the largest number of misclassifications was produced for
samples of the Lollipop family, with 98 and 33 incorrect classifications by
the convolutional nets with one and three convolutional layers, respectively.
Moreover, it can be seen that the training error of the convolutional network
with two layers is lower than that of the other two because it greatly
reduced the number of samples misclassified as Gatak. It achieved a training
accuracy of 0.9978, very near to the one obtained by the winner's solution
(0.9987), and a loss of 0.0231, which is also lower than that obtained in [1]
using only the subsets of features named IMG1 (Haralick features) and IMG2
(Local Binary Pattern features) as represented in 4.3, which are 0.9718 &
0.1098 and 0.9736 & 0.1230, respectively.
5.3.2 Testing
Usually, Kaggle provides a test set without labels in its competitions, and
the Microsoft Malware Classification Challenge is no different. Therefore,
to evaluate our models on the test set we have to submit a file with
the predicted probabilities for each class to Kaggle. These submissions are
evaluated using the multi-class logarithmic loss. The logarithmic loss metric
is defined as:
\[
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{i,j} \log(p_{i,j})
\]
where N is the number of observations, M is the number of class labels,
log is the natural logarithm, yi,j is 1 if the observation i is in class j and 0
otherwise, and pi,j is the predicted probability that observation i is in class
j.
This type of evaluation metric provides extreme punishment for being
confident and wrong. That is, if the algorithm makes a single prediction that
an observation is definitely true (1) when it is actually false, it adds
infinity to the error score, making every other observation pointless. Hence,
in its competitions, Kaggle bounds the predictions away from the extremes by
using the following formula:
\[
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)
\]
Moreover, the submitted probabilities are not required to sum to one because
they are rescaled prior to being scored.
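As a minimal sketch of the metric described above, the following Python
function computes the multi-class logarithmic loss for a list of predicted
rows. The function name, the order of the rescaling and clipping steps and
the clipping bound eps = 1e-15 are illustrative assumptions and are not
taken from the competition's scoring code.

```python
import math

def multiclass_logloss(y_true, probs, eps=1e-15):
    """Mean multi-class logarithmic loss.

    y_true: list of true class indices, one per observation.
    probs:  list of rows, each holding M predicted class scores.
    eps:    clipping bound (an assumption, not a value from the thesis).
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Rescale the row so that it sums to one, as the text describes.
        p = row[label] / sum(row)
        # Bound the probability away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        # y_ij is 1 only for the true class, so one term survives per row.
        total += math.log(p)
    return -total / len(y_true)

# Equal-probability benchmark: predict 1/9 for each of the nine families.
uniform = [[1.0 / 9] * 9 for _ in range(4)]
benchmark = multiclass_logloss([0, 3, 5, 8], uniform)
print(round(benchmark, 9))  # 2.197224577

# A confident but wrong prediction is penalised heavily, yet stays finite.
wrong = multiclass_logloss([0], [[0.0, 1.0] + [0.0] * 7])
print(wrong)
```

With the uniform prediction the loss equals the natural logarithm of 9,
which is the equal probability benchmark used in this chapter.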
Additionally, submissions in Kaggle are evaluated with two scores, the
public score and the private score. The first is calculated on
approximately 30% of the test data and the second on the other 70%.
The best results were obtained by the CNN with 3 convolutional layers
and 2 densely-connected layers, which obtained a public and a private score
at iteration 500 of 0.117629734 and 0.134821767, respectively. That is an
improvement of 94.64% and 93.86% with respect to the equal probability
benchmark (logloss=2.197224577), which is obtained by submitting 1/9 for
every prediction. In contrast, the other models achieved their respective
lowest scores between iterations 50 and 100, which coincides with the point
where the algorithm converges to a local minimum. Unfortunately, they were
not able to learn a better underlying relationship from the training data
and ended up performing much worse than the convolutional network with 3
convolutional layers.
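The improvement figures can be checked with a few lines of Python. This is
only a sketch: the helper name improvement is hypothetical, and the
improvement is assumed to be the relative reduction of the log loss with
respect to the benchmark.

```python
import math

def improvement(score, benchmark):
    """Relative reduction of the log loss w.r.t. a benchmark, in percent."""
    return 100.0 * (1.0 - score / benchmark)

# Log loss of predicting 1/9 for every one of the nine classes.
benchmark = math.log(9)

# Private score of the CNN with 3 convolutional layers.
private_gain = improvement(0.134821767, benchmark)
print(f"{private_gain:.2f}")  # 93.86
```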
Chapter 6
As described in sections 4.3 and 4.4, the approaches that performed better in
the competition were those that extracted features from the disassembled
files, such as n-gram counts. An n-gram is a contiguous sequence of n items
from a sentence. In our case, those items are opcodes extracted from the
disassembled files. However, the main problem with extracting n-grams is
that the number of features increases exponentially as n increases. In
particular, a 2-gram model results in a two-dimensional matrix of
$256^2 = 65536$ features, a 3-gram model in a three-dimensional matrix of
$256^3 = 16777216$ features, a 4-gram model in a four-dimensional matrix of
$256^4 = 4294967296$ features, and so on, which turns out to be
computationally very expensive.
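To make the growth concrete, the following sketch counts opcode n-grams in
a short sequence and prints the size of the feature space for n = 2, 3 and
4. The function names and the sample opcode sequence are hypothetical and
serve only as an illustration.

```python
from collections import Counter

def ngram_counts(opcodes, n):
    """Count every contiguous run of n opcodes in a sequence."""
    grams = zip(*(opcodes[i:] for i in range(n)))
    return Counter(grams)

def feature_space_size(vocabulary_size, n):
    """Number of distinct n-grams over a vocabulary of the given size."""
    return vocabulary_size ** n

# Hypothetical opcode sequence, used only for illustration.
sample = ["push", "mov", "push", "mov", "call", "pop"]
bigrams = ngram_counts(sample, 2)
print(bigrams[("push", "mov")])  # 2

for n in (2, 3, 4):
    print(n, feature_space_size(256, n))
```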
The network trained by Yoon Kim was a simple CNN with one layer on
top of word vectors obtained using Word2Vec, an unsupervised neural
language model. These word vectors were trained using the model proposed
by Mikolov in [20] on 100 billion words of Google News and are available at
https://code.google.com/p/word2vec/.
6.1. REPRESENTING OPCODES AS WORD EMBEDDINGS
The approaches that are based on this hypothesis can be divided broadly
into two categories:
Word2Vec [20] is a very efficient predictive model for learning word
embeddings. Word2Vec comes with two different approaches to learn the
vector representations of words: (1) the Continuous Bag of Words (CBOW)
model and (2) the Skip-Gram model. The main difference between the two is
that CBOW predicts target words from source context words, while the
Skip-Gram model predicts source context words from target words (the
context of a word consists of the words to its left and the words to its
right).
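The difference between the two models can be illustrated by the training
pairs each one is built from. The sketch below is an assumption about how
such pairs may be generated, not the Word2Vec implementation itself; the
function names and the opcode sequence are hypothetical.

```python
def context_windows(tokens, window):
    """Yield (target, context words) for every position in the sequence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def skipgram_pairs(tokens, window):
    """Skip-Gram: one (input=target, output=context word) pair per context word."""
    return [(t, c) for t, ctx in context_windows(tokens, window) for c in ctx]

def cbow_pairs(tokens, window):
    """CBOW: one (input=all context words, output=target) pair per position."""
    return [(tuple(ctx), t) for t, ctx in context_windows(tokens, window)]

tokens = ["push", "mov", "call", "pop"]  # hypothetical opcode sequence
print(skipgram_pairs(tokens, 1))
print(cbow_pairs(tokens, 1))
```

Skip-Gram produces one pair per context word, so the same sequence yields
more training examples than under CBOW.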
To learn the word embeddings we used the Skip-Gram approach; an
explanation of the CBOW model is therefore out of scope and is not provided.
As the network cannot be fed with words as plain text strings, a way to
represent words is needed. For that purpose, a vocabulary of words is first
built from the malware training samples. If all operation codes appear in
the samples, the vocabulary will consist of 665 words. Accordingly, a word
like "push" is represented as a one-hot vector. This vector has 665
components (one for every word in the vocabulary), with a 1 in the position
corresponding to the word "push" and 0s in all of the other positions.
The output layer depends on the window size. Thus, for a window size of
one (predicting just one word to the left and one to the right of the target
word), the network will output a two-dimensional vector, with one dimension
containing the probabilities of the words in the vocabulary appearing to
the left of the target word and the other dimension containing the
probabilities of the words in the vocabulary appearing to the right of the
target word. The dimension of the hidden layer, or embedding layer,
corresponds to $V \times E$, where $V$ is the size of the vocabulary and
$E$ is the embedding size.
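The role of the V x E embedding layer can be sketched as follows: when a
one-hot vector is multiplied by the embedding matrix, the result is simply
the row of the matrix that belongs to the word. The toy vocabulary, the
embedding size and the random initial weights below are assumptions made
for illustration only.

```python
import random

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(vector, matrix):
    """Multiply a row vector (length V) by a V x E matrix."""
    columns = zip(*matrix)
    return [sum(v * m for v, m in zip(vector, col)) for col in columns]

# Toy vocabulary; the thesis vocabulary has up to 665 opcodes.
vocab = {"push": 0, "pop": 1, "mov": 2, "call": 3}
V, E = len(vocab), 3

rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(E)] for _ in range(V)]

x = one_hot(vocab["push"], V)
hidden = matvec(x, embeddings)

# Multiplying by a one-hot vector just selects one row of the matrix.
print(hidden == embeddings[vocab["push"]])  # True
```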
\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)
\]
where $c$ is the size of the training context (a larger $c$ results in more
training examples and can lead to a higher accuracy at the expense of the
training time) and $p(w_O \mid w_I)$ is formulated as:
\[
p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_{w}}^{\top} v_{w_I})}
\]
where $v_w$ and $v'_w$ are the "input" and "output" vector representations
of words in the vocabulary and $W$ is the number of words in the vocabulary.
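A minimal sketch of this softmax probability is given below, assuming the
"output" vectors are stored as a list with one entry per vocabulary word.
The function names and the toy vectors are hypothetical. The sum in the
denominator runs over the whole vocabulary, which is what makes the full
softmax costly for large W.

```python
import math

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def softmax_probability(output_vectors, input_vector, target):
    """p(w_O | w_I) under the full softmax.

    output_vectors: list of W "output" vectors v'_w, one per vocabulary word.
    input_vector:   "input" vector v_{w_I} of the target (centre) word.
    target:         index of the context word w_O.
    """
    scores = [dot(v_out, input_vector) for v_out in output_vectors]
    # Subtract the maximum score for numerical stability; the ratio is unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return exps[target] / sum(exps)

# Hypothetical 2-dimensional vectors for a 3-word vocabulary.
v_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_in = [2.0, 0.0]
probs = [softmax_probability(v_out, v_in, w) for w in range(3)]
print(probs)
```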
\[
\log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{K} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right]
\]
which is used to replace every $\log p(w_O \mid w_I)$ term in the Skip-Gram
objective. In consequence, the task is to distinguish the target word $w_O$
from draws from the noise distribution $P_n(w)$ using logistic regression,
where there are $K$ negative samples for each data sample
($K \approx 5\text{--}20$). The noise distribution $P_n(w)$ is a design
parameter. The unigram distribution $U(w)$ of the training data was
selected as the noise distribution because it is known to work well for
training language models. This distribution assumes that each word in a
sequence is independent of the others. In consequence, we would need to
estimate the probability of a sequence $S$ in the malware’s language model,
$P(S \mid M)$. The probability generated for a specific sequence is
calculated as follows:
\[
P(S) = \prod_{w \in S} P(w)
\]
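The negative sampling term for a single (input word, context word) pair can
be sketched as follows. The expectation over the noise distribution is
approximated here by K words drawn from the unigram counts; the opcode
counts, the vectors and the function names are hypothetical, and the
logistic sigmoid is assumed as the squashing function.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_objective(v_in, v_out_target, v_out_noise):
    """log s(v'_wO . v_wI) plus the sum of log s(-v'_wi . v_wI) over K noise words."""
    positive = math.log(sigmoid(dot(v_out_target, v_in)))
    negative = sum(math.log(sigmoid(-dot(v, v_in))) for v in v_out_noise)
    return positive + negative

def draw_noise_words(unigram_counts, k, rng):
    """Draw k noise words from the unigram distribution of the training data."""
    words = list(unigram_counts)
    weights = [unigram_counts[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

# Hypothetical opcode counts and 2-dimensional "output" vectors.
counts = {"push": 50, "pop": 30, "mov": 15, "call": 5}
out_vecs = {
    "push": [0.5, 0.1],
    "pop": [0.4, 0.2],
    "mov": [-0.3, 0.6],
    "call": [-0.5, -0.4],
}
rng = random.Random(0)
noise = draw_noise_words(counts, k=5, rng=rng)
value = negative_sampling_objective(
    [0.6, 0.1], out_vecs["pop"], [out_vecs[w] for w in noise]
)
print(noise, value)
```

Training maximises this quantity, which only touches K + 1 output vectors
per pair instead of all W vectors required by the full softmax.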
To find the word embeddings, a window size of 5 was used, meaning that for
each target word the skip-gram approach tried to predict the five words to
its left and the five words to its right. The learned embeddings are
visualized next using the t-SNE algorithm [35].
For instance, the opcodes whose vector representations are most similar
to the opcode "push" are:
1. pop
2. insertps
3. fucomp
which makes sense because many push instructions in malware files are
followed by a pop instruction, or vice versa, and they are the two most
frequently used opcodes. To compute the similarity between two vectors p
and q, the Euclidean distance was used:
\[
d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
\]
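As an illustration of how the most similar opcodes can be retrieved with
this distance, the sketch below ranks the words of a toy embedding table by
their Euclidean distance to a query word. The two-dimensional vectors are
hypothetical and are not the embeddings learned in this thesis.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def nearest(word, embeddings, k=3):
    """Return the k words whose vectors lie closest to the given word."""
    others = (w for w in embeddings if w != word)
    ranked = sorted(others, key=lambda w: euclidean(embeddings[word], embeddings[w]))
    return ranked[:k]

# Hypothetical 2-dimensional embeddings, used only for illustration.
toy = {
    "push": [1.0, 1.0],
    "pop": [1.1, 0.9],
    "mov": [0.0, 2.0],
    "call": [-2.0, -1.0],
}
print(nearest("push", toy, k=2))  # ['pop', 'mov']
```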
6.2. CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE
In addition, all malware files with fewer than N opcodes are padded with
UNKNOWN tokens ("UNK").
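The padding step can be sketched as follows. The function name is
hypothetical, and truncating files that are longer than N is an assumption
added here so that every sequence ends up with the same length; the text
above only states how shorter files are handled.

```python
def pad_or_truncate(opcodes, length, pad_token="UNK"):
    """Return exactly `length` tokens, padding short files with pad_token."""
    clipped = opcodes[:length]  # assumption: longer files are cut to length
    return clipped + [pad_token] * (length - len(clipped))

print(pad_or_truncate(["push", "mov", "pop"], 5))
# ['push', 'mov', 'pop', 'UNK', 'UNK']
```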
6.3 Results
This section details how the parameters of the network were selected and
the heuristic search that was performed, followed by the results obtained
on both the training and the test set.
6.3.1 Evaluation
Heuristic Search
2. Embedding size = 32
3. #filters = 64
5. Batch size = 64
The values that were considered for each particular parameter of the
network are given next:
A learning rate of $\lambda = 0.001$ was selected because a large learning
rate can make gradient descent overstep the minimum, and because this
value has been reported to work well in scientific publications.
It can be observed in the plot that the CNNs trained with E=32 and E=64
performed better than the others and similarly to each other. As a higher
E implies a higher training cost, an embedding size of 32 was selected.
• #filters = 64
After selecting the parameters, the neural network was trained with a
mini-batch size of 256 until the validation loss had increased for 10
consecutive iterations. However, for comparison purposes, the number of
training epochs was limited to 25. Notice that the number of training
iterations per epoch is 39. The number of training iterations per epoch is
computed as:
\[
\#\text{iterations} = \lceil N / \text{batch\_size} \rceil
\]
where N is the total size of the training dataset (N=9781). The respective
confusion matrices, together with the accuracy and the cross-entropy of the
models up to epoch 25, are shown next.
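The iteration count and the stopping rule can be sketched as follows. The
function names are hypothetical, the last partial mini-batch is assumed to
count as a full iteration, and the stopping rule is one possible reading of
"the validation loss increased in 10 iterations continuously".

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)

def should_stop(validation_losses, patience=10):
    """True once the validation loss has increased `patience` times in a row."""
    if len(validation_losses) <= patience:
        return False
    recent = validation_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))

print(iterations_per_epoch(9781, 256))  # 39
```

Since 9781 / 256 is approximately 38.2, the last mini-batch of each epoch
is only partially filled under this assumption.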
[Figure: (a) Training & Validation Accuracy; (b) Training & Validation Cross-Entropy]
6.3.2 Testing
The CNN models trained for 25 epochs were used to generate, for each sample
in the test set, the probabilities of belonging to each malware family.
The public and the private score achieved by each model are listed next.
Model                                 Public        Private
without pretrained word embeddings    0.048533931   0.031669778
with pretrained word embeddings       0.048851643   0.036707683
2. The malware’s code then iterates through every byte of the data that
needs to be encoded, XORing each byte with the selected key.
When the attacker needs to deobfuscate the string, it repeats step 2,
XORing each byte in the encoded string with the key value.
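The symmetry of this encoding can be seen in a short sketch. A single-byte
key is assumed here, and both the key value and the function name are
hypothetical; applying the same operation twice returns the original
bytes.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

key = 0x5A  # hypothetical key, chosen only for illustration
encoded = xor_bytes(b"example string", key)
decoded = xor_bytes(encoded, key)
print(decoded)  # b'example string'
```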
Chapter 7
Conclusions
This master’s thesis studies the problem of classifying malware into their
corresponding families. In order to explore the problem, we used one of the
most recent and biggest datasets publicly available, which was provided by
Microsoft for the Big Data Innovators Gathering Cup (BIG 2015). This dataset
provides two files for each malware sample.
This thesis presents two novel and scalable approaches using Convolutional
Neural Networks to recognize the family a malware sample belongs to.
on the observation that images of different malware samples from the
same family appear to be similar while images of malware samples
belonging to a different family are distinct. This property is useful to
classify new malware binaries that have been created by re-using old
malware. That’s because images are useful to detect small changes
while retaining the global structure and the new samples would be
very similar visually to the old ones. In consequence, in this thesis,
we studied the application of CNNs to learn a feature hierarchy all the
way from pixels to the layers of the classifier.
The first and the second approach obtained a score of 0.134821767 and
0.031669778, respectively. That is an improvement of 93.86% and 98.56%
with respect to the equal probability benchmark (logloss=2.197224577), which
is obtained by submitting 1/9 for every prediction. Unfortunately, neither
approach outperformed the winner’s solution of the competition, which
obtained a logloss equal to 0.002833228. That’s because their solution
combined different features such as opcode 2-, 3- and 4-grams as well as the
number of lines per section in the disassembled files, among others.
However, the results obtained are quite promising because both approaches
are able to classify malware samples much faster than all those solutions
that rely on the manual extraction of features and thus are more scalable.
7.1. FUTURE WORK
samples belonging to the Ramnit and the Lollipop families. That’s because
samples from both families contain more than 20,000 opcodes per file on
average.
Bibliography
[2] Clint Feher Shlomi Dolev Asaf Shabtai, Robert Moskovitch and Yuval
Elovici. Detecting unknown malicious code by applying classification
techniques on opcode patterns. In Security Informatics. 2012.
[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient
methods for online learning and stochastic optimization. J. Mach. Learn.
Res., 12:2121–2159, July 2011.
[6] Ekta Gandotra, Divya Bansal, and Sanjeev Sofat. Malware analysis and
classification: A survey. Journal of Information Security, pages 56–64,
2014.
[7] Dragos Gavrilut, Mihai Cimpoes, Dan Anton, and Liviu Ciortuz. Mal-
ware detection using machine learning. Proceedings of the International
Multiconference on Computer Science and Information Technology, pages
735–741, 2009.
[8] Li Deng George E. Dahl, Jack W. Stokes and Dong Yu. Large-scale
malware classification using random projections and neural network.
ICASSP, 2013.
[11] Javier Nieves Yoseba K. Penya Borja Sanz Igor Santos, Felix Brezo
and Carlos Laorden. Opcode-sequence-based malware detection. In
Engineering Secure Software and Systems, volume 5965.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
optimization. CoRR, abs/1412.6980, 2014.
[14] J.Z. Kolter and M.A. Maloof. Learning to detect and classify malicious
executables in the wild. Journal of Machine Learning Research, pages
2721–2744, 2006.
[17] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient
mini-batch training for stochastic optimization. In Proceedings of the
20th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’14, pages 661–670, New York, NY, USA, 2014.
ACM.
[18] Robert Lyda and James Hamrock. Using entropy analysis to find en-
crypted and packed malware. IEEE Security and Analysis, 5:40–45,
2007.
[19] Zahra Salehi Mahboobe Ghiasi, Ashkan Sami. Dynamic malware detec-
tion using registers values set analysis.
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff
Dean. Distributed representations of words and phrases and their com-
positionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani,
and K. Q. Weinberger, editors, Advances in Neural Information Process-
ing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
[22] Anil Thomas Nikos Karampatziakis, Jack Stokes and Mady Marinescu.
Using file relationships in malware classification. Detection of Intrusions
and Malware, and Vulnerability Assessment, 7591:1–20, 2013.
[23] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A
holistic representation of the spatial envelope. International Journal of
Computer Vision, 42:145–175, 2001.
[24] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A
holistic representation of the spatial envelope. Int. J. Comput. Vision,
42(3):145–175, May 2001.
[25] Ning Qian. On the momentum term in gradient descent learning algo-
rithms. Neural Netw., 12(1):145–151, January 1999.
[26] Smita Ranvee and Swapnaja Hiray. Comparative analysis of feature ex-
traction methods of malware detection. International Journal of Com-
puter Applications, 120, 2015.
[27] Clint Feher Nir Nissim Robert Moskovitch, Dima Stopel and Yuval
Elovici. Unknown malcode detection via text categorization and the
imbalance problem. IEEE International Conference on Intelligence and
Security Informatics, pages 156–161, 2008.
[28] Joshua Saxe and Konstantin Berlin. Deep neural network based malware
detection using two dimensional binary program features, 2015.
[29] Sjsu Scholarworks and Donabelle Bays. Structural entropy and meta-
morphic malware, 2013.
[32] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
Ruslan Salakhutdinov. Dropout: A simple way to prevent neural net-
works from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, January
2014.
[33] Kephart J.O. Tesauro, G.J. and Gregory B Sorkin. Neural networks for
computer virus recognition. IEEE International Conference on Intelli-
gence and Security Informatics, 11, 1996.
[35] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional
data using t-SNE. 2008.
[36] Nitin Rai Veeramani R. Windows api based malware detection and
framework analysis. International Journal of Scientific & Engineering
Research, 3, 2012.
[37] Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, and
Min Zhao. Sbmds: an interpretable string based malware detection
system using svm ensemble with bagging, 2009.