Classification
by
Qikai Lu
Master of Science
in
Preface
Acknowledgements
I would like to thank Dr. Niu for helping me through my research. His supervision helped me greatly in overcoming the difficulties encountered during my research, and his guidance offered me lasting insights on how to become a good researcher.
I would also like to thank my family, who supported me throughout these years. Their encouragement allowed me to pursue my research with full effort.
Table of Contents
1 Introduction
    1.1 Problem Motivation
    1.2 Malware Classification
    1.3 Our Contribution
    1.4 Thesis Outline
2 Related Work
    2.1 Malware Classification by Raw Bytes
        2.1.1 Byte Sequence Classification
        2.1.2 Malware Image Classification
    2.2 Transformers in Malware Classification
    2.3 Two-Stage Framework for Malware Classification
3 Model Designs
    3.1 Background on Transformer
    3.2 SeqConvAttn: Byte Sequence Classifier
    3.3 ImgConvAttn: Malware Image Classifier
        3.3.1 Model Architecture
        3.3.2 Image Generation
    3.4 Two-Stage Framework
    4.1 Datasets
        4.1.1 BIG 2015
        4.1.2 Sub-BODMAS
    4.2 Test Metrics
    4.3 Test Environment
5 Independent Model Experiments
6 Two-Stage Experiments
    6.1 Experiment Design
    6.2 Results on Two-Stage Framework
Bibliography
List of Tables
List of Figures
List of Variables
Transformer-based Model
Two-Stage Framework
Chapter 1
Introduction
objective of real-time malware classification.
information. We note, from our survey of related works, that barring some exceptions, such as [4] and [14], most classifiers are CNNs.
experimental observations, we also added a file-size-aware mechanism which pre-emptively diverts certain binary files directly to the second stage to optimize the framework, further reducing the inference latency while maintaining high classification accuracy.
We evaluated the performance of the proposed models on two datasets: the BIG 2015 dataset from the Microsoft Malware Classification Challenge [18], and Sub-BODMAS, a select subset of the BODMAS Malware Dataset [19]. Our experiments show that SeqConvAttn attained accuracy and weighted-F1 scores superior to most baseline models on sequence-based classification. ImgConvAttn is likewise shown to be superior to the CNN baseline on image-based classification. We then evaluated the two-stage framework on the Sub-BODMAS dataset and showed that, compared to independent baseline models, the two-stage framework maintained very high accuracy while significantly reducing inference latency.
Chapter 2
Related Work
tify the weakness of the architecture. [25] conducted activation analysis to gain further insights into the information learned from byte sequences. The analysis concluded that filters in the low-level convolutional layers were able to identify salient ASCII and instruction sequences. [26] then extended Malconv to multi-class classification on the BIG 2015 dataset [18]. Finally, [5] introduced two major improvements to the original Malconv architecture. First, a convolution-over-time scheme was introduced, allowing Malconv to process binary files of arbitrary sizes in an efficient manner. Second, to better model dependencies between distant elements in the byte sequence, they proposed the Global Channel Gating mechanism.
As alternatives to Malconv, [4] and [14] introduced CNN-BiLSTM and CNN-BiGRU, respectively, to process raw bytes. To handle the length of the byte sequence, these approaches treat the byte sequences as 1D greyscale images and resize them to a length of 10,000 elements. More importantly, these papers demonstrate the potential of applying NLP architectures to malware classification.
directly converted into an RGB image and re-sized to the specified input dimensions. [12] repeated this approach, but replaced ResNet50 with VGG16 [28]. [29] further proposed integrating an attention mechanism into the CNN, both to improve classification accuracy and to identify salient sections of the greyscale image for analysis. [30] applied Inception-v3 [31] and Inception-ResNet-v2 [32] to Android malware detection. [33] then further investigated transfer learning with ResNet50. In general, there is a clear trend of employing deeper networks to achieve greater accuracy. However, this also translates to higher per-file inference latency. In Table 2.1, we report the CPU latency of a single-image feedforward through common deep models based on their TorchVision implementations. While the latency could be significantly reduced by running these models on a sufficiently powerful GPU, the environments in which firewalls and end-point security software are installed may lack such hardware, rendering these models unsuitable for many applications.
As an alternative to greyscales, some recent works have proposed generating malware images by analyzing byte frequency. [7] proposed generating the malware image as a Markov image (transition matrix) rather than a greyscale. The Markov image has a dimension of 256 × 256, with each pixel representing the transition probability from one byte value to another in the byte sequence. [8] further experimented with training a deep CNN model from scratch using Markov images. [13] devised an architecture that accepts the Markov image along with the greyscale and RGB images, augmenting the information extracted from the binary file. [9] investigated an alternative that directly records the frequency count of each byte bigram, and then applies a discrete cosine transform to desparsify the image.
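To make these constructions concrete, the following is a minimal sketch of how a bigram frequency image and a Markov image could be derived from a binary file; the row-wise normalization is an assumption, as the cited works differ in their exact formulations.

```python
import numpy as np

def bigram_frequency_image(data: bytes) -> np.ndarray:
    """Count occurrences of each byte bigram into a 256x256 image."""
    arr = np.frombuffer(data, dtype=np.uint8)
    img = np.zeros((256, 256), dtype=np.float64)
    # Each consecutive byte pair (arr[i], arr[i+1]) indexes one pixel.
    np.add.at(img, (arr[:-1], arr[1:]), 1)
    return img

def markov_image(data: bytes) -> np.ndarray:
    """Row-normalize the bigram counts into transition probabilities."""
    counts = bigram_frequency_image(data)
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)
```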
We briefly note the existence of other byte-level feature engineering techniques for extracting texture statistics from greyscale images. For example, [34] experimented with Gabor filters, histograms of oriented gradients, and local binary pattern analysis to enhance the texture information of the greyscale. [35] proposed a texture partitioning and extraction technique to omit unimportant regions of a greyscale image representation. [2] devised second-order texture features via statistical methods. However, these methods do not currently appear to be widespread.
2.3 Two-Stage Framework for Malware Classification
A number of works have investigated two-stage frameworks for malware classification. However, the purpose of the stages differs greatly between designs. [40] implemented the framework to perform tiered classification: the first stage distinguishes malware from benignware, and the second stage classifies the malware identified by the first stage. The designs of [41] and [42] are functionally similar, but specialize their second tier to discerning only whether the malware is ransomware. Echelon, developed by [43], uses the second stage to double-check software that is predicted benign by the first stage, reducing the overall false negative rate, i.e., the rate at which actual malware is predicted as benign. TuningMalconv [44] uses the second stage to reclassify a file only if the first-stage classification is uncertain, with the intent of boosting the accuracy of the first stage without significantly increasing the average latency. TAMD [45] is functionally similar, but architecturally more comprehensive: the design employs model ensembles, rather than single models, in both stages. Additionally, the framework is designed to facilitate both efficient model training and classification.
Chapter 3
Model Designs
$\mathrm{SA}(X) = \mathrm{softmax}\!\left(\dfrac{X W_Q (X W_K)^{\top}}{\sqrt{d_{kqv}}}\right) X W_V$   (3.1)
its corresponding query element and every key element. By carrying out the entire self-attention computation as a series of matrix multiplications, all target element encodings are computed concurrently.
To learn the different types of inter-dependencies that may exist within the sequence, [15] proposed employing multiple scaled dot-product attention heads in parallel. As the weights of each attention head are initialized differently, different heads can potentially capture different types of inter-dependency in the sequence. To combine the information learned by the parallel heads, their outputs are concatenated along the encoding dimension and re-projected to a final encoding. The entire design is referred to as multihead attention, with the mathematical definition presented by Equation 3.2.

$\mathrm{MHA}(X) = \mathrm{concat}\left(\mathrm{SA}_1(X), \ldots, \mathrm{SA}_H(X)\right) W_O$   (3.2)

Here, $H$ refers to the number of parallel attention heads, and $W_O \in \mathbb{R}^{H d_{kqv} \times d_m}$ to the post-concatenation projection weight. Note that in most transformer designs, the model dimension stays invariant after undergoing multihead attention. Thus, for all subsequent models, $d_m = H d_{kqv}$.
A conventional transformer architecture is composed of multiple encoder blocks connected serially. Each block contains a multihead attention component followed by a feedforward component, with an interjecting residual connection after each component. The feedforward block consists of a ReLU-activated expansion layer followed by a restorative layer, as shown by Equation 3.3 [15].

$\mathrm{FF}(X) = \mathrm{ReLU}(X W_1 + b_1) W_2 + b_2$   (3.3)

Note that the expansion layer parameters $W_1 \in \mathbb{R}^{d_m \times B d_m}$, $b_1 \in \mathbb{R}^{B d_m}$, and the restorative layer parameters $W_2 \in \mathbb{R}^{B d_m \times d_m}$, $b_2 \in \mathbb{R}^{d_m}$. Here, $B$ is referred to as the expansion factor.
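As a concrete illustration, a minimal PyTorch sketch of the feedforward block in Equation 3.3 might look as follows; the class name and the default expansion factor are illustrative only.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Expansion layer -> ReLU -> restorative layer (Equation 3.3)."""
    def __init__(self, d_m: int, expansion: int = 4):
        super().__init__()
        self.expand = nn.Linear(d_m, expansion * d_m)   # W1, b1
        self.restore = nn.Linear(expansion * d_m, d_m)  # W2, b2

    def forward(self, x):
        return self.restore(nn.functional.relu(self.expand(x)))
```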
Transformers have no implicit awareness of the positions of elements in the sequence. To remedy this, [15] proposed adding positional encodings to the initial sequence encoding before self-attention. The positional encodings of the two proposed models differ; thus, further details are deferred to the descriptions of the individual models.
Figure 3.1: Note that the K transformer blocks are connected serially.
Here $n \in \mathbb{Z} \cap [1, N]$ indicates the positional index of an element in the sequence, and $i_d \in \mathbb{Z} \cap [1, d_m]$ the index of the encoding dimension. Essentially, this positional encoding expresses positional context through a set of alternating sinusoids. After adding the positional encoding, the resultant sequence $X$ propagates through a series of $K$ transformer blocks. We refrain from further exposition about transformers here, as details are already presented in Section 3.1. The transformer output encoding is then max-pooled element-wise into a fixed-length vector of dimension $d_m$. This vector then proceeds through additional fully-connected layers, and the final output undergoes softmax, yielding the classification probabilities.
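To make the pipeline concrete, the following is a minimal PyTorch sketch of this architecture, assuming a 1D convolution that compresses non-overlapping 500-byte segments (per the segment size discussed in Section 5.1.3); the layer widths and head counts are placeholders rather than the thesis's actual hyperparameters, and the positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class SeqConvAttnSketch(nn.Module):
    """Byte embedding -> 1D conv over 500-byte segments -> K transformer
    blocks -> element-wise max-pool -> fully-connected classifier."""
    def __init__(self, d_m=256, n_heads=8, n_blocks=4, n_classes=11,
                 segment=500):
        super().__init__()
        self.embed = nn.Embedding(256, d_m)
        # Non-overlapping segments: kernel size == stride == 500.
        self.conv = nn.Conv1d(d_m, d_m, kernel_size=segment, stride=segment)
        layer = nn.TransformerEncoderLayer(d_m, n_heads)
        self.blocks = nn.TransformerEncoder(layer, n_blocks)
        self.fc = nn.Linear(d_m, n_classes)

    def forward(self, byte_ids):                    # (batch, seq_len) int64
        x = self.embed(byte_ids).transpose(1, 2)    # (batch, d_m, seq_len)
        x = self.conv(x).permute(2, 0, 1)           # (n_seg, batch, d_m)
        # Sinusoidal positional encoding would be added here (omitted).
        x = self.blocks(x)
        x = x.max(dim=0).values                     # element-wise max-pool
        return self.fc(x)                           # logits; softmax in loss
```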
One potential limitation of this architecture is that SeqConvAttn, once trained, cannot adapt to the classification of new malware classes without modification. The reason lies with the final layer of the fully-connected block, whose width must match the number of classes specified by the user. However, the modification necessary to accommodate changes in the number of malware classes is relatively minimal. Specifically, the final layer of the fully-connected block is replaced with a layer whose width matches the new number of malware classes. Optionally, all parameters in the fully-connected block are then re-initialized. Finally, the modified SeqConvAttn model is finetuned with additional data to learn to predict the newly specified malware classes.
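A sketch of this adaptation step, assuming the model exposes its final classification layer as a hypothetical `fc` attribute (as in the sketch above):

```python
import torch.nn as nn

def adapt_to_new_classes(model: nn.Module, n_new_classes: int) -> nn.Module:
    """Replace the final fully-connected layer to match the new class count.
    Assumes the classifier head is exposed as `model.fc`."""
    model.fc = nn.Linear(model.fc.in_features, n_new_classes)
    return model
```

The modified model is then finetuned on data labelled with the new classes, optionally after re-initializing all fully-connected parameters.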
Figure 3.2: Note that the append and detach blocks have no learnable parameters, but are used to indicate modifications to the sequence encoding.
the sequence. The purpose of this [SOS] token is to efficiently encapsulate information about the entire sequence into a single vector of length $d_m$. Positional encodings are then added to preserve the positional context of the patches. According to [17], the existence of positional encoding is more important than its type: their experiments showed no noticeable advantage of one particular type of positional encoding over another, so long as some type of positional encoding is used. Thus, we followed the implementation of [46], where the positional encodings of ImgConvAttn are designed to be learnable parameters, such that they can adapt to the model during training.
The resultant sequence encoding then passes through a number of transformer encoder blocks. We refrain from further exposition about transformers here, as details are presented in Section 3.1. Upon obtaining the transformer output sequence, only the [SOS] encoding is kept as the latent image representation, and the rest of the sequence is discarded. The idea is that, after multiple layers of self-attention, the finalized [SOS] encoding should retain sufficient context about the original image. To generate the final classification, the [SOS] encoding propagates through additional fully-connected layers and then undergoes softmax to generate the classification probabilities.
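A minimal sketch of this [SOS]-based design, assuming pre-computed patch embeddings as input; the dimensions and block counts are placeholders, not the thesis's actual hyperparameters.

```python
import torch
import torch.nn as nn

class ImgConvAttnSketch(nn.Module):
    """Prepend a learnable [SOS] token, add learnable positional encodings,
    run transformer blocks, and classify from the [SOS] encoding only."""
    def __init__(self, n_patches=256, d_m=256, n_heads=8, n_blocks=4,
                 n_classes=11):
        super().__init__()
        self.sos = nn.Parameter(torch.zeros(1, 1, d_m))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, d_m))
        layer = nn.TransformerEncoderLayer(d_m, n_heads)
        self.blocks = nn.TransformerEncoder(layer, n_blocks)
        self.fc = nn.Linear(d_m, n_classes)

    def forward(self, patches):                     # (batch, n_patches, d_m)
        sos = self.sos.expand(patches.size(0), -1, -1)
        x = torch.cat([sos, patches], dim=1) + self.pos
        x = self.blocks(x.transpose(0, 1))          # (seq, batch, d_m)
        return self.fc(x[0])                        # keep only [SOS]
```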
As with SeqConvAttn, ImgConvAttn can be modified and finetuned with minimal effort to adapt to changes in the malware classes. The procedure is exactly the same as that introduced for SeqConvAttn in Section 3.2.
Figure 3.3: Top: The basic two-stage framework proposed. Bottom: A vari-
ant with an additional file-size-aware mechanism, which pre-emptively diverts
large binary files directly to the second-stage.
3.4 Two-Stage Framework
Theoretically, SeqConvAttn and ImgConvAttn are designed to be functionally complementary: whereas SeqConvAttn should be more accurate, ImgConvAttn should be faster. We draw inspiration from [44] and [45] and devise a two-stage framework to leverage the advantages of both models, as shown by the design in the upper portion of Figure 3.3. The underlying intent is to avoid unnecessarily running the slower SeqConvAttn if ImgConvAttn can generate a confident prediction. We assign ImgConvAttn as the first stage, with an expected per-file inference latency of $t_1$. SeqConvAttn is then assigned as the second stage, with a latency of $t_2$. Given a binary file, ImgConvAttn conducts the initial classification. The classification uncertainty is then checked against a threshold value $\upsilon$. If the uncertainty is below the threshold, the binary file is assigned to the class predicted by ImgConvAttn, concluding the classification process. However, if the uncertainty exceeds the threshold, the binary is subjected to reclassification by SeqConvAttn. Assuming that ImgConvAttn is sufficiently confident in its predictions most of the time, the majority of binary files should incur an inference latency of only $t_1$, while the minority incur a latency of $t_1 + t_2$. In our design, classification uncertainty is defined by Equation 3.5.

$U_{pred} = 1 - P(C_{pred})$   (3.5)

Here, $C_{pred}$ and $U_{pred}$ refer to, respectively, the predicted class and the classification uncertainty, and $P$ refers to the probability "operator", i.e., the softmax output of the classifier.
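Under the reading of Equation 3.5 above, the first-stage uncertainty could be computed as in the following sketch.

```python
import torch

def prediction_uncertainty(logits: torch.Tensor):
    """Return (predicted class, uncertainty) from classifier logits.
    Uncertainty is one minus the probability of the predicted class."""
    probs = torch.softmax(logits, dim=-1)
    p_pred, c_pred = probs.max(dim=-1)
    return c_pred, 1.0 - p_pred
```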
Proper setting of the uncertainty threshold $\upsilon$ offers meaningful control over the tradeoff between the latency and accuracy of the two-stage framework. Thus, we introduce a simple approach to determining $\upsilon$ for a specified latency requirement using a set-aside development (validation) set. Consider an arbitrary uncertainty threshold $\upsilon_p$, such that $p\%$ of files are expected to undergo SeqConvAttn reclassification. For a given latency constraint $t_{spec}$, we can solve for the percentage of files permitted for reclassification by Equation 3.6.

$p\% = \dfrac{t_{spec} - t_1}{t_2}$   (3.6)

Once $p$ is solved, the corresponding $\upsilon_p$ can then be experimentally determined by assessing the classification uncertainties of ImgConvAttn on the development set. Specifically, we set $\upsilon_p$ to the $(100 - p)$th percentile, such that only the $p\%$ most uncertain cases proceed to SeqConvAttn. Assuming that the development set is sufficiently representative of the test environment, a framework with uncertainty threshold $\upsilon_p$ should approximately meet the latency constraint $t_{spec}$.
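The following sketch illustrates the calibration and the resulting dispatch logic, assuming the prediction_uncertainty helper above; the stage models are placeholder callables, with the first stage returning a (class, uncertainty) pair and the second returning a class.

```python
import numpy as np

def calibrate_threshold(dev_uncertainties, t_spec, t1, t2):
    """Solve Equation 3.6 for p, then take the (100 - p)th percentile of
    development-set uncertainties as the threshold."""
    p = float(np.clip(100.0 * (t_spec - t1) / t2, 0.0, 100.0))
    return np.percentile(dev_uncertainties, 100.0 - p)

def two_stage_classify(binary, first_stage, second_stage, threshold):
    """Run the fast first stage; escalate only uncertain cases."""
    c_pred, u_pred = first_stage(binary)
    if u_pred <= threshold:
        return c_pred
    return second_stage(binary)  # reclassify with SeqConvAttn
```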
We briefly note that the lower portion of Figure 3.3 shows a variant of the two-stage framework. This variant contains a supplementary conditional that pre-emptively redirects large binary files that are likely to incur a latency $t_1 \geq t_2$ directly to second-stage classification. Note that this variant is devised based on experimental observation; thus, we defer further discussion of it to Section 6.1.
Chapter 4
4.1 Datasets
4.1.1 BIG 2015
The BIG 2015 dataset [18] was originally provided for the Microsoft Malware Classification Challenge. While the original dataset consists of a "train" and a "test" partition, only the "train" set is labelled. We partitioned the 10,868 labelled malware binaries in the "train" set into disjoint train, validation, and test sets. We present the statistics of the partitioned datasets in Table 4.1.
Table 4.1: Statistics of BIG 2015 Dataset
For each malware sample, the original dataset provides its hexadecimal representation, which we converted into bytes to generate the binary file. In some of these files, some hexadecimal values are "??"; these instances were dealt with by interpreting all "??"s as "00"s during the conversion process. Note that the resultant binary files are sterilized, with the PE headers removed by the original vendor before distribution.
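A minimal sketch of this conversion, assuming the challenge's .bytes format of an address column followed by space-separated hex values on each line:

```python
def bytes_file_to_binary(path: str) -> bytes:
    """Convert a BIG 2015 .bytes file to a raw binary, mapping '??' to 0x00."""
    out = bytearray()
    with open(path) as f:
        for line in f:
            tokens = line.split()[1:]  # drop the leading address column
            out.extend(0 if t == "??" else int(t, 16) for t in tokens)
    return bytes(out)
```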
For subsequent experiments, we do not directly record the performance on BIG 2015 reported by surveyed publications in the results tables. As the test sets differ between those publications and our work, we consider direct comparisons infeasible. However, where necessary, we reference results from surveyed publications for explanation purposes.
4.1.2 Sub-BODMAS
The original BODMAS dataset [19] contains of 57,293 malware binaries, be-
longing to one of the 581 malware families. However, the majority of mal-
ware classes possess insufficient number of malware samples for meaningful
assessment. Consequently, only a subset of the malware files are selected for
experimentation. Specifically, we only retrieved files whose malware class con-
tains more than 1000 instances timestamped from and after January 1, 2020.
The resultant dataset contains 23065 malware binaries in total, drawn from
11 classes. The subset is then partitioned into disjoint train, validation, and
test sets. We refer to the resultant dataset as Sub-BODMAS. Statistics on
Sub-BODMAS is presented on Table 4.2.
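A sketch of this selection step, assuming the BODMAS metadata is available as a DataFrame with hypothetical `family` and `timestamp` columns:

```python
import pandas as pd

def select_sub_bodmas(meta: pd.DataFrame) -> pd.DataFrame:
    """Keep families with >1000 samples timestamped on/after 2020-01-01."""
    recent = meta[meta["timestamp"] >= pd.Timestamp("2020-01-01")]
    counts = recent["family"].value_counts()
    keep = counts[counts > 1000].index
    return recent[recent["family"].isin(keep)]
```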
Table 4.2: Statistics of Sub-BODMAS Dataset
class.
• Latency is the average duration between when the binary file content is loaded into RAM and when the predicted classification is obtained. For the assessment of independent models, this accounts for the byte sequence preprocessing (such as truncation or image conversion) and model feedforward time. For the two-stage framework, this also includes the time taken by the latency control measures. Note that, to address portability issues, the latency assessment is done by running the classifiers on CPUs only.
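A sketch of how this per-file latency could be measured, timing preprocessing plus feedforward with a performance counter; the `preprocess` and `model` callables are placeholders.

```python
import time

def measure_latency(binary: bytes, preprocess, model) -> float:
    """Time preprocessing + feedforward for one file, in milliseconds."""
    start = time.perf_counter()
    features = preprocess(binary)   # e.g., truncation or image conversion
    _ = model(features)
    return (time.perf_counter() - start) * 1000.0
```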
Table 4.3: Test Environment Specification
Setting Specification
Python 3.6.9
Pytorch 1.7.1+cu110
Cuda 11.2
OS Ubuntu 18.04.5 LTS
CPU AMD Ryzen 9 3900X 12-Core Processor
Memory 64 GB
GPU GeForce RTX 2080 TI
Chapter 5
Independent Model
Experiments
• Malconv [3]: Our Malconv implementation is taken from an existing GitHub repository [47]. Note that for analysis purposes, we adjusted the original convolutional kernel and stride size from 512 to 500.
5.1.2 Experiment Results
in accuracy and weighted-F1. [4] reported an accuracy of 98.20% using the CNN+BiLSTM model. However, we were not able to replicate this performance using our implementation, CNN+BiLSTM*. Potentially, this is caused by differences in the sequence resizing algorithm or the model design, as the information in the original publication is insufficient for exact replication. However, unless further information becomes available, we consider SeqConvAttn superior in classification accuracy.
5.1.3 Visualization of SeqConvAttn Attention
We further investigate the features learned by the SeqConvAttn model. Recall that the attentional weights correspond to the $\mathrm{softmax}\!\left(\frac{X W_Q (X W_K)^{\top}}{\sqrt{d_{kqv}}}\right)$ matrix in Equation 3.1. Additionally, note that each element in the post-convolution sequence corresponds to a segment of 500 bytes. A high attention value at row $i$ and column $j$ of the matrix thus suggests a salient dependency of byte segment $i$ on byte segment $j$. To visualize such inter-dependencies between segments, the attention maps of all attention heads in the final transformer block are elementwise averaged into a single attention map. The resultant maps for two malware samples are presented at the top of Figure 5.1. Note that the softmax in Equation 3.1 is applied horizontally across the attention map. Furthermore, the natural log was applied to all values in the matrix before display for better visualization.
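A sketch of this visualization procedure, assuming the per-head attention weights of the final block are available as an array of shape (heads, seq, seq):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention_map(head_weights: np.ndarray):
    """Average attention over heads, log-scale, and display."""
    avg = head_weights.mean(axis=0)       # (seq, seq)
    log_map = np.log(avg + 1e-12)         # avoid log(0)
    plt.imshow(log_map)
    plt.xlabel("key byte segment")
    plt.ylabel("query byte segment")
    plt.show()
```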
From the SeqConvAttn attention maps, several vertical highlights, or bright green streaks, can be identified. Additionally, several faint horizontal lines are also noted. That the column indices of the vertical highlights and the row indices of the horizontal lines coincide is no coincidence, as they both address the same byte segments. Based on Equation 3.1, the vertical highlights indicate that most byte segments in the post-convolution sequence have a strong dependency on the highlighted byte segment. On the other hand, the faint horizontal lines indicate that, for these byte segments, there are no strong dependencies with respect to other byte segments in the sequence, as the attentional weight is approximately equal across the sequence. In summary, the highlighted byte segments likely contain salient information critical to malware identification. To some extent, the predominance of these highlights for different byte sequences is somewhat unexpected, as it may imply that few inter-dependencies exist between distant byte segments. However, such an implication runs against intuition: consider that in assembly code, conditional and jump branches allow programs to execute instructions that are nonconsecutive at the binary level, which can be interpreted as a form of dependency. The more likely reason for the absence of inter-dependency is that by compressing
Figure 5.1: Left: The SeqConvAttn attention map (top) and Malconv gating map (bottom) of a SillyP2P file. Right: The SeqConvAttn attention map (top) and Malconv gating map (bottom) of a Berbew file.
segments of 500 bytes into single elements through 1D convolution, the resultant element encodings likely lose the inter-dependency information contained in the byte segments.
We further compared the attention map of SeqConvAttn to the gating map of Malconv. Akin to the attention weights in SeqConvAttn, Malconv employs gated convolution [20] to filter byte segment information for classification. The bottom of Figure 5.1 displays the gating maps computed for the same malware samples as the attention maps. Note that it is difficult to discern from the gating map the emphasis or suppression of information from a particular byte segment. Unlike attention maps, whose values suppress or emphasize the entire element encoding, values of the gating map can independently suppress specific sections (along the encoding dimension) of the element encoding. This hinders the interpretability of Malconv, as salient byte segments cannot be easily identified from the gating map. The most that can be discerned from the gating maps is the presence of different byte sections, indicated by the different texture patterns. Potentially, this is indicative of different types of information being presented by different byte sections. However, unlike with SeqConvAttn attention maps, it is difficult to identify individual byte segments that are salient for classification. From this, we note another advantage of SeqConvAttn: its attention map is more readily interpretable for human analysis.
Comparing attention maps against gating maps, it is noted that dense sections of high-attention byte segments in the former often correspond to particular binary regions of the latter. We suspect, then, that the underlying features learned by SeqConvAttn and Malconv are likely similar in many cases. This potentially explains the close accuracy scores attained by the two models on the Sub-BODMAS dataset, as shown in Table 5.2. The fact that both SeqConvAttn and Malconv appear to retrieve similar information, despite their relatively different feature extraction mechanisms, suggests a limitation of the convolution-based dimension reduction approach. Essentially, the ability of transformers to model inter-dependencies between any pair of elements is not exploitable in most instances, due to information lost during 1D convolution. Improvements in feature engineering, such that dependencies between features are better preserved, should be the subject of future investigation.
is trained for 100 epochs, with the checkpoint yielding the highest validation
accuracy selected as the optimal version.
The baseline for comparison against ImgConvAttn is the 3C2D model [9]. The original design, proposed by [10], is a shallow CNN consisting of 3 convolution-and-max-pooling layers and 2 fully-connected layers; [9] added dropout to the fully-connected layers to improve model robustness. This baseline is implemented based on the description provided by [9]. Note that, to keep inference latency minimal, we did not implement additional baselines based on deep models. In addition, for each model, we experiment with three different types of image: bigram frequency, Markov, and greyscale.
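A sketch of the 3C2D baseline as described; the channel widths, kernel sizes, input resolution, and dropout rate are assumptions, since [9]'s exact hyperparameters are not reproduced here.

```python
import torch.nn as nn

class ThreeC2D(nn.Module):
    """3 convolution-and-max-pooling layers + 2 FC layers with dropout."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 32 * 32, 256),  # assumes 256x256 input images
            nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):                   # x: (batch, 1, 256, 256)
        return self.classifier(self.features(x))
```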
Two observations can be drawn from the experimental results on BIG 2015. First, the ImgConvAttn model is superior to the 3C2D model in classification accuracy for all image types. While ImgConvAttn is generally 2-3 ms slower than 3C2D, its average inference latency is nevertheless quite low. Second, for ImgConvAttn, among the different malware image types, bigram frequency generates the best results, significantly outperforming greyscale and surpassing Markov images in accuracy and weighted-F1. The synergy of these two observations is demonstrated by the ImgConvAttn-Frequency model, which impressively achieved a higher accuracy than SeqConvAttn while only incurring a quarter of the latter's latency.
Table 5.4: Results of Imaged-based Classification on Sub-BODMAS
Sub-BODMAS, the selection of one architecture over another reflects a tradeoff between accuracy and speed. This is the expected scenario based on the original design philosophy. Thus, we then experimented with leveraging the advantages of both via the two-stage framework.
troduces relationships between bytes that do not exist in the original 1D format. Such relationships can easily be altered by shifting the position of certain byte sections or interjecting bytes at some locations, resulting in significant variation in the resultant greyscale. Second, a byte value cannot be interpreted as a greyscale value. This is because distinct byte values, such as 12 (0x0C) and 255 (0xFF), do not have a comparative relationship that defines one value as greater than the other. This issue becomes relevant when resizing the greyscale, as the interpolation between image pixels has no meaningful interpretation. Note that, from our survey, some publications did achieve very high accuracy using greyscales only. For example, the transfer-learned MCFT-CNN [33], based on ResNet50 [27], reported an accuracy of 98.63%. Thus, greyscale-based classification is not necessarily discredited as an effective type of malware image. However, its lack of an intuitive justification must be acknowledged.
Compared to greyscale images, the bigram frequency images from different Upatre samples appear, at least superficially, relatively consistent. Overlaying the ImgConvAttn attention maps onto their respective frequency images, a number of salient patches appear common between the instances. For example, using the notation [row, column], patches [0, 3], [0, 7], [0, 10], and [0, 12] are highlighted to some extent in all three samples. As each patch encapsulates the frequencies of 256 distinct bigrams, this suggests that certain bigrams within the highlighted sub-images are salient for malware identification, with the more frequent bigrams being the priority candidates. The potential correlation between bigrams and malware class can also be intuitively justified: the execution of malicious activity may rely on particular instruction sets, which remain consistent between different samples. The presence of these instruction sets may be reflected in the occurrence frequencies of specific bigrams. As the bigram frequency image is a matrix, the entry corresponding to a particular bigram occupies the same pixel in all images. Thus, salient information can be captured more easily by ImgConvAttn, leading to higher classification accuracy.
Figure 5.2: Top: The relation between file size and inference latency for
ImgConvAttn-Greyscale. Note that the dashed red line separates the data-
points into two disjoint subsets. A linear equation is then derived for each
subset. Bottom: The relation between file size and inference latency for
ImgConvAttn-Frequency.
Figure 5.3: Left: Generated Greyscale Image. Right: Visualization of ImgConvAttn Attention Map. Note that each row corresponds to a different Upatre sample.
Figure 5.4: Left: Generated Bigram Frequency Image. Right: Visualization of ImgConvAttn Attention Map. Note that each row corresponds to a different Upatre sample.
Chapter 6
Two-Stage Experiments
on uncertainty threshold settings, which map to specific uncertainty thresholds determined from the validation set. Note that in real-world application, we expect the uncertainty threshold $\upsilon$ to be selected based on a specified latency constraint $t_{spec}$ using Equation 3.6. For this experiment, however, $\upsilon$ is selected based on $p\%$, determined from development (validation) set performance, for ease of comparison. We list these uncertainty thresholds in Table 6.1.
The model pairs integrated in the two-stage framework are the following:
Table 6.2: Results of Two-Stage Framework Classification on Sub-BODMAS
For all designs, it is observed that adjusting the uncertainty threshold $\upsilon$ does offer control over the tradeoff between accuracy and latency for the two-stage frameworks, with both metrics monotonically increasing as the uncertainty threshold shrinks (i.e., as $p\%$ grows). However, at some point, a clear effect of diminishing returns is always observed. This effect appears when $p \in \{15, 20, 25, 30\}$: in this range, further increases in latency correspond to successively smaller gains in accuracy. The reason for this phenomenon is that, with smaller uncertainty thresholds, binary files with more certain classifications are also subjected to second-stage reclassification. As first-stage classifications with higher certainty are more likely to be correct, reclassification by the second stage will likely yield the same prediction. This redundancy thus results in the observed diminishing returns.
Comparing the ICAGrey-SCA and ICAFreq-SCA frameworks, the former is significantly faster. However, the accuracy of ICAGrey-SCA is left wanting, with ICAGrey-SCA-25% not even matching the accuracy of ICAFreq-SCA-5%, or that of most of the sequence-based classifiers. This observation is also noted with the file-size-aware two-stage frameworks, where ICAGrey-SCA-fsa remains inferior to ICAFreq-SCA-fsa in accuracy. Examining the latency difference between ICAGrey-SCA and ICAGrey-SCA-fsa, a moderate latency decrease of 1-3 ms is achieved by the latter design. However, a significant latency reduction of about 10 ms is noted between ICAFreq-SCA and ICAFreq-SCA-fsa. This makes sense: as the latency of bigram frequency image generation scales worse with file size than that of greyscale generation, more files are diverted directly to the second stage. Comparing the latency of ICAGrey-SCA-fsa with that of ICAFreq-SCA-fsa, the speed advantage of the former is effectively nullified. We conclude that although the two-stage framework is capable of offsetting the inferior accuracy of the first-stage model through occasional reclassification of uncertain cases by the second-stage model, the inherent accuracy of the first-stage model nevertheless greatly affects the overall performance of the two-stage framework.
A peculiarity noted for ICAFreq-SCA and its file-size-aware variant is that, when the uncertainty threshold $\upsilon$ is sufficiently low, the accuracy attained by the two-stage framework exceeds that of either of its component models. Recall that the accuracies attained by ImgConvAttn and SeqConvAttn on Sub-BODMAS are, respectively, 95.98% and 96.92%. ICAFreq-SCA-20% slightly surpassed those models with an accuracy of 96.96%, and ICAFreq-SCA-fsa-20% surpassed that still with 97.00%. We suspect that this improvement reflects that the two-stage framework, in some instances, operates as an ensemble model: for binary files with uncertain initial classifications, both models become involved in generating the final prediction.
Chapter 7
• The two-stage framework can effectively control the average per-file inference latency. Additionally, by minimizing occasional slowdowns of the first-stage model through the file-size-aware mechanism, superior accuracy can be maintained while lowering the inference latency.
Bibliography
[11] E. Rezende, G. Ruppert, T. Carvalho, F. Ramos, and P. De Geus, “Ma-
licious software classification using transfer learning of resnet-50 deep
neural network,” in 2017 16th IEEE International Conference on Ma-
chine Learning and Applications (ICMLA), IEEE, 2017, pp. 1011–1014.
[12] E. Rezende, G. Ruppert, T. Carvalho, A. Theophilo, F. Ramos, and
P. de Geus, “Malicious software classification using vgg16 deep neural
network’s bottleneck features,” in Information Technology-New Genera-
tions, Springer, 2018, pp. 51–59.
[13] A. Pinhero, M. Anupama, P. Vinod, C. A. Visaggio, N. Aneesh, S. Ab-
hijith, and S. AnanthaKrishnan, “Malware detection employed by visu-
alization and deep neural network,” Computers & Security, p. 102 247,
2021.
[14] S. Venkatraman, M. Alazab, and R. Vinayakumar, “A hybrid deep learn-
ing image-based analysis for effective malware detection,” Journal of
Information Security and Applications, vol. 47, pp. 377–389, 2019.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training
of deep bidirectional transformers for language understanding,” arXiv
preprint arXiv:1810.04805, 2018.
[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T.
Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al.,
“An image is worth 16x16 words: Transformers for image recognition at
scale,” arXiv preprint arXiv:2010.11929, 2020.
[18] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Mi-
crosoft malware classification challenge,” arXiv preprint arXiv:1802.10135,
2018.
[19] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas:
An open dataset for learning based temporal analysis of pe malware,”
in Proceedings of Deep Learning and Security Workshop (DLS), in con-
junction with IEEE Symposium on Security and Privacy (IEEE SP),
2021.
[20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language model-
ing with gated convolutional networks,” in International conference on
machine learning, PMLR, 2017, pp. 933–941.
[21] M. Krčál, O. Švec, M. Bálek, and O. Jašek, “Deep convolutional malware
classifiers can learn from raw executables and labels only,” 2018.
[22] O. Suciu, S. E. Coull, and J. Johns, “Exploring adversarial examples
in malware detection,” in 2019 IEEE Security and Privacy Workshops
(SPW), IEEE, 2019, pp. 8–14.
[23] B. Kolosnjaji, A. Demontis, B. Biggio, D. Maiorca, G. Giacinto, C. Eck-
ert, and F. Roli, “Adversarial malware binaries: Evading deep learning
for malware detection in executables,” in 2018 26th European signal pro-
cessing conference (EUSIPCO), IEEE, 2018, pp. 533–537.
[24] L. Demetrio, B. Biggio, G. Lagorio, F. Roli, and A. Armando, “Ex-
plaining vulnerabilities of deep learning to adversarial malware binaries,”
arXiv preprint arXiv:1901.03583, 2019.
[25] S. E. Coull and C. Gardner, “Activation analysis of a byte-based deep
neural network for malware classification,” in 2019 IEEE Security and
Privacy Workshops (SPW), IEEE, 2019, pp. 21–27.
[26] M. A. Kadri, M. Nassar, and H. Safa, “Transfer learning for malware
multi-classification,” in Proceedings of the 23rd International Database
Applications & Engineering Symposium, 2019, pp. 1–7.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[29] H. Yakura, S. Shinozaki, R. Nishimura, Y. Oyama, and J. Sakuma, “Mal-
ware analysis of imaged binary samples by convolutional neural network
with attention mechanism,” in Proceedings of the Eighth ACM Confer-
ence on Data and Application Security and Privacy, 2018, pp. 127–134.
[30] J. Jung, J. Choi, S.-j. Cho, S. Han, M. Park, and Y. Hwang, “Android
malware detection using convolutional neural networks and data section
images,” in Proceedings of the 2018 Conference on Research in Adaptive
and Convergent Systems, 2018, pp. 149–153.
[31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink-
ing the inception architecture for computer vision,” in Proceedings of
the IEEE conference on computer vision and pattern recognition, 2016,
pp. 2818–2826.
[32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,
inception-resnet and the impact of residual connections on learning,” in
Thirty-first AAAI conference on artificial intelligence, 2017.
[33] S. Kumar et al., “Mcft-cnn: Malware classification with fine-tune convo-
lution neural networks using traditional and transfer learning in internet
of things,” Future Generation Computer Systems, 2021.
[34] N. Kumar and T. Meenpal, “Texture-based malware family classifica-
tion,” in 2019 10th International Conference on Computing, Commu-
nication and Networking Technologies (ICCCNT), IEEE, 2019, pp. 1–
6.
[35] Y. Zhao, C. Xu, B. Bo, and Y. Feng, “Maldeep: A deep learning classi-
fication framework against malware variants based on texture visualiza-
tion,” Security and Communication Networks, vol. 2019, 2019.
[36] M. Q. Li, B. Fung, P. Charland, and S. H. Ding, “I-mad: A novel
interpretable malware detector using hierarchical transformer,” arXiv
preprint arXiv:1909.06865, 2019.
[37] X. Hu, R. Sun, K. Xu, Y. Zhang, and P. Chang, “Exploit internal struc-
tural information for iot malware detection based on hierarchical trans-
former model,” in 2020 IEEE 19th International Conference on Trust,
Security and Privacy in Computing and Communications (TrustCom),
IEEE, 2020, pp. 927–934.
[38] Y. Ding, S. Wang, J. Xing, X. Zhang, Z. Oi, G. Fu, Q. Qiang, H. Sun,
and J. Zhang, “Malware classification on imbalanced data through self-
attention,” in 2020 IEEE 19th International Conference on Trust, Secu-
rity and Privacy in Computing and Communications (TrustCom), IEEE,
2020, pp. 154–161.
[39] A. Rahali and M. A. Akhloufi, “Malbert: Using transformers for cyberse-
curity and malicious software detection,” arXiv preprint arXiv:2103.03806,
2021.
[40] F. Yang, Y. Zhuang, and J. Wang, “Android malware detection using
hybrid analysis and machine learning technique,” in International Con-
ference on Cloud Computing and Security, Springer, 2017, pp. 565–575.
[41] M. A. Salah, M. F. Marhusin, and R. Sulaiman, “A two-stage malware
detection architecture inspired by human immune system,” in 2018 Cy-
ber Resilience Conference (CRC), IEEE, 2018, pp. 1–4.
[42] J. Hwang, J. Kim, S. Lee, and K. Kim, “Two-stage ransomware detec-
tion using dynamic analysis and machine learning techniques,” Wireless
Personal Communications, vol. 112, no. 4, pp. 2597–2609, 2020.
[43] A. D. Raju and K. Wang, “Echelon: Two-tier malware detection for raw
executables to reduce false alarms,” arXiv preprint arXiv:2101.01015,
2021.
[44] L. Yang and J. Liu, “Tuningmalconv: Malware detection with not just
raw bytes,” IEEE Access, vol. 8, pp. 140 915–140 922, 2020.
[45] A. Yan, Z. Chen, R. Spolaor, S. Tan, C. Zhao, L. Peng, and B. Yang,
“Network-based malware detection with a two-tier architecture for online
incremental update,” in 2020 IEEE/ACM 28th International Symposium
on Quality of Service (IWQoS), IEEE, 2020, pp. 1–10.
[46] R. Wightman, Pytorch image models, https://github.com/rwightman/pytorch-image-models, 2019. doi: 10.5281/zenodo.4414861.
[47] Malconv2, https://github.com/NeuromorphicComputationResearchProgram/MalConv2, 2020.