Malware Classification Using Static Disassembly and Machine Learning
This paper was downloaded from TechRxiv (https://www.techrxiv.org).
LICENSE
CC BY 4.0
17-12-2021 / 21-12-2021
CITATION
Chen, Zhenshuo; Brophy, Eoin; Ward, Tomas (2021): Malware Classification Using Static Disassembly and
Machine Learning. TechRxiv. Preprint. https://doi.org/10.36227/techrxiv.17259806.v1
DOI
10.36227/techrxiv.17259806.v1
Abstract—Network and system security are incredibly critical issues now. Due to the rapid proliferation of malware, traditional analysis methods struggle with enormous numbers of samples. In this paper, we propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content complexity, and import libraries, to classify malware families, and use automatic machine learning to search for the best model and hyper-parameters for each feature and their combinations. Compared with detailed behavior-related features like API sequences, the proposed features provide macroscopic information about malware. The analysis is based on static disassembly scripts and hexadecimal machine code. Unlike dynamic behavior analysis, static analysis is resource-efficient and offers complete code coverage, but is vulnerable to code obfuscation and encryption. The results demonstrate that features which work well in dynamic analysis are not necessarily effective when applied to static analysis. For instance, API 4-grams only achieve 57.96% accuracy and involve a relatively high-dimensional feature set (5000 dimensions). In contrast, the novel proposed features together with a classical machine learning algorithm (Random Forest) present very good accuracy at 99.40%, and the feature vector is of much smaller dimension (40 dimensions). We demonstrate the effectiveness of this approach through integration in IDA Pro, which also facilitates the collection of new training samples and subsequent model retraining.

Index Terms—Malware Classification, Reverse Engineering, Machine Learning, System Security

1. INTRODUCTION

Network and system security are incredibly critical issues at the moment. According to [1], 142 million threats were being blocked every day in 2019. Furthermore, new types of malware are appearing all the time and are increasingly aggressive. For instance, the use of malicious PowerShell scripts increased by 1000% in the same year. To make matters worse, anti-anti-virus techniques used by attackers are also steadily improving. The use of polymorphic engines allows malware developers to mutate existing code while retaining the original functions unchanged. This is achieved, for example, through the use of obfuscation and encryption. This has now led to a rapid proliferation of malware which traditional analysis methods struggle to cope with, as these rely on signature matching and heuristic rules.

A signature is a model or hash that can uniquely identify a file by machine code, essential strings, or sensitive instruction sequences. A large database stores the signatures of existing samples, and they are compared with the signature generated from an unknown file to find a match. This technique is easy to implement but sensitive even to tiny code modifications. Furthermore, it is also difficult to recognize totally new malware because its signature has not yet been captured in the database. Heuristic rules are determined by malware experts after analyzing malicious behaviors. In general, they need to review code instructions, check memory data changes and record system events to understand each sample.

These traditional methods share the same drawback: unseen samples must be manually analyzed before signatures or heuristic rules can be created. However, analysts cannot review each unknown file in practice. Machine learning approaches, in contrast, do not rely on understanding code and malicious behaviors. After training with a wide range of known samples, such methods can more easily identify potential malware compared to human experts. Some automatic models have been applied in related fields, such as malware homology analysis by dynamic fingerprints in [2], and gray-scale image representation of malware in [3], which did not require disassembly or code execution.

We adopt a machine learning approach based on static analysis in this work. The primary explorations and experiments of this paper are as follows:

• Four small-scale features are proposed: import libraries, section sizes, section permissions, and content complexity. The feature descriptions are in Section 4.4, Section 4.5, Section 4.6, and Section 4.7 respectively. Unlike traditional large-scale features like API n-grams and opcode n-grams, which focus on detailed APIs and assembly instructions, these new features are easy to extract and provide macroscopic information about malware. In the experiments using Random Forest, their combination achieved a maximum of 99.40% accuracy with only 40 dimensions.
• The API n-gram, an efficient dynamic behavior fea-
ture usually generated by system event logging, is
applied to static analysis. The result demonstrates
that actual API sequences are hard to extract from
disassembly scripts. Inaccurate n-grams have a sub-
stantial negative impact on classification. In the spe-
cific studies done here, the highest accuracy based
on such features is only 57.96% with Random Forest
and a 5000-dimensional feature vector.
• A method of using the classifier in practice is pro-
posed. With the help of IDA Pro [4], the most
popular reverse analysis tool, new training data can
be generated from the latest known malware. It also
provides a Python development kit and using this the
classifier proposed here is implemented as an IDA
Pro plug-in. This allows an analyst using IDA Pro to
process a malware sample and perform classification
immediately within their workflow.
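The n-gram features referred to above can be illustrated with a minimal sketch. The opcode stream and window size here are hypothetical examples, not drawn from the paper's dataset; the paper additionally applies feature selection to shrink the resulting vector.

```python
from collections import Counter

def ngrams(seq, n=4):
    """Sliding-window n-grams over an API or opcode sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Hypothetical opcode stream; counts of each n-gram form the feature vector.
opcodes = ["push", "mov", "call", "pop", "push", "mov", "call", "pop"]
counts = Counter(ngrams(opcodes))
```

Each distinct 4-gram becomes one dimension, which is why raw opcode 4-grams reach over a million dimensions before feature selection.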
Figure 1. Compilation, assembly and disassembly
2. BACKGROUND
2.1. Machine Code and Assembly Languages

CPUs can only process machine code, which of course consists of binary numbers. However, it is incredibly challenging to program directly with machine code, so assembly languages are used instead. They use a mnemonic to represent each low-level machine code instruction. There is a strong correspondence between assembly instructions and an architecture's machine code. Every assembly language is designed for exactly one specific computer architecture, such as ARM and Intel x86. The conversion from assembly languages to executable machine code is performed by assemblers, available since the 1950s.

Compared with assembly languages, C and C++ are higher-level programming languages. The source code of C/C++ is first translated into assembly language by compilers, then converted into machine code by assemblers, as shown in Fig. 1.

For malware analysts, there is no high-level source code available for review but only executable files, like .exe and .dll on Windows systems. Because of the correspondence between assembly languages and machine code, executable files can be translated into assembly instructions as in Fig. 1. This process is called disassembly.

for humans to understand. A simple way is to insert data bytes into code. Listing 1 provides an example. Because of the jmp next instruction, the byte definition db 10 will not be executed, as if it did not exist. But disassemblers may treat this byte as code, which makes the following instruction incorrect, as in Listing 2: db 10 and mov eax, 0 are wrongly translated as or bh, byte ptr [eax].

Listing 1. Inserting a data byte into code
jmp next
db 10
next:
mov eax, 0

Listing 2. Incorrect linear sweep disassembly for Listing 1
jmp next
or bh, byte ptr [eax]

Another way of performing code obfuscation is to use roundabout expressions. For example, a logical operation a^c can be inflated to a^b^c^b. Attackers can also use jump instructions to make the real execution flow different from the disassembly script, as in Fig. 2.
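The pitfall shown in Listings 1 and 2 can be reproduced with a toy linear-sweep decoder. This sketch hard-codes only the three opcodes appearing in the example (jmp rel8, the or opcode 0x0A, and mov eax, imm32); a real disassembler handles the full instruction set.

```python
# Byte sequence corresponding to Listing 1:
#   EB 01          jmp next   (skip 1 byte)
#   0A             db 10      (data byte inserted by the obfuscator)
#   B8 00 00 00 00 mov eax, 0
CODE = bytes([0xEB, 0x01, 0x0A, 0xB8, 0x00, 0x00, 0x00, 0x00])

def linear_sweep(code):
    """Decode instructions back-to-back, never following jumps."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op == 0xEB:                # jmp rel8: 2 bytes
            out.append(f"jmp +{code[i + 1]}")
            i += 2
        elif op == 0x0A:              # or r8, r/m8; ModRM 0xB8 here means
            out.append("or bh, byte ptr [eax]")  # bh, [eax]+disp32
            i += 2 + 4                # opcode + ModRM + 4-byte displacement
        elif op == 0xB8:              # mov eax, imm32: 5 bytes
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov eax, {imm}")
            i += 5
        else:
            out.append(f"db {op}")
            i += 1
    return out

print(linear_sweep(CODE))
```

The sweep decodes the data byte 0x0A as an opcode, which swallows the bytes of the real mov eax, 0 and produces exactly the wrong listing shown in Listing 2.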
Figure 7. The distribution of writable virtual sizes

Additionally, we think these two PE section features (sizes and permissions) have compatibility with Linux systems. Linux uses the Executable and Linkable Format (ELF)¹ for executable files. It has similar section structures to the PE format.

4.7. Content Complexity

Content complexity is a new feature type for malware classification. What we propose here has six fixed dimensions: the original sizes, compressed sizes and compression ratios of disassembly and machine code files. We used Python's zlib library to compress samples and recorded the size changes. This approximates function complexity, code encryption and obfuscation. Listing 5 is from the sample with the largest disassembly compression ratio of 12.8. It might be obfuscated with repetitive, roundabout instructions. In contrast, Listing 6 has the smallest disassembly compression ratio of 2.3. The disassembly process failed and IDA Pro could only output its original machine code. This is because the sample is encrypted and packed by UPX², a famous open-source packer for executable files. It is difficult to compress the encrypted content again, which causes a low compression ratio. In addition to this, the use of complex, rare instructions can also lead to low compression ratios.

Listing 5. Disassembly snippet with the largest compression ratio
mov eax, esi
ror eax, 8
mov ecx, esi

1. https://en.wikipedia.org/wiki/Executable_and_Linkable_Format
2. https://upx.github.io

5. EXPERIMENTS

For each feature and its combinations, we used the automatic machine learning library auto-sklearn to search for the best parameters, relying on Bayesian optimization, meta-learning and ensemble construction [13]. 80% of the dataset was used as a training set and auto-sklearn evaluated models on it using 5-fold cross-validation. The models include K-Nearest Neighbors, Support Vector Machine and Random Forest. All experiments were conducted on 64-bit Ubuntu, Intel(R) Core(TM) i7-6700 CPU (3.40GHz) with 12GB RAM. Each model's parameter search process lasted up to one hour. After auto-sklearn had determined a model's optimal parameters, we used the remaining 20% as a test set to calculate classification accuracy. The results are shown in Table 3, sorted in increasing order of accuracy. Random Forest provided the best performance in all experiments. For the "Dimension" column of some features, the numbers before and after the arrow indicate the size of the feature before and after feature selection respectively.

Among individual features, opcode 4-grams provided the highest accuracy of 99.08%, meaning static disassembly does not have many negative impacts on opcode 4-grams. They are effective in both dynamic and static analysis, but their extraction requires much time and computational resources. The original dimension of opcode 4-grams before feature selection is the largest (1408515). Content complexity, PE section sizes and PE section permissions achieved 98.11%, 97.75% and 97.01% accuracy respectively, which is satisfactory considering they are low-dimensional representations. Import libraries did not perform very well, but the prediction paths generated by a Decision Tree can provide functionality comparisons between malware families, as in Fig. 6. Other features except API sequences cannot do this. At the beginning, we expected that the API sequence would be an effective feature in static analysis, as it is in dynamic analysis. Unexpectedly, the API 4-gram is the worst. Its accuracy is only 57.96% and it involves a 5000-dimensional feature vector representation. Our result shows that incorrect sequences extracted from static disassembly scripts have a very negative effect on feature validity.
TABLE 3. The feature accuracy
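The content complexity feature from Section 4.7 can be sketched in a few lines of Python with the zlib library the paper mentions. This is a minimal illustration, not the paper's exact implementation: the compression settings and the toy inputs below are assumptions.

```python
import os
import zlib

def content_complexity(disasm, machine_code):
    """Six fixed dimensions: original size, compressed size and
    compression ratio for the disassembly and machine-code files."""
    features = []
    for data in (disasm, machine_code):
        original = len(data)
        compressed = len(zlib.compress(data))
        features += [original, compressed, original / max(compressed, 1)]
    return features

# A repetitive, obfuscated-looking script compresses far better than
# random bytes, which mimic encrypted or packed (e.g. UPX) content.
repetitive = b"mov eax, esi\nror eax, 8\nmov ecx, esi\n" * 100
random_like = os.urandom(len(repetitive))
```

On these toy inputs the repetitive script yields a far higher compression ratio than the random bytes, matching the intuition behind Listings 5 and 6.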
6.1. Lazy Loading

In the process of extracting import libraries, only the libraries in the Import Table can be extracted, which is a structure in PE Headers used to import external APIs. These libraries will be automatically loaded when the malware starts. In order to keep malicious behavior more hidden, developers can use lazy loading to load a library just before it is about to be used. Lazily loaded libraries cannot be extracted from static content. Table 4 shows the top libraries based on Gini Impurity. They are ubiquitous and have no special significance for malware classification. A reasonable speculation is that sensitive libraries are lazily loaded and PE Headers only contain regular libraries.

TABLE 4. The top five important libraries

Library    Description
MSASN1     Abstract Syntax Notation One Runtime
MSVCRT     Microsoft Visual C++ Runtime
UXTheme    Microsoft Windows Controls
OpenGL32   Open Graphics Library
ADVAPI32   Microsoft Security and Registry Services

6.2. Name Mangling

Name mangling adds noise to the API n-gram extraction. For the same or similar functions, we may extract more than one name. A theoretical solution is to convert mangled names back to the same original name. However, in practice, it is challenging to develop converters for every possible compiler and language. Moreover, some compilers do not disclose their detailed name mangling mechanism.

6.3. Jump Thunk

Jump thunks are the second reason for the poor performance of API sequences. Many compilers generate a jump thunk, a small code snippet, for each external API, then convert all calls to an API into calls to its jump thunk. This mechanism can provide an interface proxy for an API. In Listing 7, there are two Windows file manipulation APIs. After defining all jump thunks in the beginning part of the code, the rest of the code only uses a jump thunk to call an external API. For instance, all calls to WriteFile become calls to its thunk j_write_file.

Listing 7. Jump thunks
j_write_file proc
jmp WriteFile
j_write_file endp

j_read_file proc
jmp ReadFile
j_read_file endp

call j_read_file
call j_write_file
call j_read_file

Jump thunks make API sequences inaccurate. If we simply use linear scanning to extract external API calls from Listing 7, we will get the sequence WriteFile, ReadFile. But the true sequence is ReadFile, WriteFile, ReadFile, covered and hidden by the thunks. Theoretically, we can recognize jump thunks and match them to external APIs. But thunks' names are random and their content may be more complex than jump instructions.
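The thunk-matching idea can be sketched as a two-pass scan over a disassembly listing: first map each thunk to the API it jumps to, then translate calls through that map. This toy parser assumes the simplified listing format of Listing 7; real IDA Pro output is richer and, as noted above, real thunk bodies may be more complex than a single jmp.

```python
import re

# Toy disassembly mirroring Listing 7.
DISASM = """\
j_write_file proc
jmp WriteFile
j_write_file endp
j_read_file proc
jmp ReadFile
j_read_file endp
call j_read_file
call j_write_file
call j_read_file""".splitlines()

def api_sequence(lines):
    """Pass 1: map each thunk name to the API it jumps to.
    Pass 2: translate 'call thunk' into the real API name."""
    thunks, current = {}, None
    for line in lines:
        if m := re.match(r"(\S+)\s+proc", line):
            current = m.group(1)
        elif (m := re.match(r"jmp\s+(\S+)", line)) and current:
            thunks[current] = m.group(1)
        elif line.endswith("endp"):
            current = None
    seq = []
    for line in lines:
        if m := re.match(r"call\s+(\S+)", line):
            # Fall back to the raw target for direct (non-thunk) calls.
            seq.append(thunks.get(m.group(1), m.group(1)))
    return seq

print(api_sequence(DISASM))
```

This recovers the true sequence ReadFile, WriteFile, ReadFile instead of the misleading order in which the thunks are defined.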
7. PRACTICAL APPLICATION

As discussed in [8], unlike other machine learning applications like handwritten digit classification, where the shape of numbers is not updated over time, the similarity between previous and future malware will degrade over time due to function updates and polymorphic techniques. Polymorphic techniques can automatically and frequently change identifiable characteristics like encryption types and code distribution to make malware unrecognizable to anti-virus detection. To solve this, we designed an automatic malware classification workflow to apply and enhance our classifier in practice with IDA Pro's Python development kit, as shown in Fig. 8. The source code is available on GitHub³ and offers the following practical contributions.

Figure 8. The automatic malware classification workflow

1) Data Generation
In general, analysts can only collect raw executable malware, not disassembly scripts like those provided in Microsoft's dataset. To generate similar data, we developed an IDA Pro script that can be run from the command line with IDA Pro's parameters -A and -S, which launch IDA Pro in autonomous mode and make it run a script. For each executable file, it produces disassembly instructions and hexadecimal machine code, relying on IDA Pro's disassembler. These two output files are in the same format as the files used for training in the dataset.

2) Automatic Classification
We used another automatic machine learning library, TPOT, to search for the best model for the feature combination of PE section sizes, PE section permissions and content complexity. We think this combination maintains a good balance between accuracy and the number of dimensions. TPOT achieved 99.26% accuracy, slightly lower than auto-sklearn (99.40%). Unlike auto-sklearn, TPOT uses Genetic Programming to optimize models [14]. Once the search is complete, it provides Python code for the best pipeline; auto-sklearn does not have a similar function. With the fitted model, we developed an IDA Pro classifier plug-in. When an analyst opens an executable sample with IDA Pro, the plug-in will produce the required disassembly and machine code files, extract features and perform classification as in Listing 8.

3) Manual Classification
Although automatic classification is very useful, the result may be inaccurate or in doubt, especially when a sample does not belong to known families. Therefore the plug-in provides the probability distribution for analysts to perform in-depth analysis manually and determine a sample's exact family.

4) Model Training
With sufficient output files and labels for the latest samples, the classifier can be retrained and strengthened either manually or in an automated fashion.

3. https://github.com/czs108/Microsoft-Malware-Classification
Our model was trained on these nine malware families only, so if an input sample does not belong to them, the model will get an incorrect result or classify the sample into the family that is most similar to its actual type. In the ideal case, these features are applicable to more families if more datasets are available; we just need to retrain the model. But we think that as the number of malware families becomes larger, their effectiveness may gradually decrease. They may be too simple to distinguish between huge families.

8. CONCLUSION AND FUTURE WORK

This paper demonstrates how novel, highly discriminative features of relatively low dimensionality, when combined with automatic machine learning approaches, can provide highly competitive accuracy for malware classification. Compared with traditional manual analysis, machine learning can provide a fast and accurate classifier after training on the latest malware samples. It does not rely on an understanding of code. Unlike API and opcode n-grams, which aim to match specific malicious operations, our features focus more on macroscopic information about malware. In theory, these features are more compatible with multiple operating systems and not susceptible to code encryption. One shortcoming is that they cannot offer a detailed understanding of malicious behaviors the way API sequences can. Analysts must combine multiple features in order to perform more in-depth analysis. In addition, the negative limitations and effects of static text are more severe than we thought, especially for API n-grams. It is challenging to extract exact API sequences from disassembly scripts simply with linear scanning.

We conclude with a number of open avenues for research that might reduce the negative effects of static analysis and improve machine learning models for malware processing:

• Remove regular libraries from the import library feature, so that machine learning models are forced to use only sensitive libraries to classify samples. Note that a potential problem here is that only a tiny number of sensitive libraries may be extracted.

• Although many C/C++ compilers exist, there are not many commonly used versions. We can consider developing name demangling for common compilers and renaming APIs using our own defined convention.

• The core of a disassembly script is assembly instructions, so assemblers may be helpful to perform code analysis to determine the correspondence between APIs and jump thunks.

References

[1] Symantec Corporation, "Internet security threat report," Available at https://docs.broadcom.com/doc/istr-24-2019-en (2021/05/10), Symantec Corporation, Tech. Rep., 2019.
[2] R. Zheng, Y. Fang, and L. Liu, "Homology analysis of malicious code based on dynamic-behavior fingerprint (in Chinese)," Journal of Sichuan University (Natural Science Edition), vol. 53, no. 004, pp. 793–798, 2016.
[3] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, "Malware images: Visualization and automatic classification," in Proceedings of the 8th International Symposium on Visualization for Cyber Security, ser. VizSec '11. New York, NY, USA: Association for Computing Machinery, 2011. [Online]. Available: https://doi.org/10.1145/2016904.2016908
[4] Hex-Rays, IDA Help: The Interactive Disassembler Help, Liège, Belgium, 2021. [Online]. Available: https://hex-rays.com/products/ida/support/idadoc/index.shtml
[5] Microsoft Corporation, "PE format," Available at https://docs.microsoft.com/en-us/windows/win32/debug/pe-format (2021/05/10), 2021.
[6] Z. Fuyong and Z. Tie-zhu, "Malware detection and classification based on n-grams attribute similarity," 2017 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), vol. 01, pp. 793–796, 2017.
[7] M. Nahid, B. Mehdi, and R. Hamid, "An improved method for packed malware detection using PE header and section table information," International Journal of Computer Network and Information Security, vol. 11, pp. 9–17, 09 2019.
[8] D. Gibert, C. Mateu, and J. Planes, "The rise of machine learning for detection and classification of malware: Research developments, trends and challenges," Journal of Network and Computer Applications, vol. 153, p. 102526, 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1084804519303868
[9] M. Kalash, M. Rochan, N. Mohammed, N. D. B. Bruce, Y. Wang, and F. Iqbal, "Malware classification with deep convolutional neural networks," in 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 2018, pp. 1–5.
[10] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[11] E. Raff, J. Barker, J. Sylvester, R. Brandon, B. Catanzaro, and C. Nicholas, "Malware detection by eating a whole EXE," 2017.
[12] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, "Microsoft malware classification challenge," ArXiv, vol. abs/1802.10135, 2018.
[13] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2962–2970. [Online]. Available: http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
[14] T. T. Le, W. Fu, and J. H. Moore, "Scaling tree-based automated machine learning to biomedical big data with a feature set selector," Bioinformatics, vol. 36, no. 1, pp. 250–256, 2020.