Malicious Behavior Detection Method Using API Sequence in Binary Execution Path
17559/TV-20210202132203
Original scientific paper
Abstract: The amount of malware is growing rapidly, and its types and behaviors are becoming increasingly diverse. New types and variants of malicious code, unlike
existing ones, continue to be identified, and analyzing all of them takes a great deal of time. To address this problem, malware analysts research effective ways to
reduce analysis time and cost. In this paper, we propose a method that expresses the characteristics of malicious code using API sequences and uses them for
detection and classification. We compare and analyze several existing expression methods and verify the effectiveness of the proposed method on actual malware
samples. Using the proposed expression method, we detected six malicious behaviors: DLL Injection, Downloader, IAT Hooking, Key Logger, Screen Capture,
and Anti-debugging. The proposed method detected more malicious behaviors than conventional detection methods, and the more complex the malicious behavior,
the higher the detection efficiency. In addition, although static analysis is adopted as the main method, the flow of malicious behavior can be analyzed because the
method searches execution paths.
Keywords: API sequence; binary execution path; malware analysis; malware detection
that identify malicious behaviors by using the n-gram technique [9], which cuts the information in the file according to a certain standard and processes the cut pieces of information. In addition, studies that express the byte sequences of binary code as n-grams with a view to classifying malware were also carried out [10]. Such methods of collecting signatures for the internal structure of a file can be defined as extracting the unique DNA of the file [11], and they are used for similarity measurement and classification of malware on that basis. A clear and definite basis for judging malicious behaviors is the discovery of the functions used by malware. Previous studies have attempted to identify or classify malware by processing such APIs within programs [12-14]. Representative methods list the sequences of APIs [15] or collect log information on the use of APIs to determine malicious behaviors [16]. Since APIs are functions used when the program is executed, they are either statically collected [17] or dynamically monitored [18]. In addition to the methods that simply list the APIs, there are methods that extract the features of malware according to the frequency of use of the APIs inside the file [19]. The above studies are statistical methods with advantages such as modest storage requirements, a small amount of computation, and high speed. However, they cannot respond to malware in real time and cannot accurately judge diverse malware behaviors because they are based on simple statistics [20]. To compensate for this, some recent studies have grafted various algorithms onto such statistical properties to detect malicious behaviors. The eigenvalues of opcode-based graph images can be calculated by measuring the distances between the nodes with the K-Nearest Neighbor (KNN) algorithm, which is one of the machine learning algorithms [21]. In addition, the processed strings can be reprocessed with the Longest Common Subsequence (LCS) algorithm to measure the eigenvalues of the strings [22]. In the studies introduced above, static analysis-based methods collect signatures or list them in sequence, but cannot identify the accurate features of behaviors because they rely on code-based feature extraction. To compensate for this problem, dynamic analysis is adopted as the main detection method [23] or a mixture of static and dynamic analyses is adopted [24]. However, since dynamic analysis directly executes malware for analysis, it has disadvantages in energy efficiency and analysis time [25]. In addition, several studies and methods for classifying malware are under way [26, 27]. In our previous study, we studied how to express the features of malware using APIs [28]. In this paper, a method that is based on static analysis but tracks the execution flow is proposed so that the effect of dynamic analysis can be expected, complementing studies in which static and dynamic analyses are mixed.

3 PROPOSED METHOD
3.1 Method Architecture

First, the malware executable file is converted into disassembled binary code through the binary reverse engineering tool IDA [29] (IDA PRO 6.6). IDA extracts the disassembled code for the executable into an .asm file, which is used as data for the static execution path search. Since malicious behaviors are determined by the API call sequences found in the execution path search, an API extraction process follows. In this study, graphical imaging is performed based on APIs. However, not all the extracted APIs can be applied to imaging because they amount to tens of thousands of kinds, so they are made into graph images through classification to represent malicious behaviors. Finally, the mutual similarity relations of pieces of malware are shown through image-based similarity determination.

3.2 Static Execution Path Exploration

The core of the static execution path search in this paper is that the instruction sets and subroutines are divided into true and false ones according to the branch instruction before they are searched. First, among the assembly instructions, the instructions for branching are jz, jnz, and so on. The branch point is determined by comparison instructions such as cmp and test that occur before the branch instruction, or by logical operation instructions such as xor. We assign true and false marks according to the branch instructions to search all instruction sets and subroutines.
IDA's disassembly can identify the instruction sets and subroutines used in the search. The instruction sets, which perform some behaviors within functions, are represented by loc_xxxxxx, and the subroutines or basic blocks are represented by special prefixes such as sub_xxxxxx. IDA provides IDAPython, a Python scripting interface, for powerful processing of binary code. In this study, searches were performed using IDA APIs such as GetFunctionName, CodeRefsFrom, and CodeRefsTo [30].

Table 1 Example binary code to display static execution path search
1 loc_401460:
2 mov eax, [esp + argc]
3 sub esp, 44h
4 cmp eax, 2
5 push ebx
6 push ebp
7 push ebi
8 jnz loc_401488
9 loc_401484:
10 xor eax, eax
11 jmp short loc_40148D
12 loc_401488:
13 sbb eax, eax
14 sbb eax, 0FFFFFFFFh

Tab. 1 is the binary code for displaying static execution path searches. After the cmp instruction in the fourth line, the binary code at loc_401460 branches to the instruction set at loc_401488 because of the jnz in the eighth line, which is a branch instruction. If the result of the comparison is true, the code branches to loc_401488, and if false, to loc_401484. In summary, loc_401460 is a binary code that branches to loc_401488 when true and to loc_401484 when false.
Visualizing the binary code in Tab. 1 will look like Fig. 1.
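The true/false marking of the Tab. 1 listing can be illustrated with a minimal Python sketch. It is not the authors' IDAPython implementation: it parses a plain-text listing rather than querying IDA, the function name mark_edges and the CONDITIONAL set are hypothetical, the push instructions are abridged, and treating unconditional jumps and fall-through as NORMAL is an assumption about how the Normal mark is used.

```python
import re

# Listing adapted from Tab. 1 (line numbers dropped, push instructions abridged).
LISTING = """
loc_401460:
    mov eax, [esp+argc]
    sub esp, 44h
    cmp eax, 2
    jnz loc_401488
loc_401484:
    xor eax, eax
    jmp short loc_40148D
loc_401488:
    sbb eax, eax
    sbb eax, 0FFFFFFFFh
"""

# Illustrative subset of conditional branch mnemonics (assumption).
CONDITIONAL = {"jz", "jnz", "je", "jne"}


def mark_edges(listing):
    """Return (source, target, mark) edges between labeled blocks."""
    blocks = []  # list of (label, [instruction lines])
    for raw in listing.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        label = re.fullmatch(r"(\w+)\s*:", line)
        if label:
            blocks.append((label.group(1), []))
        elif blocks:
            blocks[-1][1].append(line)

    edges = []
    for i, (label, insns) in enumerate(blocks):
        # The block that follows in the listing is the fall-through target.
        fall_through = blocks[i + 1][0] if i + 1 < len(blocks) else None
        last = insns[-1].split() if insns else []
        mnemonic = last[0].lower() if last else ""
        if mnemonic in CONDITIONAL:
            edges.append((label, last[-1], "TRUE"))        # branch taken
            if fall_through:
                edges.append((label, fall_through, "FALSE"))  # branch not taken
        elif mnemonic == "jmp":
            edges.append((label, last[-1], "NORMAL"))
        elif fall_through:
            edges.append((label, fall_through, "NORMAL"))
    return edges


for edge in mark_edges(LISTING):
    print(edge)
```

For this listing the sketch yields a TRUE edge from loc_401460 to loc_401488, a FALSE edge from loc_401460 to loc_401484, and a NORMAL edge from loc_401484 to loc_40148D, which matches the description of Tab. 1.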
GetLastError, and CloseHandle are called in lines 4, 6, 11, 17, and 26 respectively. In this paper, the Normal, True, and False mark application mechanism is equally applied to APIs to analyze the interactions between the APIs and the behaviors of the APIs.

Figure 3 Search for the static execution path of the binary code

3.4 API Classification

In this study, malicious behaviors are visualized in the form of graphs expressed with nodes and connecting lines as shown in Fig. 5. However, the number of APIs is too large to make each API into a node. To solve this problem, the APIs are reclassified into 24 upper categories according to their functions so that behaviors can be clearly judged and time efficiency can be enhanced [31, 32]. For instance, CreateFile and CreateProcess are APIs that perform functions related to "files" or "processes", and APIs such as GetSystemTime and GetLocalTime collect information on "time" in the system. In addition, APIs such as strcmp and strcat all perform functions related to strings. Although APIs have already been classified in MSDN, that classification has many categories to which APIs commonly belong and is too abstract for understanding behaviors. For instance, all process-related APIs are included in the category process, but whether the relevant APIs created, deleted, or accessed processes cannot be known. Therefore, the functions of APIs were reclassified into three, which are CREATE_OR_OPEN, READ_OR_ACCESS, and CLOSE. Tab. 4 shows the final 24 API categories.
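The reclassification can be pictured as a lookup from an API name to a category and one of the three functions. The following Python sketch is only an illustration: the table API_CATEGORY, the helper classify, the category names, and the category assigned to each API are assumptions made for the example, not the contents of Tab. 4.

```python
# Hypothetical lookup table covering a handful of the APIs named in the text.
# Category names and the category/function assigned to each API are guesses
# for illustration; Tab. 4 of the paper holds the actual 24 categories.
API_CATEGORY = {
    "CreateFile": ("FILE", "CREATE_OR_OPEN"),
    "CreateProcess": ("PROCESS", "CREATE_OR_OPEN"),
    "OpenProcess": ("PROCESS", "CREATE_OR_OPEN"),
    "WriteProcessMemory": ("PROCESS", "READ_OR_ACCESS"),
    "GetSystemTime": ("TIME", "READ_OR_ACCESS"),
    "GetLocalTime": ("TIME", "READ_OR_ACCESS"),
    "strcmp": ("STRING", "READ_OR_ACCESS"),
    "strcat": ("STRING", "READ_OR_ACCESS"),
}


def classify(api_name):
    """Map an imported API name to a 'CATEGORY-FUNCTION' node label."""
    entry = API_CATEGORY.get(api_name)
    # Win32 APIs often come in ANSI/Unicode variants (CreateFileA / CreateFileW).
    if entry is None and api_name[-1:] in ("A", "W"):
        entry = API_CATEGORY.get(api_name[:-1])
    if entry is None:
        return "UNKNOWN"
    category, function = entry
    return f"{category}-{function}"


print(classify("CreateFileW"))   # FILE-CREATE_OR_OPEN
print(classify("GetLocalTime"))  # TIME-READ_OR_ACCESS
```

Collapsing tens of thousands of API names into a small set of labels in this way keeps the number of graph nodes bounded while still recording whether a resource was created, accessed, or closed.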
searches despite the fact that it is a static analysis so that the effects of dynamic analysis can be expected.
As a representative example, when the malicious behavior of Trojan.Graftor.D4C56B is analyzed by the method proposed in this paper, the graphic image shown in Fig. 6 appears.
Fig. 6 shows that the malware uses APIs in categories such as String, SystemInformation, Module, and Process. In particular, a detailed analysis of the red-shaded API behaviors is as follows. Given that the relevant APIs include process APIs such as OpenProcess, Process32Next, WriteProcessMemory, and VirtualAllocEx, together with APIs used for manipulating DLLs, the API behaviors can be confirmed as DLL injection, which inserts code that calls LoadLibrary into a remote process to force the DLL to be loaded into the context of that process.

4.2 Comparison with Dynamic Analysis

With regard to the behaviors shown in 4.1, the existing simple API collection and listing method, the API Monitor based [33] dynamic analysis, and the method proposed in this paper are compared as shown in Tab. 5.
Tab. 6 shows the interactions of the behaviors of all APIs. For clarity of analysis and efficiency of analysis time, these behaviors are shown after being combined according to the categorization of APIs, as shown in Tab. 7. In addition, an example of applying each of the methods to the code in Tab. 8 is also shown.
The test set is 1,236 pieces of randomly generated malware, and all of them include an IAT (Import Address Table) because the method proposed in this paper analyzes the interactions between APIs. First, through the identified common behavior graphs, each malicious behavior was analyzed based on the data set consisting of 1,236 pieces of malware.
Figs. 7 to 12 are graphs comparing the method proposed in this study and the existing method. It can be seen that the proposed method detects larger numbers of the malicious behaviors DLL Injection, IAT Hooking, Screen Capture, and Anti Debugging than the existing detection methods.

Figure 7 DLL Injection

Table 7 Malicious Behavior in Categorization of API (behavior graph images omitted)
Behavior: DLL injection
Sequence: PROCESS-READ_OR_ACCESS (TRUE) RESOURCE (TRUE) LIBRARY (NORMAL) THREAD
Behavior: Downloader
Sequence: NETWORK-READ_OR_ACCESS (TRUE) LIBRARY
Behavior: [name not recoverable]
Sequence: LIBRARY (TRUE) STRING (NORMAL) RESOURCE
Behavior: KeyLogger
Sequence: WINDOW-GUI-BITMAP (TRUE) HOOK
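One way to read the sequences of Tab. 7 is as chains of marked edges between category nodes, which can then be looked up in the graph extracted from a sample. The Python sketch below follows that reading. It is an assumption about the matching step, not the procedure used in the paper, and the names PATTERNS, chain_to_edges, and detect are hypothetical.

```python
# Behavior chains transcribed from Tab. 7 as [node, mark, node, mark, node, ...].
PATTERNS = {
    "DLL injection": ["PROCESS-READ_OR_ACCESS", "TRUE", "RESOURCE",
                      "TRUE", "LIBRARY", "NORMAL", "THREAD"],
    "Downloader": ["NETWORK-READ_OR_ACCESS", "TRUE", "LIBRARY"],
    "KeyLogger": ["WINDOW-GUI-BITMAP", "TRUE", "HOOK"],
}


def chain_to_edges(chain):
    """Turn [node, mark, node, ...] into (source, mark, target) triples."""
    return [(chain[i], chain[i + 1], chain[i + 2])
            for i in range(0, len(chain) - 2, 2)]


def detect(graph_edges):
    """Return the behaviors whose every edge occurs in the sample's graph.

    Assumption: a behavior is reported when all of its marked edges are
    present, regardless of what else the graph contains.
    """
    edges = set(graph_edges)
    return sorted(name for name, chain in PATTERNS.items()
                  if all(edge in edges for edge in chain_to_edges(chain)))


# Made-up graph of one sample, as (source, mark, target) triples.
sample = [
    ("PROCESS-READ_OR_ACCESS", "TRUE", "RESOURCE"),
    ("RESOURCE", "TRUE", "LIBRARY"),
    ("LIBRARY", "NORMAL", "THREAD"),
    ("STRING", "NORMAL", "FILE"),
]
print(detect(sample))  # ['DLL injection']
```

Longer chains give the matcher more edges to confirm, which is consistent with the observation that more complex behaviors are detected more reliably.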
4.4 Efficiency
paper. These results can be supported by the accuracy and f-measure values, which are based on the false detection rates and missed detection rates of existing studies.
In this paper, we compared the binary classification results of the proposed method with those of previous studies. Since this paper uses API sequences based on static analysis, all the previous studies used for comparison are also based on static analysis. [19, 34] used API frequency, and [35, 36] used API sequence as the main method. All the accuracy and f-measure values of the method proposed in this paper were measured to be higher than those of previous studies. In addition, the pieces of malware detected by the proposed method were found to show an average VirusTotal [37] detection rate of 69%. The summary of the contents can be found in Tab. 10.
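For reference, accuracy and f-measure for a binary classifier follow from the counts of true and false positives and negatives. The Python sketch below uses the standard definitions. It reads the false detection rate as the false positive rate and the missed detection rate as the false negative rate, which is an interpretation of the paper's wording, and the counts in the usage line are made up for illustration rather than taken from Tab. 10.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),  # benign flagged as malicious
        "false_negative_rate": ratio(fn, fn + tp),  # malicious missed
    }


# Illustrative counts only.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
print(round(metrics["accuracy"], 3), round(metrics["f_measure"], 3))
```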
In this paper, a method to detect execution paths based on static analysis and judge malicious behaviors based on APIs' interrelationships was proposed. Although static analysis is the main analysis, the method proposed in this paper enables analyzing the flow of behaviors because it searches execution paths. This means that although static analysis is adopted as a main method, the advantages of dynamic analysis that directly executes APIs to analyze the APIs are applied to the method proposed in this paper. In this study, execution flows were analyzed according to branch instructions and the interactions of APIs collected during the flows were analyzed. API interactions are marked as normal, true, and false and are reclassified into and listed as 24 upper categories. In this study, the detection method based on the relevant method was compared with the existing simple API collecting method and API listing method. The malicious behaviors used for the comparison are six behaviors, which are dll injection, downloader, IAT hooking, key logger, screen capture, and anti-debugging. The method proposed in this paper showed high efficiencies in the discrimination of four behaviors among the six behaviors except for downloader and the key logger. This is because the API interactions of downloader and key logger are insufficient for judgment of the behaviors as being malicious. This is related to the complexity of malicious behaviors. As malicious behaviors became more complicated, higher efficiencies of detection appeared because the grounds for judgment of malicious behaviors became more sufficient. In future studies, the frequencies of behaviors will be added to prepare grounds for judgment of detailed behaviors. The utilization of such numerical data can be extended to apply machine learning and various statistics based algorithms, and based on such data, malware will be visualized and malware similarity will be calculated.

Acknowledgements

This research was supported by the 2018 Yeungnam University Research Grant (218A061016, 218A380138) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2018R1D1A1B07050647).

[1] Neumann, J. & Burks, A. W. (1966). Theory of self-reproducing automata. Urbana: University of Illinois press.
[2] Gandotra, E., Bansal, D., & Sofat, S. (2014). Malware analysis and classification: A survey. Journal of Information Security, 2014. https://doi.org/10.4236/jis.2014.52006
[3] Sharif, M. I., Lanzi, A., Giffin, J. T., & Lee, W. (2008). Impeding Malware Analysis Using Conditional Code Obfuscation. NDSS.
[4] Bilar, D. (2007). Opcodes as predictor for malware. International journal of electronic security and digital forensics, 1(2), 156-168. https://doi.org/10.1504/IJESDF.2007.016865
[5] Griffin, K., Schneider, S., Hu, X., & Chiueh, T. C. (2009). Automatic generation of string signatures for malware detection. International workshop on recent advances in intrusion detection, 101-120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04342-0_6
[6] Shafiq, M. Z., Tabish, S. M., Mirza, F., & Farooq, M. (2009, September). Pe-miner: Mining structural information to detect malicious executables in real time. International workshop on recent advances in intrusion detection, 121-141. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04342-0_7
[7] Santos, I., Brezo, F., Nieves, J., Penya, Y. K., Sanz, B., Laorden, C., & Bringas, P. G. (2010, February). Idea: Opcode-sequence-based malware detection. International Symposium on Engineering Secure Software and Systems, 35-43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11747-3_3
[8] Hu, K. G. S. S. X. & Chiueh, T. C. (2008). Automatic Generation of String Signatures for Malware Detection. Symantec Research Laboratories, 1-29.
[9] Santos, I., Penya, Y. K., Devesa, J., & Bringas, P. G. (2009). N-grams-based File Signatures for Malware Detection. ICEIS, 9(2), 317-320. https://doi.org/10.5220/0001863603170320
[10] Moskovitch, R., Feher, C., Tzachar, N., Berger, E., Gitelman, M., Dolev, S., & Elovici, Y. (2008). Unknown malcode detection using opcode representation. European conference on intelligence and security informatics, 204-215. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89900-6_21
[11] Choi, Y. H., Han, B. J., Bae, B. C., Oh, H. G., & Sohn, K. W. (2012). Toward extracting malware features for classification using static and dynamic analysis. 8th International Conference on Computing and Networking Technology (INC, ICCIS and ICMIC), 126-129.
[12] Zhang, M., Duan, Y., Yin, H., & Zhao, Z. (2014). Semantics-aware android malware classification using weighted contextual api dependency graphs. Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, 1105-1116. https://doi.org/10.1145/2660267.2660359
[13] Lu, H., Wang, X., & Su, J. (2013). SCMA: Scalable and collaborative malware analysis using system call sequences. International Journal of Grid and Distributed Computing, 6(2), 11-28.
[14] Elhadi, A. A. E., Maarof, M. A., & Barry, B. I. (2013). Improving the detection of malware behaviour using simplified data dependent API call graph. International Journal of Security and Its Applications, 7(5), 29-42. https://doi.org/10.14257/ijsia.2013.7.5.03
[15] Uppal, D., Sinha, R., Mehra, V., & Jain, V. (2014, September). Malware detection and classification based on extraction of API sequences. International conference on advances in computing, communications and informatics (ICACCI), 2337-2342. https://doi.org/10.1109/ICACCI.2014.6968547
[16] Fan, C. I., Hsiao, H. W., Chou, C. H., & Tseng, Y. F. (2015). Malware detection systems based on API log data mining. 39th annual computer software and applications conference, 3, 255-260. https://doi.org/10.1109/COMPSAC.2015.241
[17] Alazab, M., Venkataraman, S., & Watters, P. (2010, July). Towards understanding malware behaviour by the extraction of API calls. Second cybercrime and trustworthy computing workshop, 52-59. https://doi.org/10.1109/CTC.2010.8
[18] Rajagopalan, M., Hiltunen, M. A., Jim, T., & Schlichting, R. D. (2006). System call monitoring using authenticated system calls. IEEE Transactions on Dependable and Secure Computing, 3(3), 216-229. https://doi.org/10.1109/TDSC.2006.41
[19] Alazab, M., Venkatraman, S., Watters, P., & Alazab, M. (2010). Zero-day malware detection based on supervised learning algorithms of API call signatures.
[20] Moser, A., Kruegel, C., & Kirda, E. (2007). Limits of static analysis for malware detection. Twenty-Third Annual Computer Security Applications Conference (ACSAC 2007), 421-430. https://doi.org/10.1109/ACSAC.2007.21
[21] Firdausi, I., Erwin, A., & Nugroho, A. S. (2010, December). Analysis of machine learning techniques used in behavior-based malware detection. Second international conference on advances in computing, control, and telecommunication technologies, 201-203. https://doi.org/10.1109/ACT.2010.33
[22] Blount, J. J., Tauritz, D. R., & Mulder, S. A. (2011, July). Adaptive rule-based malware detection employing learning classifier systems: a proof of concept. 35th Annual Computer Software and Applications Conference Workshops, 110-115. https://doi.org/10.1109/COMPSACW.2011.28
[23] Nair, V. P., Jain, H., Golecha, Y. K., Gaur, M. S., & Laxmi, V. (2010). Medusa: Metamorphic malware dynamic analysis using signature from api. Proceedings of the 3rd International Conference on Security of Information and Networks, 263-269. https://doi.org/10.1145/1854099.1854152
[24] Roundy, K. A. & Miller, B. P. (2010). Hybrid analysis and control of malware. International Workshop on Recent Advances in Intrusion Detection, 317-338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15512-3_17
[25] Egele, M., Scholte, T., Kirda, E., & Kruegel, C. (2008). A survey on automated dynamic malware-analysis techniques and tools. ACM computing surveys (CSUR), 44(2), 1-42. https://doi.org/10.1145/2089125.2089126
[26] Sharma, A. & Sahay, S. K. (2016). An effective approach for classification of advanced malware with high accuracy. https://doi.org/10.14257/ijsia.2016.10.4.24
[27] Hordri, N. F., Ahmad, N. A., Yuhaniz, S. S., Sahibuddin, S., Ariffin, A. F. M., Saupi, N. A. M., Senan, M. F. E. M., et al. (2018). Classification of malware analytics techniques: a systematic literature review. International journal of security and its applications, 12(2), 9-18. https://doi.org/10.14257/ijsia.2018.12.2.02
[28] Jihun, K., Sung, W. L., & Jonghee, Y. (2021). Expression of malware characteristics using API sequence. Journal of Smart Technology Applications, 2(1).
[29] Eagle, C. (2011). The IDA pro book.
[30] See https://www.hexrays.com/products/ida/support/idapython_docs/
[31] Zhou, B., Xia, X., Lo, D., Tian, C., & Wang, X. (2014). Towards more accurate content categorization of API discussions. Proceedings of the 22nd International Conference on Program Comprehension, 95-105. https://doi.org/10.1145/2597008.2597142
[32] Uppal, D., Sinha, R., Mehra, V., & Jain, V. (2014). Exploring behavioral aspects of API calls for malware identification and categorization. International Conference on Computational Intelligence and Communication Networks, 824-828. https://doi.org/10.1109/CICN.2014.176
[33] See http://www.rohitab.com/apimonitor
[34] Sami, A., Yadegari, B., Rahimi, H., Peiravian, N., Hashemi, S., & Hamze, A. (2010). Malware detection based on mining API calls. Proceedings of the 2010 ACM symposium on applied computing, 1020-1025. https://doi.org/10.1145/1774088.1774303
[35] Sathyanarayan, V. S., Kohli, P., & Bruhadeshwar, B. (2008). Signature generation and detection of malware families. Australasian Conference on Information Security and Privacy, 336-349. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70500-0_25
[36] Ye, Y., Wang, D., Li, T., & Ye, D. (2007). IMDS: Intelligent malware detection system. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 1043-1047. https://doi.org/10.1145/1281192.1281308
[37] See https://www.virustotal.com

Contact information:

Jihun KIM, M.S.
Dept. of Computer Engineering, Yeungnam University,
280 Daehak-Ro, Gyeongsan, Gyeongbuk, Republic of Korea
E-mail: f13521@naver.com

Sungwon LEE, M.S.
Dept. of Computer Engineering, Yeungnam University,
280 Daehak-Ro, Gyeongsan, Gyeongbuk, Republic of Korea
E-mail: noke15@ynu.ac.kr

Jonghee YOUN, PhD, Professor
(Corresponding author)
Dept. of Computer Engineering, Yeungnam University,
280 Daehak-Ro, Gyeongsan, Gyeongbuk, Republic of Korea
E-mail: youn@yu.ac.kr