Malsight: Exploring Malicious Source Code and Benign Pseudocode for Iterative
Binary Malware Summarization

Haolang Lu§, Hongrui Peng§, Guoshun Nan*,
Jiaoyang Cui, Cheng Wang, Weifei Jin
Beijing University of Posts and Telecommunications, Beijing, China
lhl_2507@bupt.edu.cn, penghongruif@bupt.edu.cn, nanguo2021@bupt.edu.cn,
skyboard@bupt.edu.cn, wang.me@bupt.edu.cn, weifeijin@bupt.edu.cn
Abstract

Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored.

To this end, we propose Malsight, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summaries, MalS and MalP, using an LLM and manually refine this dataset with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS dataset and a benign pseudocode dataset. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed Malsight. Notably, our proposed MalT5, with only 0.77B parameters, delivers comparable performance to much larger ChatGPT3.5.

Index Terms:
Malware, Code Summarization, Binary Code

1 Introduction

The AV-TEST Institute [1] recently reported that over 450,000 new malicious files and potentially unwanted applications are registered daily, showing a high demand for malware understanding. Binary malware summarization [2] is a reverse engineering [3] task that aims to automatically generate concise human-readable descriptions of binary executable malicious files. The summarization provides security analysts with a quick understanding of the malware’s functionality and patterns when source code is unavailable, thereby benefiting a wide range of applications such as malware cracking [4] [5] [6], malware family classification [7] [8], binary code similarity detection [9] [10] [11], and large-scale malware behavior analysis [12] [13] [14].

Figure 1: The comparison of source code (left) and its pseudocode (right). The pseudocode includes significantly more content and a more complex structure, and it also strips key semantic cues such as function names.

Existing reverse engineering tools, such as IDA [15] and Ghidra [16], can decompile executables into higher-level C-like pseudocode, but the output still lacks easy-to-understand semantic information. Consequently, a line of work attempts to generate human-readable summaries from pseudocode. Early studies rely on manual parsing or rule-based summary generation [17] [18]. Recent large language models (LLMs), such as BinT5 [19], HexT5 [20], CodeGen [21], and WizardCoder [22], have shown great potential to produce more informative summaries. However, these data-driven approaches still face critical issues, including poor usability, inaccurate explanations, and incomplete summaries [2]. Figure 1 shows the underlying reasons for these issues by comparing the source code of the function “initLevel” with the corresponding pseudocode. We observe that the pseudocode 1) contains significantly more content, growing from 20 lines in the source to 117 lines; 2) has a more complex and obscure structure with multi-level nesting and entangled logic, involving 29 more calls and 29 more if statements than the source code on the left; and 3) strips key semantic cues such as variable names and function names. For example, the function “initLevel” in the source code is reduced to the meaningless symbol “sub_404018”.

To address the above challenges, we present Malsight, a novel binary malware summarization framework that can iteratively generate descriptions of executable malware by exploring malicious source code and benign pseudocode. The proposed Malsight involves three key ingredients, including a malware dataset MalS, an LLM-based malware summarization model MalT5, and an evaluation metric BLEURT-sum. We describe the workflow of the proposed framework in four steps as follows.

Constructing MalS: An LLM-based summarization model relies heavily on high-quality annotations to align with domain-specific knowledge, so fine-tuning the LLM requires high-quality malware pseudocode summaries. However, no public malware pseudocode summarization dataset is available so far, and building such a benchmark is quite challenging because accurate annotation requires substantial human involvement. Figure 1 illustrates three challenges of understanding malware pseudocode. To tackle this issue, we instead construct MalS, a large-scale summarization dataset built with an LLM from malicious C language source code crawled from GitHub. The proposed MalS involves nearly 90,000 malware source functions, covering 20 types of malware functions. We also construct a small dataset, MalP, for testing. We detail this procedure in Section 4.3.

Training MalT5: We use CodeT5+ [23] as the foundation model of our MalT5. We sequentially fine-tune the proposed MalT5 model on the MalS dataset and an existing benign pseudocode summarization dataset [19]. The underlying intuition is that the malicious semantic knowledge from malware source code summarization and function patterns from benign pseudocode summarization, which are learned from the above two datasets, respectively, can be transferred to the generation of malware pseudocode. By doing so, we can properly mitigate the issue of unavailable malware pseudocode summarization datasets. More details are available in Section 4.4.

Performing Generation: We use an existing tool [15] to generate pseudocode of a binary file and then generate summaries using the MalT5 model. We first use IDA to construct the malware call graph and then develop an algorithm to transform the graph into a function list in reverse order. Then we iteratively feed the first function in the list to MalT5 to generate the summary. More details are provided in Sections 4.1 and 4.2.

Conducting Evaluation: Previous work [24] indicated that existing metrics for generation tasks, such as Bilingual Evaluation Understudy (BLEU) [25], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [26], and Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L) [27], may not be well suited to evaluating binary malware summarization. We thus employ BLEURT-sum, which is more sensitive to the quality of the pseudocode summary, thereby benefiting the evaluation in practice. More descriptions are given in Section 5.1.

We conduct experiments on three datasets to verify the effectiveness of the proposed Malsight framework for binary malware summarization. The contributions of this paper can be summarized as follows. (Footnote 1: We will release our Malsight to contribute to the community.)

  • A binary malware summarization framework. We propose Malsight, a novel framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Our MalT5 can tackle the challenges of entangled logic and stripped semantics in pseudocode.

  • Large-scale datasets for binary malware summarization. We propose MalS and MalP, two novel datasets that can be used to train and test LLMs for binary malware summarization. To the best of our knowledge, the two datasets are the first in the field, involving nearly 90,000 malicious source functions and 20 types. Our MalS and MalP can serve as a benchmark for various binary malware understanding tasks.

  • An LLM-based binary malware summarization model. We propose MalT5, a novel LLM for the summarization task. The proposed MalT5 is lightweight, with only 0.77B parameters.

  • An evaluation metric for the task. We present BLEURT-sum, a novel evaluation metric that is more sensitive to the quality of pseudocode summarization.

  • Extensive experiments. We conduct extensive experiments on three datasets and provide case studies to show why the proposed framework performs best among all baselines. Results show that our MalT5 achieves comparable performance to ChatGPT3.5.

2 Background

2.1 Malware Analysis Engineering

The field of malware analysis engineering focuses on analyzing the functionality of malware by examining its binaries, typically through static analysis methods that involve observing assembly code or pseudocode [28].

2.1.1 Binary Decompilation

Decompilation [29] converts executable files into human-readable pseudocode [30], which is more concise and structured than disassembled assembly code. Unlike disassembly, which maps instruction encodings directly to assembly statements, decompilation relies on algorithms and patterns (e.g., R2 [31], IDA [15], Ghidra [16]) and on emerging methods using LLMs [32]. However, pseudocode lacks semantic information such as function names. Decompiled function names are often unreadable (e.g., sub_4061C0 in IDA Pro) [33], providing little useful information for further analysis.

2.1.2 Human Static Analysis

In static analysis, human experts start analyzing from the function entry point [34], inferring functionality from system Application Programming Interface (API) calls, string information, and pseudocode logic. Their main challenge is accurately identifying the core function [35] among numerous functions and methodically tracing the function call [36] process to understand the functionality comprehensively. To assist in this process, we developed Machine Learning-based (ML-based) Malsight, which optimizes and facilitates binary malware analysis.

2.2 NLP Technologies

In the Malsight process, we use Bidirectional Encoder Representation from Transformers (BERT) [37] and Text-to-Text Transfer Transformer (T5) [38] architecture language models to complete specific tasks. For the core code summary task, we build a CodeT5+ model combined with transfer learning.

2.2.1 BERT Family

BERT is a large-scale transformer-based language model pre-trained on a wide corpus of text using a self-supervised learning approach. The design of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) training tasks makes BERT perform well in the tasks of Sequence Labeling (SL), such as Named Entity Recognition (NER).

Building on BERT, CodeBERT [39] learns code semantics through Code-Conditioned masked language modeling (CMLM) and natural language documentation generation (NLG). CodeBERT has been shown to perform well on code-related tasks, and since it is derived from BERT, we have reason to believe that this model can be fine-tuned to solve the problem of SL in pseudocode.

2.2.2 T5 Family

T5 [38], or Text-to-Text Transfer Transformer, is a sequence-to-sequence model based on the Transformer architecture that unifies various Natural Language Processing (NLP) tasks into a single framework, including text classification, question answering, summarization, translation, and text generation.

CodeT5 [40] is an encoder-decoder model supporting code understanding and generation, built on the T5 architecture. It uses Natural Language-Programming Language (NL-PL) bimodal data for pre-training with identifier tagging and masked identifier prediction tasks. CodeT5+ [23] introduces greater architectural flexibility and additional pre-training tasks, with instruction tuning to enhance alignment with natural language instructions. This results in significant performance improvements on various code-related tasks.

Previous works like HexT5 [20] and BinT5 [19] developed datasets to train models for binary code understanding, including code summarization tasks. These efforts demonstrate the potential of T5-based models in binary code summarization for malware analysis.

2.2.3 Transfer Learning

Transfer learning involves training a model on a source task with abundant labeled data to learn general features. When it is difficult to obtain sufficient datasets for training, transfer learning can be used to supplement them with similar or related datasets [41].

In Malsight, we fine-tune the CodeT5+ [23] model to achieve transfer learning from the source code summarization task to the decompiled code summarization task. Besides, we use dynamic and static annotation to implement feature enhancement to compensate for the poor transfer effect caused by the highly limited similarity between the source code and the stripped decompiled code.

2.3 Code Summary Evaluation

In code summary model evaluation, NLP text similarity algorithms compare generated results with a reference test set, replacing costly human evaluations. These algorithms are categorized into word overlap and word embedding measures.

2.3.1 Words’ Overlap Measure

Early text similarity measures like BLEU [25] and ROUGE [27] rely on word n-gram overlap between generated and reference text, with BLEU focusing on precision and ROUGE on recall. However, they lack semantic understanding. METEOR [26] integrates n-gram overlap and semantic similarity using WordNet, providing additional semantic insight.

Recent work [42] highlights limitations of words’ overlap in code summary tasks. It shows that similar structures may yield high similarity scores despite differing semantics.
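The limitation described above can be reproduced with a few lines of Python. The following is a minimal sketch of clipped n-gram precision and recall, not an implementation of BLEU or ROUGE themselves (it has no brevity penalty and no longest-common-subsequence step); the function names and the three example sentences are illustrative assumptions rather than data from our experiments.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(candidate, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap.

    Precision is the BLEU-style view (matches / candidate n-grams);
    recall is the ROUGE-style view (matches / reference n-grams).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matched = sum((cand & ref).values())  # Counter intersection clips counts
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical summaries: same sentence skeleton vs. same meaning.
reference = "open the file and write the buffer to disk"
unrelated = "open the socket and send the buffer to server"
paraphrase = "saves data into a file"

print(overlap_scores(unrelated, reference))
print(overlap_scores(paraphrase, reference))
```

In this toy case the structurally similar but semantically different sentence obtains higher precision and recall than the paraphrase, which is the failure mode that motivates embedding-based measures.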

2.3.2 Words’ Embedding Measure

The words’ embedding measure evaluates semantic similarity by analyzing the distance between sentence embeddings in a vector space, often utilizing neural network learning.

word2vec [43] is a static embedding model that represents words as points in a vector space, facilitating the proximity of semantically similar words. MoverScore [44] uses an n-gram optimized Word Mover’s Distance (WMD) [45] to measure similarity and employs various embedding models like ELMo [46] and BERT.
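To make the idea of an embedding-based measure concrete, the sketch below averages static word vectors into a sentence vector and compares sentences by cosine similarity. The two-dimensional vectors in `TOY_VECTORS` are invented for illustration and are not taken from word2vec or any trained model; averaging is also only one simple way to pool word vectors.

```python
import math

# Hypothetical 2-D "embeddings"; real models use hundreds of dimensions.
TOY_VECTORS = {
    "write": (0.9, 0.1),
    "save": (0.85, 0.2),
    "file": (0.2, 0.9),
    "disk": (0.3, 0.85),
    "socket": (-0.7, 0.4),
    "send": (-0.6, -0.5),
}


def sentence_vector(sentence):
    """Average the vectors of all known words in the sentence."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        raise ValueError("no known words in sentence")
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine(sentence_vector("write file"), sentence_vector("save disk"))
unrelated = cosine(sentence_vector("write file"), sentence_vector("send socket"))
print(related, unrelated)
```

Here "write file" and "save disk" share no words, yet their vectors are close, whereas a pure overlap measure would score the pair at zero.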

BLEURT [47] stands out as a versatile metric designed for assessing various natural language generation tasks, which combines the advantages of both Words’ Overlap Measure and Words’ Embedding Measure. It achieves this by integrating diverse lexical and semantic-level supervision signals into its pre-training process and leveraging synthetic data based on pre-trained BERT, ensuring its effectiveness and versatility in various evaluation scenarios.

Figure 2: Workflow of Malsight. The procedure involves three steps: graph traversal, annotation generation, and model summary.

3 Motivation and Overview

The construction of the code summary framework mainly includes annotation generation and code summary model construction, as shown in Figure 2. During the evaluation phase, we tested several evaluation methods and found a reasonable way to build an evaluation model for the code summary task. Simultaneously, our work involves the construction of multiple datasets (for subsequent stages of training and evaluation of transfer learning-based models).

3.1 Code Summarization Process

The code summary task is split into three steps, which are function list extraction, annotation generation, and code LLM summary.

3.1.1 Function List Extraction

As mentioned, existing code summary methods for binaries focus only on the internal information of each function. We instead introduce the call relationships between functions and treat the entire binary as the processing unit. For example, when function func_E in Figure 2 calls func_F, it is difficult for the subsequent code summarization model to correctly summarize the functionality of func_E without any information about func_F (we assume that the function name of func_F has been stripped). Constructing a function list in reverse call order provides a basis for the subsequent recovery of the functionality of sub-functions.

3.1.2 Annotation Generation

We iterate through the list of functions. Assuming func_F has already been processed, func_E is first annotated by the static annotator and the dynamic annotator, respectively. The static annotator produces static annotations based on the internal information of the function code, while the dynamic annotator adds dynamic annotations based on the generated summary of the sub-function (func_F) called in the function. In other words, the program sequentially restores functions according to the Control Flow Graph (CFG) from the outermost to the innermost and passes function summary results inward.

3.1.3 Code LLM Summary

The annotated func_E is then fed into the code summary model for final code summary generation. In our work, we use transfer learning to adapt the model to both the functionality of malware functions and the structural features of decompiled pseudocode. We fine-tune the CodeT5+ model on the code summary task. The tokenizer splits the code into tokens, which are embedded and combined by the self-attention mechanism in the encoder into a contextual vector representation. The decoder of the fine-tuned model then generates the code summary from this representation.

Figure 3: A Sample Of Existing Evaluation Results. Two unrelated pairs of two sentences form Sample I, and two related pairs form Sample II, but the evaluation results contradict expectations.

3.2 Evaluation Method

Our research has found that existing methods cannot simultaneously measure the meaning, structure, word frequency, and other features of the reference sentence and the candidate sentence, so the model’s performance may be misjudged. Figure 3 shows the evaluation results of BLEU, METEOR, and ROUGE-L on two pairs of real code summaries. Two code summaries without any semantic relation receive high evaluation scores (blue-framed), while two semantically similar code summaries receive low scores (red-framed), demonstrating the shortcomings of existing methods. In the following work, we construct BLEURT-sum, an ML-based code summary evaluation method, from a set of positive and negative samples composed of related and unrelated sentence pairs. We evaluate the usability of our model and prior art by measuring their ability to distinguish between positive and negative samples.

3.3 Datasets Construction

For the two core tasks mentioned above, binary code summary and code summary model evaluation, we build corresponding datasets.

3.3.1 Dataset For Code Summary Model

To avoid the data shift problem, training the code summary model requires a large dataset of malware pseudocode summaries. Unfortunately, malware datasets are typically represented as collections of compiled binary files [48], with the binary code stripped and possibly structurally obfuscated. Consequently, creating code summary datasets for binary malware may be infeasible without resorting to labor-intensive manual summarization.

In this paper, our key insight is that the code summary model requires two capabilities: an understanding of malware functionality and adaptability to decompiled pseudocode formats (including the ability to deal with annotated code). Therefore, we build source-based malware datasets and pseudocode-based benign software datasets to train the model on these two capabilities separately. Based on Sourcefinder [49], we were able to find malware source repositories on GitHub. We generate descriptive labels for the extracted malware functions using a sophisticated language model, followed by manual verification and optimization. Meanwhile, we use the Capybara dataset provided by BinT5 [19] (a benign software dataset) to train the model’s adaptability to pseudocode structure.

3.3.2 Dataset For Evaluation Model

In the evaluation phase, the evaluation method measures the similarity between the model’s generated results and the reference results to assess generation quality. Given two sentences, the generated result $S_g$ and the reference result $S_r$, most evaluation methods output a score $Score$ as the evaluation result. Therefore, to use machine learning methods, it is necessary to construct a dataset in the format $\{S_g, S_r, Score\}$ with $Score \in [0, 1]$. The challenge is that, once a dataset of pairs $\{S_g, S_r\}$ is obtained, it is difficult to obtain an accurate $Score$ for each pair. In our subsequent work, we propose a reasonable algorithmic flow for constructing the labeled dataset $EvaS$.
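As an illustration of the record format $\{S_g, S_r, Score\}$ and of how a metric's ability to separate related from unrelated pairs might be checked, consider the sketch below. The class `EvalPair`, the `jaccard` stand-in metric, the 0.5 threshold for calling a pair positive, and the four sentence pairs are all assumptions made for this example; they do not describe the actual EvaS construction procedure or BLEURT-sum.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalPair:
    generated: str   # S_g, the model output
    reference: str   # S_r, the reference summary
    score: float     # Score, constrained to [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")


def jaccard(a: str, b: str) -> float:
    """Toy word-overlap metric, used only to exercise the check below."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


def ranking_accuracy(metric: Callable[[str, str], float],
                     pairs: List[EvalPair],
                     threshold: float = 0.5) -> float:
    """Fraction of (positive, negative) combinations in which the metric
    scores the positive pair strictly higher than the negative pair."""
    pos = [metric(p.generated, p.reference) for p in pairs if p.score >= threshold]
    neg = [metric(p.generated, p.reference) for p in pairs if p.score < threshold]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1 for a in pos for b in neg if a > b)
    return wins / (len(pos) * len(neg))


pairs = [
    EvalPair("writes buffer to file", "writes the buffer to a file", 1.0),
    EvalPair("saves data on disk", "writes the buffer to a file", 1.0),
    EvalPair("opens a network socket", "deletes registry key", 0.0),
    EvalPair("writes the key to a file", "writes the buffer to a socket", 0.0),
]
print(ranking_accuracy(jaccard, pairs))
```

On these four hand-made pairs the overlap metric ranks only half of the positive/negative combinations correctly, because the paraphrase shares no words with its reference while one unrelated pair shares many.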

3.3.3 Dataset For Static Annotator

In the annotation generation process, the static annotator includes a core information extraction module (described in detail in Section 4.2.1). Because it is difficult to accurately complete the required functions using static methods alone, we use a machine learning model to perform sequence labeling on the pseudocode. We construct the dataset AnnoS to train and test this model.

TABLE I: The Proposed Datasets

Sets for code summary model
Dataset    Size (functions)    Code language    Annotated?         Usage
MalS       89,609              C                No                 Train phase 1
MalP       500                 pseudocode       Yes                Test
BenignC    96,835              pseudocode       Yes                Train phase 2

Sets for annotation extractor model
Dataset    Size (functions)    Code language    Anno num (avg)     Usage
AnnoS      95,000              pseudocode       3.87               Train & Test

Sets for evaluation model
Dataset    Size (pairs)        Pos:Neg          Length (avg)       Usage
EvaS       127,510             1:1              9.6                Train & Test

To sum up, we mainly completed the construction of three sets of datasets in different application fields, as shown in Table I.

4 Code Summarization Workflow

Following our breakdown of the malware code summary task in Figure 2, our implementation first extracts the reverse function list, and then sequentially generates static and dynamic annotations for the items in the function list, and finally passes them into the code summary model.

In this process, we design and train two domain-specific large models, apply a general-purpose large model, and implement several algorithms. We cover the implementation of each step separately in this section.

4.1 Function List Extraction

As mentioned earlier in Section 3.1, in the first step of the workflow, we extract the list of reverse functions from the CFG of the malware binary.

Since existing methods [50] do not provide a completely accurate CFG extraction flow, we implement a pluggable CFG extraction module. It provides a processing scheme from the binary file $B_{Mal}$ to a digraph $G_M$ that serves as the CFG. An inverse topological traversal algorithm then extracts the functions from the CFG in the opposite direction of the call chain, expressed as $L_{G_M} = [f_1, f_2, \ldots, f_n]$ $(n = G_M.vertices)$, where $f_i$ represents the $i$-th function in the inverse topological order of $G_M$.

In this study, we construct the algorithm $REsort$, expressed as $L_{G_M} = REsort(G_M)$ (see Appendix B for details). By applying algorithms such as Tarjan [51], Dijkstra [52], and Depth First Search (DFS), this approach addresses the obstacles caused by cyclic calls and partially connected graphs when generating the reverse-order list. The function list $L_{G_M}$ follows the order from outside to inside in the function call graph, which ensures that a later-called function appears earlier in the list and is processed first in subsequent steps.

The following process takes the function instances from $L_{G_M}$ in forward order, so that restoration proceeds from the outer layer of the CFG to the inner layer, specifically from the outer API calls to the main function.
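A callee-first ordering that tolerates cycles and disconnected components can be sketched with Tarjan's strongly connected components algorithm alone, since it emits each component only after every component reachable from it. The code below is a simplified illustration under that assumption and is not the $REsort$ algorithm of Appendix B; the graph encoding (a dict from caller to callees), the function name `callee_first_order`, and the alphabetical tie-break inside a cycle are choices made for this example.

```python
from typing import Dict, List


def callee_first_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Order functions so that callees precede their callers.

    Functions in a call cycle form one strongly connected component and
    are emitted together (sorted by name, an arbitrary tie-break).
    Recursive DFS: suitable for small graphs only.
    """
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    order: List[str] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for callee in call_graph.get(node, []):
            if callee not in index:
                visit(callee)
                lowlink[node] = min(lowlink[node], lowlink[callee])
            elif callee in on_stack:
                lowlink[node] = min(lowlink[node], index[callee])
        if lowlink[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            order.extend(sorted(component))

    # Start from every node so disconnected parts of the graph are covered.
    nodes = set(call_graph) | {c for cs in call_graph.values() for c in cs}
    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return order


graph = {
    "main": ["func_E", "func_G"],
    "func_E": ["func_F"],
    "func_G": ["func_H"],
    "func_H": ["func_G"],  # cyclic call
    "orphan": [],          # not reachable from main
}
print(callee_first_order(graph))
```

For this hypothetical graph, func_F is listed before func_E, both members of the func_G/func_H cycle come before main, and the unreachable function is still included.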

4.2 Annotation Generation

Figure 4: Annotation Workflow. If the function currently being processed is Func, its dynamic annotation comes from the child function it calls (the callee), and the static annotation is generated in three steps. Func’s code summary is then passed to the Caller as a dynamic annotation.

In the order of traversal provided by the reverse function list, annotations are added to each function in turn to provide richer information. Annotations can be divided into static and dynamic types.

We have observed that attackers frequently employ techniques like stripping to hinder reverse engineering efforts. This process removes crucial symbolic information, such as identifier names, from the binaries, leading to significant semantic loss in the generated pseudocode. As a result, code summary models for malware face substantial challenges, and the performance of code summarization tasks is adversely affected.

To address this issue, we propose utilizing dynamic and static annotation to supplement the semantic information and enhance the features of the stripped pseudocode as Figure 4 shows. This approach aims to compensate for the poor transferability caused by the stark dissimilarity between the source code and the stripped decompiled code.

4.2.1 Static Annotation

Based on our long-term exploration of pseudocode for stripped malware, we deem that although the stripped pseudocode has a serious semantic loss, some extremely critical API calls (such as operating system APIs) and some special forms of strings that are preserved after stripping provide us with ideas for behavior analysis and semantic recovery of malware.

Consequently, we consider building a static annotation module to provide additional information for subsequent code summaries. In general, the static annotation process can be divided into three parts: sequence labeling, online retrieval and annotation generation.

Sequence labeling model: By manually labeling approximately 300,000 tokens within nearly 80,000 functions, we construct a labeled dataset $Set_{CSL}$ based on $BenignS$ to train the sequence labeling model.

Formally, the function $f_i$ is first sliced into an $n$-token code sequence by the tokenizer $T$. The $n$-token code sequence $s_i = T(f_i) = \{t_0, t_1, \ldots, t_{n-1}\}$ and the labels of the tokens $L_i = \{l_0, l_1, \ldots, l_{n-1}\}$ are combined into $d_i = \{(t, l) \mid t \in s_i, l \in L_i\}$. Equation (1) formalizes this dataset.

$$Set_{CSL} = \{d_0, d_1, \ldots, d_N\} \tag{1}$$
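The construction of one sample $d_i$ can be sketched as follows, reading the set-builder notation as pairing each token with the label at the same position. The whitespace tokenizer, the helper name `build_sample`, and the meaning attached to the three label ids (0 for ordinary tokens, 1 for key API calls, 2 for special strings) are assumptions for illustration; the actual tokenizer is the one belonging to the sequence labeling model.

```python
from typing import List, Tuple

# Hypothetical label ids for the three-class labeling task.
OTHER, KEY_API, SPECIAL_STRING = 0, 1, 2


def build_sample(tokens: List[str], labels: List[int]) -> List[Tuple[str, int]]:
    """Pair each token t_k with its label l_k to form one sample d_i."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return list(zip(tokens, labels))


# A stripped pseudocode fragment, split on whitespace for simplicity.
tokens = 'sub_401000 ( ) { RegOpenKeyExA ( v1 , "Software\\\\Run" ) ; }'.split()
labels = [OTHER, OTHER, OTHER, OTHER, KEY_API, OTHER, OTHER, OTHER,
          SPECIAL_STRING, OTHER, OTHER, OTHER]

d_i = build_sample(tokens, labels)
dataset = [d_i]  # Set_CSL would collect one such sample per function

# Tokens a static annotator would pass on to the retrieval step.
key_tokens = [t for t, l in d_i if l != OTHER]
print(key_tokens)
```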

Subsequently, we use the CodeBERT [39] model to train the sequence labeling task on the dataset $Set_{CSL}$.

$B$ is the CodeBERT base model, and $C$ is the classifier head. Given an input sequence, the complete model outputs the predicted labels of its tokens. Equations (2) and (3) formalize this.

$$B(s_i) = \{o_0, o_1, \ldots, o_{n-1}\} \tag{2}$$
$$C(o_i) = y_i \tag{3}$$

We further formalize the target function in Equation (4), where $l_{i_c}$ is the truth label and $y_{i_c}$ is the Softmax probability for the $c^{th}$ class.

$$LF = -\sum_{c=0}^{2} l_{i_c} \log y_{i_c} \tag{4}$$

By optimizing LF𝐿𝐹LFitalic_L italic_F, B𝐵Bitalic_B and C𝐶Citalic_C are trained simultaneously. This necessitates that the model effectively classifies code-tokens to accurately label the key API calls and special strings within the pseudocode.
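As a worked illustration of equation (4), the per-token loss can be computed as in the sketch below. This is not the authors' training code: the function names, the example logits, and the class ordering (0 for normal tokens, 1 for key API calls, 2 for special strings) are assumptions made for illustration only.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw classifier scores into probabilities y_{i_c}."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_loss(logits: List[float], true_class: int) -> float:
    """Cross-entropy of equation (4) for one token with a one-hot truth label."""
    probs = softmax(logits)
    one_hot = [1.0 if c == true_class else 0.0 for c in range(len(logits))]
    return -sum(l * math.log(y) for l, y in zip(one_hot, probs))


# Hypothetical classifier outputs for one token, before Softmax.
example_logits = [0.2, 2.5, -1.0]
loss_correct = token_loss(example_logits, true_class=1)  # truth matches the top score
loss_wrong = token_loss(example_logits, true_class=2)    # truth is the lowest score
```

The loss is small when the highest-scoring class matches the truth label and grows as the probability assigned to the true class shrinks, which is what drives both $B$ and $C$ during optimization.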

Label to annotation: Once we obtain the key API calls and special strings in the stripped pseudocode, we use GitHub Code Search [53] to retrieve relevant context from GitHub repositories. In the implementation, we keep the first three blocks of the search results, because the code search has already sorted the results by relevance [54].

The outcome of random sampling and manual discrimination reveals that approximately 54.8% of the function contexts contain code comments closely associated with the functionality of the function. For the remaining functions, nearly 90% also offer contextual information related to the function’s operation, such as parameter names, interconnected functions, and processing logic. Only a small number of functions yield invalid search results.

The filtered and preprocessed code snippets will be continuously input into the prompt-based generic model for generating static annotation.

4.2.2 Dynamic Annotation

As shown in Figure 4 (the blue parts represent the steps in which the annotation was added), the summary of the callee is provided to the caller as a complement to the semantic information, which we define as dynamic annotation. This is consistent with how reverse engineers actually analyze binary malware, i.e., analyzing the call relationships from the outer layer of the CFG to the inner layer (corresponding to the function list generated in Section 4.1) and summarizing the functions from the inside out (corresponding to the passing of dynamic annotation). In this way, we make full use of the dynamic behavior characteristics implied by the call relationships between functions in the pseudocode.
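The passing of callee summaries to callers can be sketched as a post-order walk over a call graph. The sketch below is only an illustration under stated assumptions: the call graph is given as a plain dictionary, `toy_summarizer` is a hypothetical stand-in for the summary model, and the handling of recursive calls (skipping a callee that is still being processed) is a choice made here, not something the paper specifies.

```python
from typing import Callable, Dict, List


def summarize_bottom_up(
    call_graph: Dict[str, List[str]],
    summarize: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Summarize callees before callers and pass callee summaries upward."""
    summaries: Dict[str, str] = {}
    in_progress = set()

    def visit(func: str) -> None:
        if func in summaries or func in in_progress:
            return  # already summarized, or part of a recursive call cycle
        in_progress.add(func)
        for callee in call_graph.get(func, []):
            visit(callee)
        # Dynamic annotation: summaries of the callees that are available so far.
        notes = [
            f"{c}: {summaries[c]}"
            for c in call_graph.get(func, [])
            if c in summaries
        ]
        summaries[func] = summarize(func, notes)
        in_progress.discard(func)

    for func in call_graph:
        visit(func)
    return summaries


def toy_summarizer(func: str, callee_notes: List[str]) -> str:
    """Hypothetical stand-in for the summary model; it only records its inputs."""
    return f"summary of {func} using {len(callee_notes)} callee note(s)"


graph = {"main": ["decrypt", "connect"], "decrypt": ["xor_key"], "connect": [], "xor_key": []}
result = summarize_bottom_up(graph, toy_summarizer)
```

In this toy graph, `xor_key` is summarized first and `main` last, so every caller sees the summaries of its callees when its own summary is generated.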

4.3 Building Malware Datasets

In the traditional scheme of building datasets for decompiled code, the datasets are built at the function level [55]. The source function $f_{n}^{Source}$ is compiled and linked with other modules to generate an executable file, as $f_{n}^{Bin}$, and then decompiled to obtain the pseudocode form $f_{n}^{pseudo}$ of the corresponding function. Equation (5) formalizes this dataset, where $SUM()$ is the extraction method for the code summary.

$$Set_{ideal}=\{(f_{n}^{pse},f_{n}^{sum})\mid f_{n}^{sum}=SUM(f_{n}^{pse})\} \tag{5}$$

Out of 2,289 GitHub repositories that were determined to be malware, we extracted close to 30K functions. We filter out duplicated functions, functions shorter than five lines, and badly formatted functions, resulting in a dataset of 89,609 functions. (Without strict filtering, the train and test sets may overlap, which would yield inflated results.)
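A minimal sketch of such a filter is shown below. It covers only two of the three criteria (duplicates and functions shorter than five lines), because the text does not define what counts as a badly formatted function. Treating two functions as duplicates when they match after whitespace normalization, and counting only non-empty lines, are assumptions of this sketch.

```python
import hashlib
from typing import Iterable, List


def normalize(source: str) -> str:
    """Collapse whitespace so that trivially reformatted copies hash identically."""
    return " ".join(source.split())


def filter_functions(functions: Iterable[str], min_lines: int = 5) -> List[str]:
    """Drop duplicates and functions with fewer than `min_lines` non-empty lines."""
    seen = set()
    kept = []
    for src in functions:
        line_count = sum(1 for line in src.splitlines() if line.strip())
        if line_count < min_lines:
            continue
        digest = hashlib.sha256(normalize(src).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(src)
    return kept


long_fn = "int f(int a) {\n  int b = a;\n  b += 1;\n  b *= 2;\n  return b;\n}"
reindented_copy = long_fn.replace("\n  ", "\n    ")  # same code, different indentation
short_fn = "int g() {\n  return 0;\n}"
kept_functions = filter_functions([long_fn, reindented_copy, short_fn])
```

Here only `long_fn` survives: the re-indented copy is detected as a duplicate and `short_fn` has fewer than five lines.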

Unlike benign open-source projects, which are well maintained, most of the functions we extracted have no comments in context that we could use as code summaries. Fortunately, the semantic information in the source code is rich enough that we could use a well-designed prompt to generate the code summaries with GPT3.5-Turbo [56]. In this way, we obtain the dataset MalS in the format of equation (6). Further, we build a 500-function dataset MalP for testing our model. MalP is relatively small because it was obtained by manually compiling and decompiling the git repositories. MalP is compiled with the makefiles provided by the developers, so we do not separately report the compilation-optimization configuration.

$$Set_{MalS}=\{(f_{n}^{Sou},f_{n}^{sum})\mid f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{6}$$
$$Set_{MalP}=\{(f_{n}^{pse},f_{n}^{sum})\mid f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{7}$$

As mentioned above, the malware training set we built consists of source code, so another dataset is needed to help our model understand the features of stripped functions. For transfer learning, we use the Capybara dataset (a benign software dataset) provided by BinT5 [19] and annotate it so that our model can adapt to the annotated code summary task. It is provided as $Set_{Capybara}$, and we process it into $Set_{benignC}$, where $ANN()$ is the static annotation generation method that forms the PL-NL bimodal data $f_{n}^{ann}$.

$$Set_{Capybara}=\{(f_{n}^{pse},f_{n}^{sum})\mid f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{8}$$
$$Set_{benignC}=\{(f_{n}^{ann},f_{n}^{sum})\mid f_{n}^{ann}=ANN(f_{n}^{pse})\} \tag{9}$$

In the process of training the code summary model, we use $Set_{MalS}$ and $Set_{benignC}$ to complete the transfer learning process, and $Set_{MalP}$ to test during the evaluation phase.

Figure 5: Code Summary Topic Distribution. The results of topic analysis for $Set_{MalS}$ indicate that its distribution conforms to the functional distribution of malware functions.

For $Set_{MalS}$, 25% of the samples were extracted for thematic analysis of function functionality, as shown in Figure 5. Our analysis confirms that $Set_{MalS}$ and $Set_{benignC}$ have significantly different topic distributions. The former contains a large number of security-related functions, with a non-negligible share of code for locking, permissions, and encryption, while the latter contains a large amount of code related to game logic and driver calls, which is irrelevant to malware.

4.4 Code Summary Model

In Malsight, we fine-tune the CodeT5+ model to enable transfer learning from source code summarization to decompiled pseudocode summarization. To achieve this, we have designed two distinct phases of fine-tuning to facilitate smooth transfers that accommodate variations in data characteristics and distribution biases across different datasets.

Phase 1. In the initial phase, we fine-tune the model on the dataset we built from malware source code, $Set_{MalS}$. Malware source code has a semantic structure similar to that of ordinary code, but it also has behavioral and semantic features that the latter lacks, represented by code fragments with malicious purposes, such as self-replication and propagation, illegal access to system resources, and vulnerability exploitation. Through this phase of fine-tuning, the model learns the behavioral and semantic features of such malicious code. In this phase, the adverse effects of data shift are mitigated by using $Set_{MalS}$, which has the same functional distribution as real-world malware.

Phase 2. In this phase, the model learns semantic information from the pseudocode and the annotation text simultaneously, which helps it generate high-quality code summaries. We use the PL-NL bimodal dataset BenignC to fine-tune the encoder of the model obtained in the previous phase while the decoder’s parameters are frozen. By fine-tuning only the encoder, we allow the model to adapt to the specific task without updating the weights too much. This can lead to a more robust and generalized understanding of the data, and it also shortens training because fewer parameters are updated. Specifically, the raw input is processed into the standardized format of equation (10), where $t_{c_{i}}$ are the tokens of a code sequence, $t_{a_{i}}$ are the tokens of the corresponding annotation, and $t_{sep}$ is a special token in CodeT5+ that separates inputs of different modalities. Inputs in this format help the encoder distinguish the two modalities and learn the correlation between them.

$$f_{n}^{ann}\to\{t_{c_{0}},\ldots,t_{c_{n-1}},t_{sep},t_{a_{0}},\ldots,t_{a_{m-1}}\} \tag{10}$$
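The layout of equation (10) can be illustrated with a few lines of code. This is a sketch under assumptions: tokens are produced by whitespace splitting rather than by the CodeT5+ tokenizer, and `"<sep>"` is a hypothetical placeholder for the model's actual separator token, which the text does not name.

```python
from typing import List

SEP_TOKEN = "<sep>"  # hypothetical placeholder for the separator token t_sep


def build_bimodal_input(code_tokens: List[str], annotation_tokens: List[str]) -> List[str]:
    """Concatenate code tokens, the separator, and annotation tokens (equation (10))."""
    return code_tokens + [SEP_TOKEN] + annotation_tokens


# Toy pseudocode line and annotation, split on whitespace for illustration.
code = "v3 = CreateFileA ( a1 , 0x80000000 , 1 , 0 , 3 , 0 , 0 ) ;".split()
annotation = "CreateFileA opens or creates a file and returns a handle".split()
sequence = build_bimodal_input(code, annotation)
```

The separator always sits at index `len(code)`, so the encoder input has exactly one boundary between the programming-language part and the natural-language part.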

After the above two phases of fine-tuning, the model basically has the ability to accept bimodal input composed of pseudocode and annotation text and generate code summary. We use this model to summarize the entire malware in reverse of the function call order.

5 Evaluation Method

As mentioned in Section 3.3.2, the evaluation algorithm for code summary tasks should accept the reference as input and separate available and unavailable generated results. In this section, we introduce our exploration of code summary dataset construction (EvaS) and evaluation model construction (BLEURT-sum) respectively.

5.1 Evaluation Dataset Construction

Using the code summaries from the MalS dataset as a foundation, we curated positive and negative sample pairs for tuning our new evaluation model, BLEURT-sum. A positive sample consists of two code summary sentences that share the same meaning and is initialized as $\{S_{g},S_{r},1\}$, while a negative sample comprises two randomly chosen different sentences and is initialized as $\{S_{g},S_{r},0\}$. $S_{g}$ and $S_{r}$ denote the generated statement and the reference statement, respectively.

To build the dataset in the format $\{S_{g},S_{r},Score\}$ ($Score\in[0,1]$), one possible idea is to build an algorithm, $Score=GenSim(S_{g},S_{r},0\ \text{or}\ 1)$, that automatically generates the label $Score$ required for model training. Therefore, the key task is to construct a reliable $GenSim()$ function. When $S_{r}=SUM(f_{n})$, the constructed $Score$ can be expressed using equation (11), considering that $SUM()$ is a non-deterministic function that produces different outputs for the same input.

$$S_{r}=\text{SUM}(f_{n})\implies Score=\begin{cases}1 & \text{if } S_{g}=\text{SUM}(f_{n})\\ 0 & \text{if } S_{g}=\text{SUM}(\neg f_{n})\end{cases} \tag{11}$$
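The initial 0/1 labeling can be sketched as follows. The sketch assumes that each function already has at least two independently generated summaries (exploiting the non-determinism of the summary generator) and that negatives are drawn uniformly from other functions; the function name `build_pairs`, the data layout, and the sampling strategy are illustrative assumptions rather than the authors' procedure.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (S_g, S_r, label)


def build_pairs(summaries: Dict[str, List[str]], seed: int = 0) -> List[Pair]:
    """Create one positive and one negative pair per function.

    `summaries` maps a function id to at least two summaries of that function,
    and must contain at least two functions.
    """
    rng = random.Random(seed)
    ids = list(summaries)
    pairs: List[Pair] = []
    for fid in ids:
        s_r, s_g = summaries[fid][0], summaries[fid][1]
        pairs.append((s_g, s_r, 1))  # both sentences describe the same function
        other = rng.choice([i for i in ids if i != fid])
        pairs.append((rng.choice(summaries[other]), s_r, 0))  # different function
    return pairs


toy_summaries = {
    "f1": ["Encrypts a buffer with XOR.", "XOR-encrypts the given buffer."],
    "f2": ["Opens a socket to the server.", "Connects to a remote server."],
    "f3": ["Deletes the shadow copies.", "Removes volume shadow copies."],
}
pairs = build_pairs(toy_summaries)
```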

Because the $\{S_{g},S_{r}\}$ pairs were built without taking sentence structure dependencies into account, the Score of equation (11) carries few of the word-overlap-based features (§ 2.3). The natural next step is to combine sentence structure and semantics. Our idea is to determine the proportions of semantic information and sentence structure information in sentence similarity evaluation.

Taking the original 0/1 tag as the semantic feature, we further extract a structural feature $s_{f}$, weight the two by their respective proportions, and add them to obtain $Score$. Assuming that the proportion of semantic information is $p$ and the structural feature function is $Struc()$, the following equations formalize this Score construction method.

$$s_{f}=Struc(S_{g},S_{r}) \tag{12}$$
$$Score=\begin{cases}p+(1-p)\cdot s_{f} & \text{if } S_{g}=\text{SUM}(f_{n})\\ (1-p)\cdot s_{f} & \text{if } S_{g}=\text{SUM}(f_{\neg n})\end{cases} \tag{13}$$

In the implementation, we use the arithmetic mean of BLEU, ROUGE-L, and METEOR as $Struc()$ (each is first normalized to the zero-one interval using a uniform probability distribution). We then solve a least-squares problem and obtain a semantic-to-structural ratio close to 1:4, which gives $p=1/5$.
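The score construction with $p=1/5$ can be written out directly. In this sketch the three metric values are passed in as numbers that are assumed to be already normalized to $[0,1]$; computing BLEU, ROUGE-L, and METEOR themselves is out of scope, and the function names are illustrative.

```python
def struc(bleu: float, rouge_l: float, meteor: float) -> float:
    """Structural feature s_f: arithmetic mean of three metrics scaled to [0, 1]."""
    return (bleu + rouge_l + meteor) / 3.0


def gen_sim(s_f: float, same_function: bool, p: float = 0.2) -> float:
    """Blend the 0/1 semantic tag with the structural feature, weighted by p."""
    semantic = 1.0 if same_function else 0.0
    return p * semantic + (1.0 - p) * s_f


# Hypothetical normalized metric values for one sentence pair.
s_f = struc(bleu=0.30, rouge_l=0.60, meteor=0.45)
score_positive = gen_sim(s_f, same_function=True)
score_negative = gen_sim(s_f, same_function=False)
```

With these toy values $s_f$ is about 0.45, so the positive pair is labeled about 0.56 and the negative pair about 0.36. Under this weighting a negative pair can never score above $1-p=0.8$, and a positive pair can never score below $p=0.2$.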

5.2 Evaluation Model

Based on our insights into the evaluation model, it should be able to combine both structural and semantic features and give an evaluation score for the specific task of code summary.

BLEURT is based on BERT and adds additional pre-training steps on synthetic data between pre-training and fine-tuning. The synthetic data consists of sentence pairs $\langle z,\widetilde{z}\rangle$, where $z$ is a randomly selected sentence and $\widetilde{z}$ is a perturbed version of it. In the additional pre-training of BLEURT, a series of pre-training signals $(\tau_{1},\tau_{2},\ldots,\tau_{9})$ is used to align the model with the desired result. BLEURT uses the sentence pair scores of BLEU, ROUGE, and BERTScore as signals $\tau_{1}$ to $\tau_{3}$, and uses back-translation of the sentence pairs to generate $\tau_{4}$ to $\tau_{7}$.

The rich training signals ensure the universality of BLEURT and the comprehensiveness of the evaluation angle. We then further fine-tune the BLEURT model on the code summary sentence pair dataset. First, the generated text and the reference text are input together into the model and the vector is generated as shown in equation (14). This solution is called BLEURT-sum.

$$v_{[CLS]},v_{S_{g1}},\ldots,v_{S_{gn}},\ldots,v_{S_{rn}}=BLEURT(S_{g},S_{r}) \tag{14}$$

Further, the model applies a linear layer to the [CLS] vector to obtain the predicted score, as shown in equation (15).

$$\hat{Score}=f(S_{g},S_{r})=W\tilde{v}_{[CLS]}+b \tag{15}$$

Finally, the model computes the loss between the labeled and predicted scores and then performs gradient descent (equation (16)).

$$loss=\frac{1}{N}\sum_{n=1}^{N}\left\|Score-\hat{Score}\right\|^{2} \tag{16}$$
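The regression head and its loss can be illustrated numerically. The sketch below uses a three-dimensional toy vector in place of the real [CLS] embedding, and the weights, bias, and target scores are made-up values; it shows only the arithmetic of the linear layer and the mean squared error, not the fine-tuning of BLEURT itself.

```python
from typing import Sequence


def predict_score(v_cls: Sequence[float], w: Sequence[float], b: float) -> float:
    """Linear layer applied to the [CLS] vector: W * v + b."""
    return sum(wi * vi for wi, vi in zip(w, v_cls)) + b


def mse_loss(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error between labeled and predicted scores over N pairs."""
    n = len(targets)
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / n


# Toy [CLS] vector and hypothetical head parameters.
v_cls = [0.5, -1.0, 2.0]
w = [0.2, 0.1, 0.3]
b = 0.1
predicted = predict_score(v_cls, w, b)  # 0.1 - 0.1 + 0.6 + 0.1

# Toy batch of two sentence pairs: labeled scores vs. predicted scores.
batch_loss = mse_loss(targets=[0.56, 0.36], predictions=[0.70, 0.30])
```

Here the predicted score is about 0.7 and the batch loss is about 0.0116, the mean of the squared errors $0.14^2$ and $0.06^2$.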

In existing works, word overlap measures are commonly used in text similarity evaluation but perform poorly for code summary evaluation. Code summaries often include keywords like “return” and “initialize”, which do not contribute significantly to the overall meaning. These keywords can inflate overlap rates and lead to misleadingly high scores. Therefore, BLEURT-sum still considers word overlap measures but also introduces more dimensions, which greatly improves performance in code summary evaluation tasks.

6 Experiments

Our evaluation experiment was designed to answer the following four questions:

  1. RQ1 (§ 6.2). How does Malsight performance compare across two training phases and mainstream code summarization models?

  2. RQ2 (§ 6.3). How do module combinations and different stripping scenarios affect Malsight’s performance, particularly the annotator module?

  3. RQ3 (§ 6.4). How does the new BLEURT-sum evaluation method compare to existing code summarization evaluation metrics?

  4. RQ4 (§ 6.5). How does Malsight perform when applied to real-world malicious software, and how do assessments from human reverse engineers validate its usability?

6.1 Experimental Setup

Our experiments run on an Ubuntu 20.04 system equipped with one Intel Xeon Silver 4210 CPU (2.20 GHz), 125 GB of RAM, and two NVIDIA A40 GPUs. Binary file processing tools include Radare2, IDA Pro v7.5, GCC v9.4.0, and GNU Make v4.2.1. Our programming language is Python v3.8.13, with transformers v4.16.2 and torch v2.1.2+cu121.

6.2 RQ1. Performance Test

6.2.1 Code Summary Performance

During training, the model undergoes two phases. The first phase fine-tunes the model on malware source code (MalS) to learn malicious behavioral features. The second phase fine-tunes it on BenignC, incorporating semantic information from pseudocode and annotations so that it generates high-quality code summaries. In both phases, we split the dataset into training, validation, and test sets with a ratio of 7:2:1. We then evaluated the model at each stage to ensure its effectiveness. The results are shown in Table II.
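A 7:2:1 split can be produced as sketched below. The shuffling, the fixed seed, and the choice to give any rounding remainder to the test set are assumptions of this sketch; the paper does not describe how the split was drawn.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    samples: Sequence[str],
    ratios: Tuple[int, int, int] = (7, 2, 1),
    seed: int = 42,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle the samples and split them into train, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # receives any rounding remainder
    return train, val, test


functions = [f"func_{i}" for i in range(100)]
train_set, val_set, test_set = split_dataset(functions)
```

For 100 samples this yields 70, 20, and 10 items. Removing duplicate functions before splitting keeps identical samples from landing in both the training and test sets.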

Since each phase uses a different dataset, their effectiveness should be verified independently rather than by direct score comparison. Considering there is no universally accepted threshold for model evaluation scores, we provide the performance of another state-of-the-art binary code summarization model (CP-BCS [55]) as a baseline. CP-BCS was tested on its own dataset, which consists of binaries compiled with GCC 7.3.0 for x86 architecture (32-bit) and then stripped. Our results in both phases exceed CP-BCS on BLEURT-sum, ROUGE-L, and METEOR, which indicates higher usability.

TABLE II: Scores During the Two Training Phases
| Model | BLEURT-sum | BLEU | ROUGE-L | METEOR |
| --- | --- | --- | --- | --- |
| Phase1 | 74.17 | 18.06 | 38.56 | 24.69 |
| Phase2 | 72.74 | 22.61 | 41.19 | 25.37 |
| CP-BCS (Baseline) | 17.78 | 21.50 | 16.89 | 11.92 |

After completing the training step, we use MalP as the test set (to our knowledge, it is the only pseudocode summary dataset built for malware so far). We compare Malsight with existing pseudocode summarization methods and popular prompt-based general-purpose large models on MalP. In the actual experiment, we built three versions of MalP: not-stripped, demi-stripped (only the function name is stripped), and all-stripped (the function name and all identifiers inside the function are stripped).

The all-stripped version is the closest to real-world conditions; the results with it as the test set are shown in Table III.

TABLE III: Comparison with baseline work
| Model | BLEURT-sum | BLEU | ROUGE-L | METEOR | AVG time per function (s) | AVG summary length (words) | BLEURT-sum variance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BinT5 | 21.18 | 1.92 | 9.45 | 3.51 | 0.16 | 6.81 | 201.11 |
| HexT5 | 27.67 | 2.71 | 11.23 | 3.22 | 0.13 | 7.11 | 216.81 |
| WizardCoder-15B | 53.43 | 7.75 | 23.16 | 13.71 | 2.44 | 32.20 | 303.03 |
| Code Llama-7b | 56.00 | 8.52 | 24.55 | 14.95 | 2.24 | 31.32 | 290.16 |
| Code T5+ | 17.18 | 1.74 | 4.17 | 2.54 | 0.1743 | 7.2869 | 161.34 |
| WizardLM-2-7B | 55.81 | 5.33 | 18.46 | 15.61 | 22.07 | 61.74 | 265.63 |
| deepseek-llm-7b-chat | 50.88 | 7.07 | 20.08 | 12.70 | 7.41 | 25.48 | 306.78 |
| ChatGPT-3.5 | 60.09 | 9.96 | 25.19 | 16.54 | - | 25.27 | 296.81 |
| Malsight (Ours) | 62.14 | 9.80 | 25.11 | 16.87 | 2.51 | 35.27 | 131.65 |
* AVG time (function) is measured in seconds.
* AVG summary length is measured in words.

As shown in Table III, we evaluated Malsight, WizardLM [57], Code Llama [58], and other models using BLEURT-sum, BLEU, ROUGE-L, and METEOR (the usability of BLEURT-sum is substantiated in the referenced paper). We also measured model performance using criteria such as summary length, variance, and processing time, as these factors can directly impact algorithm-based evaluations (see Appendix I for details).

As previously mentioned, general-purpose large models, trained on extensive corpora, possess broader knowledge, providing an advantage in code summarization tasks. Among these, ChatGPT-3.5, a commercial large model, performed the best with a BLEURT-sum score of 60.09. Malsight’s innovative fine-tuning approach offers a significant advantage in summarizing malware code, with its annotation generation effectively bridging the knowledge gap between specialized and general-purpose models. Malsight achieved the highest scores among all methods in both BLEURT-sum and METEOR, which we consider two of the most reliable evaluation metrics.

This conclusion is corroborated by Figure 6, which shows that Malsight achieves better scores with less variance, i.e., more stable code summary output, on the stripped test data.

Figure 6: Data distribution. Malsight’s scores are stably concentrated in the high-score range, outperforming GPT-3.5.

6.2.2 Annotation Extraction Effect

The annotation extraction model was trained and tested on the AnnoS dataset to complete the sequence labeling (SL) task, which divides tokens into normal code (N-label), important function APIs (A-label), and important strings (I-label). The test produces the confusion matrix shown in Table IV.

TABLE IV: Annotation extraction confusion matrix
| Actual \ Predicted | N-label (Predicted) | A-label (Predicted) | I-label (Predicted) |
| --- | --- | --- | --- |
| N-label (Actual) | 1,847,647 | 9,345 | 11,721 |
| A-label (Actual) | 15,865 | 116,099 | 1,784 |
| I-label (Actual) | 21,639 | 1,688 | 38,933 |

The results show that the accuracy of the model is 96.99%, which is almost comparable to the results obtained by manual annotation.
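The accuracy figure can be checked directly from the counts in Table IV. The following is a minimal sketch, not the authors' evaluation code: the variable and function names are illustrative, and rows are assumed to be actual labels while columns are predicted labels, as the table header indicates.

```python
# Counts copied from Table IV; rows = actual label, columns = predicted label.
LABELS = ["N", "A", "I"]
MATRIX = [
    [1_847_647, 9_345, 11_721],   # actual N-label
    [15_865, 116_099, 1_784],     # actual A-label
    [21_639, 1_688, 38_933],      # actual I-label
]


def accuracy(matrix):
    """Fraction of tokens whose predicted label equals the actual label."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_precision_recall(matrix, labels):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    result = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_as_label = sum(row[i] for row in matrix)  # column sum
        actually_label = sum(matrix[i])                      # row sum
        result[label] = (true_positive / predicted_as_label,
                         true_positive / actually_label)
    return result


if __name__ == "__main__":
    print(f"accuracy = {accuracy(MATRIX):.4%}")
    for label, (p, r) in per_class_precision_recall(MATRIX, LABELS).items():
        print(f"{label}-label: precision = {p:.4f}, recall = {r:.4f}")
```

Because the N-label dominates the counts, overall accuracy is driven mostly by that class; the per-class precision and recall give a finer view of how the two rarer labels behave.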

6.3 RQ2. Ablation Experiment

To verify the usability of each module in Malsight, we conducted ablation experiments on the annotation module and the two phases of code summarization. We continued to use MalP as a test dataset, experimenting with different module combinations.

6.3.1 Module Ablation

As shown in Table V, the modules in Malsight complemented each other effectively. This ablation study was conducted on the not-stripped version of MalP to minimize result fluctuations due to the absence of dynamic annotation.

TABLE V: Ablation experiment
Model BLEURT-sum BLEU ROUGE-L METEOR
Full Model 66.29 10.95 26.97 18.80
w/o Annotation 60.05 7.65 22.76 14.93
w/o Phase 1 62.05 10.18 24.16 17.87
w/o Phase 1 and Annotation 56.48 7.35 21.99 13.99
w/o Phase 2 61.93 9.19 26.25 15.37
w/o Phase 2 and Annotation 59.13 7.42 26.72 13.32
TABLE VI: Performance on different levels of stripping
Stripping Levels BLEURT-sum BLEU ROUGE-L METEOR AVG time (function) AVG summary length BLEURT-sum variance
Not-Stripped 66.29 10.95 26.97 18.80 2.28 32.85 111.39
Not-Stripped w/o Annotation 60.05 7.65 22.76 14.93 2.01 29.12 143.62
Demi-Stripped 64.01 10.42 26.03 17.62 2.44 34.29 123.30
Demi-Stripped w/o Annotation 58.04 8.27 23.36 14.25 2.12 30.92 146.90
All-Stripped 62.14 9.80 25.11 16.88 2.51 35.27 131.65
All-Stripped w/o Annotation 54.98 7.82 22.31 13.36 2.16 32.40 150.75
* AVG time (function) is measured in seconds.
* AVG summary length is measured in words.

We designed five experiments to test different combinations for the code summarization task, as follows: (1) removing the annotation module and summarizing without any annotation, labeled as “w/o Annotation”; (2) removing phase 1, labeled as “w/o Phase 1”; (3) removing phase 1 and the annotation module, labeled as “w/o Phase 1 and Annotation”; (4) removing phase 2, labeled as “w/o Phase 2”; (5) removing phase 2 and the annotation module, labeled as “w/o Phase 2 and Annotation”.

The results indicate that different degrees of ablation have varying negative impacts on system performance. “w/o Phase 1 and Annotation” suffered a data shift and thus performed worse on the malware dataset than the other combinations, scoring 56.48 in BLEURT-sum. Notably, the absence of the annotation generation module (static annotation generation module) significantly affects performance, demonstrating its effectiveness.

6.3.2 Annotator vs. Stripping

We further explore how different levels of stripping affect the annotation module; the results are shown in Table VI. Experiments show that a higher degree of stripping causes a tolerable performance degradation in the absence of annotation, demonstrating the robustness of Malsight. Ablating the annotation module produced a 9.41% relative degradation (measured by the BLEURT-sum score) in the not-stripped test and nearly 12% in the all-stripped test, which shows that the annotator effectively counteracts the adverse conditions caused by stripping.
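The degradation percentages quoted above follow from the BLEURT-sum column of Table VI. As a minimal sketch (the dictionary layout and function name are illustrative, not part of Malsight), the relative drop at each stripping level can be recomputed as follows:

```python
# BLEURT-sum scores copied from Table VI: (with annotation, without annotation).
BLEURT_SUM = {
    "Not-Stripped": (66.29, 60.05),
    "Demi-Stripped": (64.01, 58.04),
    "All-Stripped": (62.14, 54.98),
}


def relative_drop(full_score, ablated_score):
    """Relative degradation, in percent, caused by ablating a module."""
    return (full_score - ablated_score) / full_score * 100.0


if __name__ == "__main__":
    for level, (full, ablated) in BLEURT_SUM.items():
        print(f"{level}: {relative_drop(full, ablated):.2f}% drop without annotation")
```

The not-stripped row gives 9.41% and the all-stripped row about 11.5%, matching the figures in the text.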

Figure 7: Annotation Ablation. The Annotation generation module adds more semantic information to sentence A, making it able to output function functionality smoothly, but after ablation, it cannot.

From the generated results, we selected relatively representative sentences, as shown in Figure 7. In the yellow box, the output from the generation module without annotations shows that the large model is highly susceptible to biases due to hallucination [59]. Based on our observations, hallucinations can cause the code summary model to make false guesses about the types of parameters and internally called functions, potentially misleading users. To mitigate this issue, we incorporate annotations, which not only enhance the model’s ability to summarize behavior, structure, and application scenarios but also reduce the nonsensical outputs caused by these hallucinations.

6.4 RQ3. Evaluation Algorithm Test

Figure 8: Relevant method test results for (a) BLEU, (b) METEOR, (c) word2vec, and (d) MoverScore. The evaluation effect of the two common algorithms and the ML-based evaluation methods is not ideal, and their positive and negative samples show a certain degree of crossover.

The EvaS dataset was constructed to evaluate our code summary evaluation method alongside other popular methods. As previously mentioned, we use both positive and negative samples to test the effectiveness of these evaluation methods by assessing their ability to distinguish between the two. The performance of existing methods, illustrated in Figure 8, shows varying degrees of crossover between positive and negative samples for BLEU, METEOR, word2vec, and MoverScore. These four methods represent the current approaches based on word-overlap measures (BLEU and METEOR) and word-embedding measures (word2vec and MoverScore), respectively.

In Figure 8, we established two types of decision boundaries: one orthogonal to the X-axis (i.e., distinguishing between positive and negative samples by setting a threshold), and the other a linear function forming a slanted line. The same procedure is applied to BLEURT-sum in Figure 9, which shows the superiority of our method.

Figure 9: BLEURT-sum performance. Compared with other methods, BLEURT-sum has significantly better ability to distinguish between positive and negative samples.

Specifically, we calculated the F1-score of these six methods (ROUGE-L is not shown in Figure 8) on the task of classifying code summary sentences as positive or negative samples. Our method achieved an F1-score exceeding 0.9999, significantly outperforming all other evaluation methods. Among the existing methods, METEOR performed best with an F1-score of 0.9811, while BLEU performed worst at only 0.85. None of the remaining methods achieved an F1-score above 0.95. Notably, our dataset was not meticulously curated; the negative samples were entirely random sentences with different meanings, unlike real-world scenarios where the differences in meaning might be subtler. An F1-score below 95% may therefore indicate an unacceptable tendency to misclassify: with a high recall rate, a 95% F1-score means that approximately 1 in 10 samples judged positive is incorrect. Because an evaluation method is used to assess models in subsequent work, such misjudgments can be further amplified in the results it produces.
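The "1 in 10" figure can be made concrete by inverting the F1 definition. The sketch below assumes the usual definition F1 = 2PR / (P + R) and a recall close to 1; the function name is illustrative.

```python
def precision_from_f1(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P, given F1 and recall R."""
    denominator = 2.0 * recall - f1
    if denominator <= 0.0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denominator


if __name__ == "__main__":
    precision = precision_from_f1(f1=0.95, recall=1.0)
    # Share of positive judgments that are wrong (false discovery rate).
    print(f"precision = {precision:.4f}, wrong positives = {1.0 - precision:.4f}")
```

With perfect recall, an F1-score of 0.95 corresponds to a precision of 0.95 / 1.05, roughly 0.905, so about 9.5% of the positive judgments are wrong.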

6.5 RQ4. Real World Experiments

6.5.1 Human Evaluation Experiments

So far, we have evaluated Malsight’s performance at the function level. However, this approach did not allow us to fully showcase the capabilities of the dynamic annotation module. Additionally, real-world malware predominantly takes the form of executable files. Hence, we manually analyzed 10 real-world malware samples and selected 79 critical functions from them; three experienced reverse engineers wrote summaries for these functions through discussion. The complete workflow of Malsight was applied to these functions, and the outputs were evaluated using both BLEURT-sum and human evaluation metrics. We further calculated the variance of all results as a reference to demonstrate the stability of our solution, as well as the time required to summarize a single malware sample.

For the human evaluation metrics, we invited ten evaluators: five reverse engineers rich in experience, three experienced ones, and two beginners. They were asked to focus on the usability of the code summarization results and to provide a score of 0, 0.5, or 1, indicating whether the summary corresponded to the original code. It is noteworthy that, due to the subjectivity of human evaluation, different evaluators may have varying opinions on the summary of the same code. Therefore, we opted to evaluate usability only and did not solicit more detailed scores beyond 0, 0.5, and 1, which correspond to unusable, partially usable, and usable, respectively.

6.5.2 Performance From Different Evaluators

After collecting the three groups of scores, we obtained the human assessment scores shown in Table VII. Experienced evaluators tend to give more conservative and stable scores, while less experienced evaluators give a wider range of scores with higher variance. We take the arithmetic average of the scores given by all evaluators as the final human assessment score.

TABLE VII: Evaluation of Different Categories of Evaluators
Rich in experience Experienced Beginner
Score 58.14 59.51 64.73
Variance 173.12 197.49 237.44

Table VIII shows the mean BLEURT-sum score and the mean human evaluation score on the 79 labeled functions. The human evaluation score of 59.87 means that, on average, the functions exceed 0.5, i.e., the partially usable standard.
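The final human assessment score of 59.87 is consistent with averaging over all ten evaluators, i.e., weighting each group mean in Table VII by its group size. The sketch below rests on two assumptions: the group sizes (5, 3, 2) are taken from Section 6.5.1, and each Table VII score is treated as the mean over that group's evaluators. It is a plausibility check, not the authors' script.

```python
# (group mean score from Table VII, number of evaluators from Section 6.5.1)
GROUPS = {
    "Rich in experience": (58.14, 5),
    "Experienced": (59.51, 3),
    "Beginner": (64.73, 2),
}


def per_evaluator_mean(groups):
    """Arithmetic mean over individual evaluators, recovered from group means."""
    total_score = sum(mean * count for mean, count in groups.values())
    total_evaluators = sum(count for _, count in groups.values())
    return total_score / total_evaluators


if __name__ == "__main__":
    print(f"final human assessment score = {per_evaluator_mean(GROUPS):.2f}")
```

An unweighted mean of the three group scores would instead give about 60.79, so the reported value matches the per-evaluator average.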

TABLE VIII: Real Malware Performance
BLEURT-sum Human Assessment Score AVG time (file)
Score 47.22 59.87 1.90
Variance 161.61 0.06 1.21
* AVG time (per file) is measured in hours.
Figure 10: Real-world Performance. While the ratio of human assessment score to BLEURT-sum remained positive, Malsight is able to obtain higher assessment results.

Further, we formed a data point for each function by pairing the BLEURT-sum score with the human evaluation score of its code summary, and plotted the distribution of the resulting 79 data points in Figure 10. The plot indicates that BLEURT-sum has a positive linear correlation with the human evaluation metric, which supports the rationality of BLEURT-sum. Furthermore, the distribution shows that, across the two distinct evaluation metrics, most evaluation scores for Malsight’s outputs are concentrated in the middle to high range, which shows that our scheme is also usable in the real world.

7 Case Study

In Section 6.5 (RQ4), we evaluated Malsight’s performance on 10 real-world malware samples. To provide more direct evidence and to analyze why Malsight behaves differently on different functions, this section presents three functions and discusses how Malsight arrives at an output that is usable, an output that is partially usable, and an output that is unusable.

Figure 11: Redacted pseudocode in Case 1.

Case 1: sub_401021.

  • Malsight Summary: Set privilege, the code snippet is difficult to ascertain the function’s purpose, but it likely retrieves or obtains professional information or status.

  • Human Summary: Escalating the privileges of the virus process.

Malsight’s summary of the function shown in Figure 11 is considered usable. In this case, Malsight’s summary accurately identifies “setting privileges” as the key point of a lengthy function, which aligns with the task of “privilege escalation” in the human summary.

Figure 12: Redacted pseudocode in Case 2.

Case 2: sub_401D30.

  • Malsight Summary: Convert a string to a wide character, the code defines a function that performs string operations, handles exceptional conditions, and creates new objects, potentially Get the server name, the code defines a function that opens a WinHttp connection, connects to a server, and handles error messages and string operations Get the next character from the string, the code defines a function sub_401470 that performs string operations, handles exceptional conditions, and Delete the server.

  • Human Summary: Connect to a malicious domain and make a request to retrieve data using HTTP.

Malsight’s summary of the function shown in Figure 12 is considered partially usable. Malsight summary identifies that the function uses an HTTP connection to a specific server and operates on the data, which is consistent with the human summary. However, due to the lack of some background information, it fails to recognize that the specific server is malicious.

Figure 13: Redacted pseudocode in Case 3.

Case 3: TimerFunc.

  • Malsight Summary: Get the command line, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules, Call the NtTib. ExceptionList, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Copy the setup.exe, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules Setup the setup, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules.

  • Human Summary: Checking if the autorun.ini file exists; if not, create it and configure it to auto-run setup.exe as specified in the configuration file.

Malsight’s summary of the function shown in Figure 13 is considered unusable. The function is very long, contains numerous function calls, and includes many redundant functions that hinder analysis, resulting in a poor Malsight summary.

8 Discussion

8.1 Ethics

The construction of the MalS and MalP datasets complies with GitHub’s open-source license agreements. We only used repositories with GitHub open-source licenses and downloaded their code within the rate limit set by GitHub to minimize interference with GitHub’s servers. Additionally, we enlisted volunteers to review the MalS and MalP datasets, as well as the 10 malware samples mentioned in RQ4. With the volunteers’ consent, we adopted their review results as the final dataset.

8.2 Limitations

In this section, we delve into practical issues based on our analysis of the experimental results. Additionally, we explore aspects not covered in our work and propose potential solutions.

Real-World Malware vs. Malware Function Summaries: In our exploration of real-world malware, the BLEURT-sum evaluation score of the code summaries was approximately 30% lower than the experimental score obtained on MalP in Section 6.2.

The reason could be that our code summarization model lacks the ability to summarize longer functions which occasionally appear in real-world malware. Our analysis shows that about 15% of the functions in the malware contain more than 1000 tokens (while the majority of the functions in the training set stay below 300 tokens), which is likely due to different optimization configurations during compilation. These lengthy function bodies introduce a lot of information and noise, which makes it difficult for the model to extract and summarize the critical code fragments.

Therefore, to better align with real malicious code summaries, researchers should consider doing code summarization work in units of code fragments instead of functions. However, identifying and summarizing important, human-interpretable segments in them requires large amounts of labeled data, or compositional functional insights from dynamic debugging, which poses significant challenges.

Summary for assembly language rather than pseudocode: Efforts have been made to recover information from assembly code and to generate function summaries [60]. A common perspective is that decompiled assembly code suffers varying degrees of information loss during the generation of pseudocode as Intermediate Representation (IR), depending on the disassembly algorithm used. Thus, starting directly from assembly language is considered a viable solution.

In our comparison of assembly code and pseudocode representations, we found that assembly code is challenging for models pre-trained on high-level languages to understand [32]. The structural features of assembly code are completely different from those of high-level languages. Fine-tuning a model using pseudocode can confuse the model’s understanding of function-level structural features, leading to unacceptable error output, possibly due to insufficient datasets.

A possible direction is to train an embedding model specifically for assembly language and to build a large malware dataset at the assembly language level. This method has the potential to outperform general-purpose large models, which typically have a poor understanding of assembly code.

9 Conclusion

In this paper, we introduced Malsight, a novel framework for binary malware summarization that explores malicious source code and benign pseudocode. The proposed Malsight involves two datasets, MalS and MalP, an LLM-based summary model, and an evaluation metric. Experimental results on three datasets show the effectiveness of the proposed framework. Future work includes applying the proposed framework to more downstream tasks.

References

  • [1] “Malware statistics & trends report — av-test,” https://www.av-test.org/en/statistics/malware/, accessed: 2024-06-2.
  • [2] X. Shang, S. Cheng, G. Chen, Y. Zhang, L. Hu, X. Yu, G. Li, W. Zhang, and N. Yu, “How far have we gone in stripped binary code understanding using large language models,” arXiv preprint arXiv:2404.09836, 2024.
  • [3] A. Jain, S. Soner, and A. Gadwal, “Reverse engineering: Journey from code to design,” in Proc. of ICECT, 2011.
  • [4] F. A. Aboaoja, A. Zainal, F. A. Ghaleb, B. A. S. Al-rimy, T. A. E. Eisa, and A. A. H. Elnour, “Malware detection issues, challenges, and future directions: A survey,” Applied Sciences, vol. 12, no. 17, 2022.
  • [5] C. Beaman, A. Barkworth, T. D. Akande, S. Hakak, and M. K. Khan, “Ransomware: Recent advances, analysis, challenges and future research directions,” Computers & Security, vol. 111, p. 102490, 2021.
  • [6] M. Yao, J. Fuller, R. P. Sridhar, S. Agarwal, A. K. Sikder, and B. Saltaformaggio, “Hiding in plain sight: an empirical study of web application abuse in malware,” in Proc. of USENIX Security, 2023.
  • [7] D. Gibert, C. Mateu, and J. Planes, “Hydra: A multimodal deep learning framework for malware classification,” Computers & Security, vol. 95, p. 101873, 2020.
  • [8] R. Labaca-Castro, B. Biggio, and G. Dreo Rodosek, “Poster: Attacking malware classifiers by crafting gradient-attacks that preserve functionality,” in Proc. of CCS, 2019.
  • [9] A. Marcelli, M. Graziano, X. Ugarte-Pedrero, Y. Fratantonio, M. Mansouri, and D. Balzarotti, “How machine learning is solving the binary function similarity problem,” in Proc. of USENIX Security, 2022.
  • [10] Z. Yu, R. Cao, Q. Tang, S. Nie, J. Huang, and S. Wu, “Order matters: Semantic-aware neural networks for binary code similarity detection,” in Proc. of AAAI, 2020.
  • [11] Z. Liu, “Binary code similarity detection,” in Proc. of ASE, 2021.
  • [12] X. Deng and J. Mirkovic, “Malware analysis through high-level behavior,” in Proc. of CSET, 2018.
  • [13] B. Cornelissen, A. Zaidman, A. Van Deursen, L. Moonen, and R. Koschke, “A systematic survey of program comprehension through dynamic analysis,” IEEE Transactions on Software Engineering, vol. 35, no. 5, pp. 684–702, 2009.
  • [14] M. Kim, H. Cho, and J. H. Yi, “Large-scale analysis on anti-analysis techniques in real-world malware,” IEEE Access, pp. 75 802–75 815, 2022.
  • [15] H. Rays. State-of-the-art binary code analysis solutions. [Online]. Available: https://www.hex-rays.com/products/ida/
  • [16] NationalSecurityAgency. Ghidra is a software reverse engineering (sre) framework. [Online]. Available: https://github.com/NationalSecurityAgency/ghidra
  • [17] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker, “Towards automatically generating summary comments for java methods,” in Proceedings of the 25th IEEE/ACM international conference on Automated software engineering, 2010, pp. 43–52.
  • [18] P. W. McBurney and C. McMillan, “Automatic documentation generation via source code summarization of method context,” in Proceedings of the 22nd International Conference on Program Comprehension, 2014, pp. 279–290.
  • [19] A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in Proc. of SANER, 2023.
  • [20] J. Xiong, G. Chen, K. Chen, H. Gao, S. Cheng, and W. Zhang, “Hext5: Unified pre-training for stripped binary code information inference,” in Proc. of ASE, 2023.
  • [21] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” in Proc. of ICLR, 2023.
  • [22] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, “Wizardcoder: Empowering code large language models with evol-instruct,” arXiv preprint arXiv:2306.08568, 2023.
  • [23] Y. Wang, H. Le, A. Gotmare, N. Bui, J. Li, and S. Hoi, “CodeT5+: Open code large language models for code understanding and generation,” in Proc. of EMNLP, 2023.
  • [24] A. Mastropaolo, M. Ciniselli, M. Di Penta, and G. Bavota, “Evaluating code summarization techniques: A new metric and an empirical characterization,” in Proc. of ICSE, 2024.
  • [25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proc. of ACL, 2002.
  • [26] A. Lavie and M. Denkowski, “The meteor metric for automatic evaluation of machine translation,” Machine Translation, vol. 23, pp. 105–115, 2009.
  • [27] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. of ACL, 2004.
  • [28] Fortinet. What is malware analysis? types and stages of malware analysis. [Online]. Available: https://www.fortinet.com/resources/cyberglossary/malware-analysis
  • [29] K. Yakdan, S. Dechand, E. Gerhards-Padilla, and M. Smith, “Helping johnny to analyze malware: A usability-optimized decompiler and malware analysis user study,” in Proc. of SP, 2016.
  • [30] Hex-Rays. Decompilation vs disassembly. [Online]. Available: https://hex-rays.com/decompiler/decompilation_vs_disassembly/
  • [31] R. Team. Radare2: a reverse engineering framework. [Online]. Available: https://github.com/radareorg/radare2
  • [32] H. Tan, Q. Luo, J. Li, and Y. Zhang, “Llm4decompile: Decompiling binary code with large language models,” arXiv preprint arXiv:2403.05286, 2024.
  • [33] K. Pal, A. Bajaj, P. Banerjee, A. Dutcher, M. Nakamura, Z. Basque, H. Gupta, S. Sawant, U. Anantheswaran, Y. Shoshitaishvili, A. Doupe, C. Baral, and R. Wang, “‘len or index or count, anything but v1’: Predicting variable names in decompilation output with transfer learning,” in Proc. of SP, 2024.
  • [34] Microsoft Corporation. Main() and command-line arguments - C#. [Online]. Available: https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/program-structure/main-command-line
  • [35] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in binaries with neural networks,” in Proc. of USENIX Security, 2015.
  • [36] D. M. Berris, A. Veitch, N. Heintze, E. Anderson, and N. Wang, “Xray: A function call tracing system,” Technical report, 2016. A white paper on XRay, a function call tracing system developed at Google, 2016.
  • [37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. of NAACL-HLT, 2019.
  • [38] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
  • [39] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Proc. of EMNLP, 2020.
  • [40] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proc. of EMNLP, 2021.
  • [41] J. Jiang, Y. Shu, J. Wang, and M. Long, “Transferability in deep learning: A survey,” 2022.
  • [42] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” in Proc. of ICPC, 2022.
  • [43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [44] W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger, “MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance,” in Proc. of EMNLP-IJCNLP, 2019.
  • [45] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, “From word embeddings to document distances,” in Proc. of ICML, 2015.
  • [46] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” CoRR, vol. abs/1802.05365, 2018.
  • [47] T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning robust metrics for text generation,” in Proc. of ACL, 2020.
  • [48] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in Proc. of DLS, 2021.
  • [49] M. O. F. Rokon, R. Islam, A. Darki, E. E. Papalexakis, and M. Faloutsos, “SourceFinder: Finding malware Source-Code from publicly available repositories in GitHub,” in Proc. of RAID, 2020.
  • [50] W. Zhu, Z. Feng, Z. Zhang, J. Chen, Z. Ou, M. Yang, and C. Zhang, “Callee: Recovering call graphs for binaries with transfer and contrastive learning,” in Proc. of SP, 2023.
  • [51] R. Tarjan, “Depth-first search and linear graph algorithms,” in Proc. of SWAT, 1971.
  • [52] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numer. Math., vol. 1, no. 1, pp. 269–271, 1959.
  • [53] GitHub. Github code search. [Online]. Available: https://docs.github.com/en/github/searching-for-information-on-github/searching-code
  • [54] ——. The technology behind github’s new code search. [Online]. Available: https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/
  • [55] T. Ye, L. Wu, T. Ma, X. Zhang, Y. Du, P. Liu, S. Ji, and W. Wang, “CP-BCS: Binary code summarization guided by control flow graph and pseudo code,” in Proc. of EMNLP, 2023.
  • [56] OpenAI. ChatGPT: Optimizing Language Models for Dialogue. [Online]. Available: https://chat.openai.com/
  • [57] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang, “WizardLM: Empowering large pre-trained language models to follow complex instructions,” in Proc. of ICLR, 2024.
  • [58] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2024.
  • [59] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
  • [60] X. Li, Y. Qu, and H. Yin, “Palmtree: Learning an assembly language model for instruction embedding,” in Proc. of CCS, 2021.

Appendix A Analysis of Traditional Similarity Evaluation Metric

The criteria for evaluating sentence similarity should focus on two key aspects: similar sentences should correspond to higher scores, while dissimilar sentences should correspond to lower scores. Below, we introduce some of the shortcomings of BLEU and other traditional metrics (ROUGE, METEOR) in these aspects.

A.1 Limitations of Overlap Based Methods

Problems caused by BLEU algorithm: The formula BLEU uses to calculate sentence similarity is shown in equation 17:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_{n} \ln(Precision_{n})\right) \tag{17}$$

Where $BP$ is a brevity penalty factor, $w_{n}$ is the weight of the $n$-gram, and $Precision_{n}$ is the precision of the generated candidate sentence. Specifically, equation 18 shows the construction of $Precision_{n}$, where $candidate\&reference$ is the number of overlapping occurrences between the candidate sentence and the reference sentence.

$$Precision_{n} = \frac{len(candidate\&reference) + 1}{len(candidate) + 1} \tag{18}$$

In this context, the Add-One Smoothing method is used to avoid zero-count problems when the n-gram size is large. However, this introduces another issue: even if the candidate sentence and reference sentence are completely unrelated, this smoothing method still produces a certain score. This effect is particularly noticeable in the case of short sentences.

We conducted experimental tests for this scenario, testing each reference-candidate pair within the sentence length range of [1,30]. Each sentence pair had zero word overlap, and our results are shown in Figure 14.

Figure 14: BLEU Score When Zero Overlap. When two sentences have no overlap, due to the structural flaws in the BLEU algorithm, they still receive a score, indicating a bias toward shorter sentences.

Even when sentence pairs are completely mismatched, BLEU scores greater than 0.3 can occur for shorter sentences. This significant deviation from reality indicates that BLEU’s scoring is distorted for short sentences in some cases.
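The effect can be illustrated by applying equations 17 and 18 literally to sentence pairs with zero overlap. The sketch below makes several assumptions that are not stated in the text: uniform weights $w_n = 1/N$ with $N = 4$, candidate and reference of equal length so that $BP = 1$, and add-one smoothing applied to every n-gram order. Library implementations of BLEU differ in their smoothing details, so the exact values are only indicative.

```python
import math


def smoothed_precision(overlap, candidate_ngrams):
    """Add-one smoothed n-gram precision, as written in equation 18."""
    return (overlap + 1) / (candidate_ngrams + 1)


def bleu_zero_overlap(length, max_n=4):
    """BLEU (equation 17) for an equal-length sentence pair sharing no words."""
    weight = 1.0 / max_n                        # assumed uniform weights
    log_sum = 0.0
    for n in range(1, max_n + 1):
        ngram_count = max(length - n + 1, 0)    # n-grams in the candidate
        log_sum += weight * math.log(smoothed_precision(0, ngram_count))
    brevity_penalty = 1.0                       # equal lengths, so no penalty
    return brevity_penalty * math.exp(log_sum)


if __name__ == "__main__":
    for length in range(1, 31):
        print(f"length {length:2d}: BLEU = {bleu_zero_overlap(length):.4f}")
```

Under these assumptions a pair of four-word sentences with no shared words still scores about 0.30, and the score decays as the sentences grow longer, which mirrors the short-sentence bias described above.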

ROUGE & METEOR, the flaw of calculating similarity with words as the basic unit: ROUGE and METEOR share common ground in how they construct their sentence similarity evaluation algorithms. ROUGE follows equation 19.

$$ROUGE\text{-}L = F_{LCS} = \frac{(1+\beta^{2}) R_{LCS} P_{LCS}}{R_{LCS} + \beta^{2} P_{LCS}} \tag{19}$$

where $R_{LCS}=\frac{LCS(C,S)}{len(S)}$, $P_{LCS}=\frac{LCS(C,S)}{len(C)}$, and $\beta$ weights recall relative to precision. $LCS(C,S)$ is the length of the longest common subsequence of the two strings $C$ and $S$. Intuitively, the more words two sentences share in the same order, the higher their ROUGE score.
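Equation 19 can be sketched in a few lines. The function names `lcs_length` and `rouge_l` and the default $\beta = 1$ are illustrative assumptions; sentences are assumed to be pre-tokenized into word lists.

```python
def lcs_length(c, s):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(s) + 1)
    for x in c:
        curr = [0]
        for j, y in enumerate(s, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score following Equation 19 (beta=1 is an assumed default)."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)      # R_LCS
    precision = lcs / len(candidate)   # P_LCS
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)


score = rouge_l("the cat sat".split(), "the cat sat on mat".split())
print(score)
```

In this example the common subsequence has length 3, so $R_{LCS}=3/5$ and $P_{LCS}=1$, and with $\beta=1$ the score is $1.2/1.6=0.75$.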

METEOR builds on ROUGE and follows Equations 20, 21, and 22.

$$F_{mean} = \frac{(1+\beta^{2})\,P\,R}{R+\beta P} \tag{20}$$
$$Penalty = \gamma\left(\frac{chunks}{unigrams\_matched}\right)^{\theta} \tag{21}$$
$$METEOR = F_{mean}\,(1-Penalty) \tag{22}$$

where $P=\frac{n}{len(candidate)}$, $R=\frac{n}{len(reference)}$, and $n$ is the number of words shared by the candidate sentence and the reference sentence. METEOR employs exact matching, stem matching, and WordNet-based synonym matching to address the issue of words with identical meanings not being recognized as overlaps. Consequently, METEOR outperformed ROUGE in our experiments, ranking second only to BLEURT-sum. However, algorithms that rely solely on word overlap can still misjudge due to structural similarities in sentences or phrases. For instance, in code summarization, sentences might share terms like "function", "aims", or "code", or convey the same idea using different wording, such as "compare two sentences" versus "bitwise and return true/false".
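The misjudgment caused by shared boilerplate words can be sketched as follows. This is a simplified illustration of Equations 20 to 22, not the full METEOR implementation: it uses exact matching only (no stemming or WordNet synonyms), a greedy left-to-right alignment, and parameter defaults ($\beta=1$, $\gamma=0.5$, $\theta=3$) that are our assumptions. The function name `meteor_exact` is hypothetical.

```python
def meteor_exact(candidate, reference, beta=1.0, gamma=0.5, theta=3.0):
    """Simplified METEOR score using exact token matches only."""
    # Greedy alignment: each candidate token takes the first unused
    # reference position holding the same token.
    used = set()
    alignment = []  # pairs (candidate position, reference position)
    for i, token in enumerate(candidate):
        for j, ref_token in enumerate(reference):
            if j not in used and token == ref_token:
                used.add(j)
                alignment.append((i, j))
                break
    n = len(alignment)
    if n == 0:
        return 0.0
    precision = n / len(candidate)
    recall = n / len(reference)
    # Equation 20, as written in the text.
    f_mean = (1 + beta ** 2) * precision * recall / (recall + beta * precision)
    # A chunk is a maximal run of matches adjacent in both sentences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = gamma * (chunks / n) ** theta   # Equation 21
    return f_mean * (1 - penalty)             # Equation 22


# Two summaries describing different behaviors but sharing boilerplate words.
cand = "the function aims to encrypt files".split()
ref = "the function aims to delete logs".split()
boilerplate_score = meteor_exact(cand, ref)
print(boilerplate_score)
```

Here four of the six words match in a single chunk, so the two summaries receive a score above 0.6 even though they describe unrelated behaviors.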

Interestingly, we found that expanding the stopword list appropriately can enhance the performance of word overlap-based methods, especially METEOR.

A.2 Limitations of Embedding Based Methods

In recent years, using word embedding-based methods such as word2vec to determine sentence similarity has become quite popular. For instance, the pseudocode for the word2vec process is illustrated in Algorithm 1.

Algorithm 1 word2vec
Input: the reference sentence $ref$ and the candidate sentence $cad$
Output: the similarity $score$ of $ref$ and $cad$
function Main($cad$, $ref$)
    cad.vector = getVector($cad$)
    ref.vector = getVector($ref$)
    $\text{score} = \frac{\text{cad} \cdot \text{ref}}{\|\text{cad}\| \cdot \|\text{ref}\|}$
    return score
end function
function getVector($sentence$)
    words = tokenize(sentence)
    sentence.vector = [0, 0, ...]
    for $\text{word} \in \text{words}$ do
        sentence.vector += word.vector
    end for
    return sentence.vector
end function

When calculating the similarity between two sentences, word2vec adds the vectors of each word based on the tokenized results and uses the summed vector as the sentence representation. Finally, cosine similarity is employed to determine the similarity between the sentences. A direct drawback of this approach is that it loses all word order information and heavily depends on the accuracy of the embedding algorithm.
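The loss of word order can be seen in a minimal sketch of Algorithm 1. The three-dimensional vectors in `EMBEDDINGS` are made-up toy values standing in for a trained word2vec model, and the names `get_vector` and `similarity` are hypothetical.

```python
import math

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "man": [0.1, 0.9, 0.0],
    "bites": [0.0, 0.2, 0.9],
}


def get_vector(sentence):
    """Sum the word vectors of a whitespace-tokenized sentence."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for word in sentence.lower().split():
        vec = EMBEDDINGS.get(word)
        if vec is None:
            continue  # out-of-vocabulary words are skipped in this sketch
        total = [t + x for t, x in zip(total, vec)]
    return total


def similarity(cad, ref):
    """Cosine similarity between the summed sentence vectors."""
    a, b = get_vector(cad), get_vector(ref)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


sim = similarity("dog bites man", "man bites dog")
print(sim)
```

Because vector addition is commutative, the two sentences map to the same summed vector and their cosine similarity is 1 (up to floating-point rounding), although their meanings are opposite.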

Appendix B Reverse CFG Sorting Algorithm

Algorithm 2 REsort
Input: The CFG graph $G_M$ for a malware binary file.
Output: The reverse topsort list $L_{G_M}$ of the function call graph.
function REsort($G_M$)
    for each vertex $v$ in $G_M.vertices$ do
        if $!v.seen$ then
            new $G_M'$
            $G_M'.vertices \leftarrow$ call Tarjan($v$, $G_M$)
            // Tarjan() traverses all strongly connected components in $G_M$ and contracts vertices as $G_M'.vertices$.
        end if
    end for
    $G_M' \leftarrow$ call BuildTarGraph($G_M'.vertices$, $G_M$)
    // BuildTarGraph() builds a new directed acyclic graph $G_M'$ using $G_M'.vertices$ and $G_M.edges$.
    $L_{G_M'} \leftarrow$ call RetopSort($G_M'$)
    // RetopSort() obtains the reverse topological sorting sequence of $G_M'$.
    $dist[] \leftarrow$ call Dijkstra($G_M$)
    // Dijkstra() computes the multi-source shortest paths in graph $G_M$ from vertices with an in-degree of 0 to each vertex.
    $L_{G_M} \leftarrow []$
    for each vertex $v'$ in $L_{G_M'}$ do
        $mindist \leftarrow \infty$, $idx \leftarrow -1$
        // $idx$ determines the starting vertex for DFS in each strongly connected component in $G_M$.
        for each $ver$ in $v'.subvertex$ do
            if $dist[ver] < mindist$ then
                $mindist \leftarrow dist[ver]$, $idx \leftarrow ver$
            end if
        end for
        $L_{G_M}$.append(DFS($idx$, $v'$))
        // DFS() derives a traversal order in each strongly connected component in $G_M$ as part of the final reverse topological sorting.
    end for
end function
TABLE IX: Performance under different training data proportions
| Proportion | BLEURT-sum | BLEU | ROUGE-L | METEOR | AVG summary length | BLEURT-sum variance |
|---|---|---|---|---|---|---|
| 1:1 | 59.12 | 8.98 | 23.04 | 16.14 | 39.83 | 181.01 |
| 1:2 | 59.81 | 9.15 | 23.50 | 16.42 | 40.75 | 165.51 |
| 3:4 | 59.53 | 8.68 | 22.95 | 16.39 | 42.21 | 173.39 |
| 4:3 | 62.14 | 9.80 | 25.11 | 16.88 | 35.27 | 131.65 |
* Due to the limited size of MalS, the model cannot be fine-tuned with a training data proportion of 2:1.
* AVG summary length is measured in words.

Algorithm 2 presents our function topological sorting algorithm, whose objective is to construct the reverse topological ordering of the function call graph $G_M$. Because $G_M$ may contain cycles (which form strongly connected components), reverse topological sorting cannot be applied to it directly. To address this, we use the Tarjan algorithm to contract $G_M$ into a new directed acyclic graph $G_M'$, on which reverse topological sorting yields an initial function traversal list $L_{G_M'}$, where each node represents a strongly connected component of $G_M$.

Further, to obtain an ordering for $G_M$ itself, we employ the DFS algorithm to order the nodes within each strongly connected component. Since our aim is to reverse-sort the entire graph $G_M$, with functions closer to the start function placed towards the end of $L_{G_M}$, we use the Dijkstra algorithm to compute the multi-source shortest paths $dist[]$ from all nodes with zero in-degree to the other nodes (referred to as "multi-source" because the function call graph $G_M$ is not necessarily connected). Subsequently, traversing $L_{G_M'}$ sequentially, we select for each strongly connected component the node with the minimum $dist$ value as the initial node $idx$ for DFS. Owing to the inherent stack property of DFS, $idx$ is placed after the other nodes of its component in $L_{G_M}$. Finally, $L_{G_M}$ represents the desired function traversal order.
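The procedure can be sketched compactly under a few simplifying assumptions, which are ours and not part of Algorithm 2. The call graph is a dictionary mapping each function name to its list of callees. Tarjan's algorithm is written recursively and emits components in reverse topological order of the contracted graph, so no separate BuildTarGraph and RetopSort step is needed. Since call edges are unweighted, a multi-source BFS replaces Dijkstra. All function names are hypothetical.

```python
from collections import deque


def tarjan_sccs(graph):
    """Strongly connected components, emitted in reverse topological order."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def shortest_call_depth(graph):
    """Multi-source BFS distances from all functions with zero in-degree."""
    nodes = set(graph) | {w for ws in graph.values() for w in ws}
    has_caller = {w for ws in graph.values() for w in ws}
    dist = {v: 0 for v in nodes - has_caller}
    queue = deque(dist)
    while queue:
        v = queue.popleft()
        for w in graph.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def resort(graph):
    """Reverse topological traversal order: callees before callers."""
    dist = shortest_call_depth(graph)
    order = []
    for scc in tarjan_sccs(graph):
        members = set(scc)
        start = min(scc, key=lambda v: dist.get(v, float("inf")))
        seen = set()

        def dfs(v):
            seen.add(v)
            for w in graph.get(v, []):
                if w in members and w not in seen:
                    dfs(w)
            order.append(v)  # post-order: the start node is appended last

        dfs(start)
    return order


# Toy call graph with a cycle between c and d.
call_graph = {"main": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": ["c"]}
order = resort(call_graph)
print(order)
```

On this toy graph the mutually recursive functions `c` and `d` are visited first, with `c` (the one closer to `main`) placed after `d`, and `main` comes last. For very deep call graphs the recursive functions would need an iterative rewrite to avoid Python's recursion limit.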

Appendix C Performance under Different Proportions of Two-Phase Fine-Tuning Training Data

We sampled a total of 140,000 examples in different proportions from the two datasets (MalS and BenignC) for the two phases of fine-tuning. The evaluation results for each training data proportion are shown in Table IX. For the experiments in the main text, we chose the 4:3 ratio, which performed best, as the actual training set ratio of Malsight.