Malsight: Exploring Malicious Source Code and Benign Pseudocode for Iterative
Binary Malware Summarization

Haolang Lu^§, Hongrui Peng^§, Guoshun Nan^*, Jiaoyang Cui, Cheng Wang, Weifei Jin
Beijing University of Posts and Telecommunications, Beijing, China
lhl_2507@bupt.edu.cn, penghongruif@bupt.edu.cn, nanguo2021@bupt.edu.cn,
skyboard@bupt.edu.cn, wang.me@bupt.edu.cn, weifeijin@bupt.edu.cn

Abstract

Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored.

To this end, we propose Malsight, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summary datasets, MalS and MalP, using an LLM and manually refine them with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS dataset and a benign pseudocode dataset. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed Malsight. Notably, our proposed MalT5, with only 0.77B parameters, delivers performance comparable to the much larger ChatGPT3.5.

Index Terms:

Malware, Code Summarization, Binary Code

1 Introduction

The AV-TEST Institute [1] recently reported that over 450,000 new malicious files and potentially unwanted applications are registered daily, showing a high demand for malware understanding. Binary malware summarization [2] is a reverse engineering [3] task that aims to automatically generate concise human-readable descriptions of binary executable malicious files. The summarization provides security analysts with a quick understanding of the malware’s functionality and patterns when source code is unavailable, thereby benefiting a wide range of applications such as malware cracking [4] [5] [6], malware family classification [7] [8], binary code similarity detection [9] [10] [11], and large-scale malware behavior analysis [12] [13] [14].

Figure 1: The comparison of source code (left) and its pseudocode (right). The pseudocode includes significantly more content and a more complex structure, and it also strips key semantic cues such as function names.

Existing reverse engineering tools, such as IDA [15] and Ghidra [16], can decompile executables into higher-level C-like pseudocode, but the output still lacks easy-to-understand semantic information. Consequently, a line of work attempts to generate human-readable summaries from pseudocode. Early studies rely on manual parsing or rule-based summary generation [17] [18]. Recent large language models (LLMs), such as BinT5 [19], HexT5 [20], CodeGen [21], and WizardCoder [22], have shown great potential to produce more informative summaries. However, these data-driven approaches still face critical issues, including poor usability, inaccurate explanations, and incomplete summaries [2]. Figure 1 shows the underlying reasons for these issues by comparing the source code of the function “initLevel” with the corresponding pseudocode. We observe that the pseudocode 1) contains significantly more content, growing from 20 lines in the source to 117 lines; 2) has a more complex and obscure structure with multi-level nesting and entangled logic, involving 29 more calls and 29 more if statements than the source code on the left; and 3) strips key semantic cues such as variable names and function names. For example, the function “initLevel” in the source code is turned into the meaningless symbol “sub_404018”.

To address the above challenges, we present Malsight, a novel binary malware summarization framework that can iteratively generate descriptions of executable malware by exploring malicious source code and benign pseudocode. The proposed Malsight involves three key ingredients, including a malware dataset MalS, an LLM-based malware summarization model MalT5, and an evaluation metric BLEURT-sum. We describe the workflow of the proposed framework in four steps as follows.

Constructing MalS: As an LLM-based summarization model relies heavily on high-quality annotations to align with domain-specific knowledge, high-quality malware pseudocode summaries are needed to fine-tune the LLM. However, no public malware pseudocode summarization dataset is available so far, and building such a benchmark is quite challenging as it requires substantial human involvement for accurate annotations. Figure 1 illustrates three challenges of understanding malware pseudocode. To tackle this issue, we instead construct MalS, a large-scale summarization dataset, using an LLM and malicious C language source code crawled from GitHub. The proposed MalS involves nearly 90,000 malware source functions, covering 20 types of malware functions. We also construct a small dataset MalP for testing. We detail this procedure in Section 4.3.

Training MalT5: We use CodeT5+ [23] as the foundation model of our MalT5. We sequentially fine-tune the proposed MalT5 model on the MalS dataset and an existing benign pseudocode summarization dataset [19]. The underlying intuition is that the malicious semantic knowledge from malware source code summarization and function patterns from benign pseudocode summarization, which are learned from the above two datasets, respectively, can be transferred to the generation of malware pseudocode. By doing so, we can properly mitigate the issue of unavailable malware pseudocode summarization datasets. More details are available in Section 4.4.

Performing Generation: We use an existing tool [15] to generate pseudocode of a binary file and then generate summaries using the MalT5 model. We first use IDA to construct the malware call graph and then develop an algorithm to transform the graph into a function list in reverse order. Then we iteratively feed the first function in the list to MalT5 to generate its summary. More details are provided in Sections 4.1 and 4.2.

Conducting Evaluation: Previous work [24] indicated that existing metrics for generation tasks, such as Bilingual Evaluation Understudy (BLEU) [25], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [26], and Recall-Oriented Understudy for Gisting Evaluation-Longest Common Subsequence (ROUGE-L) [27], may not be well suited to evaluating binary malware summarization. We thus employ BLEURT-sum, which is more sensitive to the quality of the pseudocode summary, thereby benefiting the evaluation in practice. More descriptions are given in Section 5.1.

We conduct experiments on three datasets to verify the effectiveness of the proposed Malsight framework for binary malware summarization. The contributions of this paper can be summarized as follows. (We will release our Malsight to contribute to the community.)

• A binary malware summarization framework. We propose Malsight, a novel framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Our MalT5 can tackle the challenges of entangled logic and stripped semantics in pseudocode.

• Large-scale datasets for binary malware summarization. We propose MalS and MalP, two novel datasets that can be used for training and testing an LLM for binary malware summarization. To the best of our knowledge, the two datasets are the first in the field, involving nearly 90,000 malicious source functions and 20 types. Our MalS and MalP can serve as a benchmark for various binary malware understanding tasks.

• An LLM-based binary malware summarization model. We propose MalT5, a novel LLM for the summarization task. The proposed MalT5 is lightweight, with only 0.77B parameters.

• An evaluation metric for the task. We present BLEURT-sum, a novel evaluation metric that is more sensitive to the quality of pseudocode summarization.

• Extensive experiments. We conduct extensive experiments on three datasets and provide case studies to show why the proposed framework performs best among all baselines. Results show that our MalT5 achieves comparable performance to ChatGPT3.5.

2 Background

2.1 Malware Analysis Engineering

The field of malware analysis engineering focuses on analyzing the functionality of malware by examining its binaries, typically through static analysis methods that involve observing assembly code or pseudocode [28].

2.1.1 Binary Decompilation

Decompilation [29] converts executable files into human-readable pseudocode [30], which is more concise and structured than disassembled assembly code. Unlike disassembly, which maps instruction encodings directly to assembly statements, decompilation relies on algorithms and patterns, as implemented in tools such as R2 [31], IDA [15], and Ghidra [16], or on emerging methods using LLMs [32]. However, pseudocode lacks semantic information such as function names. Decompiled function names are often unreadable (e.g., sub_4061C0 in IDA Pro) [33], providing little useful information for further analysis.

2.1.2 Human Static Analysis

In static analysis, human experts start analyzing from the function entry point [34], inferring functionality from system Application Programming Interface (API) calls, string information, and pseudocode logic. Their main challenge is accurately identifying the core function [35] among numerous functions and methodically tracing the function call [36] process to understand the functionality comprehensively. To assist in this process, we developed Machine Learning-based (ML-based) Malsight, which optimizes and facilitates binary malware analysis.

2.2 NLP Technologies

In the Malsight process, we use Bidirectional Encoder Representation from Transformers (BERT) [37] and Text-to-Text Transfer Transformer (T5) [38] architecture language models to complete specific tasks. For the core code summary task, we build a CodeT5+ model combined with transfer learning.

2.2.1 BERT Family

BERT is a large-scale transformer-based language model pre-trained on a wide corpus of text using a self-supervised learning approach. The design of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) training tasks makes BERT perform well in the tasks of Sequence Labeling (SL), such as Named Entity Recognition (NER).

Building on BERT, CodeBERT [39] learns code semantics through Code-Conditioned masked language modeling (CMLM) and natural language documentation generation (NLG). CodeBERT has been shown to perform well on code-related tasks, and since it is derived from BERT, we have reason to believe that this model can be fine-tuned to solve the problem of SL in pseudocode.

2.2.2 T5 Family

T5 [38], or Text-to-Text Transfer Transformer, is a sequence-to-sequence model based on the Transformer architecture that unifies various Natural Language Processing (NLP) tasks into a single framework, including text classification, question answering, summarization, translation, and text generation.

CodeT5 [40] is an encoder-decoder model supporting code understanding and generation, built on the T5 architecture. It uses Natural Language-Programming Language (NL-PL) bimodal data for pre-training with identifier tagging and masked identifier prediction tasks. CodeT5+ [23] introduces greater architectural flexibility and additional pre-training tasks, with instruction tuning to enhance alignment with natural language instructions. This results in significant performance improvements on various code-related tasks.

Previous works like HexT5 [20] and BinT5 [19] developed datasets to train models for binary code understanding, including code summarization tasks. These efforts demonstrate the potential of T5-based models in binary code summarization for malware analysis.

2.2.3 Transfer Learning

Transfer learning involves training a model on a source task with abundant labeled data to learn general features. When it is difficult to obtain sufficient datasets for training, transfer learning can be used to supplement them with similar or related datasets [41].

In Malsight, we fine-tune the CodeT5+ [23] model to achieve transfer learning from the source code summarization task to the decompiled code summarization task. Besides, we use dynamic and static annotation to implement feature enhancement to compensate for the poor transfer effect caused by the highly limited similarity between the source code and the stripped decompiled code.

2.3 Code Summary Evaluation

In code summary model evaluation, NLP text similarity algorithms compare generated results with a reference test set, replacing costly human evaluations. These algorithms are categorized into word overlap and word embedding measures.

2.3.1 Words’ Overlap Measure

Early text similarity measures like BLEU [25] and ROUGE [27] rely on word n-gram overlap between generated and reference text, with BLEU focusing on precision and ROUGE on recall. However, they lack semantic understanding. METEOR [26] integrates n-gram overlap and semantic similarity using WordNet, providing additional semantic insight.

Recent work [42] highlights limitations of words’ overlap in code summary tasks. It shows that similar structures may yield high similarity scores despite differing semantics.
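The limitation can be illustrated with a toy overlap score. The sketch below is only an illustration under stated assumptions: `unigram_f1` is a hypothetical helper (a simplified unigram overlap, not BLEU or ROUGE themselves), and the three sentences are invented for this example rather than taken from any dataset.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (a toy overlap score)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (hypothetical, for illustration only).
reference = "this function writes the buffer to the file"
same_structure = "this function reads the key from the registry"
same_meaning = "saves data into a file on disk"

score_structure = unigram_f1(same_structure, reference)
score_meaning = unigram_f1(same_meaning, reference)
print(score_structure, score_meaning)
```

In this toy case the sentence that only shares boilerplate wording with the reference scores higher than the sentence that describes a similar behavior in different words, which is the kind of misjudgment a purely lexical measure can produce.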

2.3.2 Words’ Embedding Measure

The words’ embedding measure evaluates semantic similarity by analyzing the distance between sentence embeddings in a vector space, often utilizing neural network learning.

word2vec [43] is a static embedding model that represents words as points in a vector space, facilitating the proximity of semantically similar words. MoverScore [44] uses an n-gram optimized Word Mover’s Distance (WMD) [45] to measure similarity and employs various embedding models like ELMo [46] and BERT.

BLEURT [47] stands out as a versatile metric designed for assessing various natural language generation tasks, which combines the advantages of both Words’ Overlap Measure and Words’ Embedding Measure. It achieves this by integrating diverse lexical and semantic-level supervision signals into its pre-training process and leveraging synthetic data based on pre-trained BERT, ensuring its effectiveness and versatility in various evaluation scenarios.

3 Motivation and Overview

The construction of the code summary framework mainly includes annotation generation and code summary model construction, as shown in Figure 2. During the evaluation phase, we tested several evaluation methods and found a reasonable way to build an evaluation model for the code summary task. Simultaneously, our work involves the construction of multiple datasets (for subsequent stages of training and evaluation of transfer learning-based models).

3.1 Code Summarization Process

The code summary task is split into three steps, which are function list extraction, annotation generation, and code LLM summary.

3.1.1 Function List Extraction

As mentioned, existing binary code summarization methods focus only on the internal information of a function. We introduce the call relationships between functions and take the entire binary as the processing unit. For example, when function func_E in Figure 2 calls func_F, it is difficult for the subsequent code summarization model to correctly summarize the functionality of func_E without any information about func_F. (We assume that the function name of func_F has been corrupted.) Constructing a function list in reverse call order provides a basis for the subsequent recovery of sub-function functionality.

3.1.2 Annotation Generation

We iterate through the function list. Assuming func_F has already been processed, func_E is first annotated by the static annotator and the dynamic annotator. The static annotator derives static annotations from the internal information of the function code, while the dynamic annotator adds dynamic annotations based on the generated summary of the sub-function (func_F) called in func_E. In other words, the program sequentially restores functions according to the Control Flow Graph (CFG) from the outermost to the innermost and passes function summary results inward.

3.1.3 Code LLM Summary

The annotated func_E is then fed into the code summary model for final code summary generation. In our work, we use transfer learning to adapt the model to both the functionality of malware functions and the structural features of decompiled pseudocode. We fine-tune the CodeT5+ model on the code summary task. The tokenizer splits the code into tokens, which are embedded and combined by the self-attention mechanism of the encoder into contextual vectors. The decoder of the fine-tuned model then outputs the predicted code summary.

3.2 Evaluation Method

Our research has found that existing methods cannot simultaneously measure the meaning, structure, word frequency, and other features of the reference sentence and the candidate sentence, so the model’s performance may be misjudged. Figure 3 gives two examples showing the evaluation results of BLEU, METEOR, and ROUGE-L on two pairs of real code summaries. The figure shows that two semantically unrelated code summaries receive high evaluation scores (blue-framed), while two semantically similar code summaries receive low scores (red-framed), demonstrating the shortcomings of existing methods. In the following work, we construct an ML-based code summary evaluation method, BLEURT-sum, by building a set of positive and negative samples composed of related sentence pairs and unrelated sentence pairs. We evaluate the usability of the model and prior art by measuring their ability to distinguish between positive and negative samples.
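One possible way to quantify how well a metric distinguishes positive from negative pairs is a pairwise ranking score. The sketch below is an assumption for illustration, not the evaluation protocol of this paper: `separation`, `word_overlap`, and the sentence pairs are all hypothetical, and `word_overlap` is only a placeholder for whichever metric is being assessed.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (generated summary, reference summary)


def separation(metric: Callable[[str, str], float],
               positives: List[Pair], negatives: List[Pair]) -> float:
    """Fraction of (positive, negative) comparisons the metric ranks correctly.

    Ties count as half, so 0.5 means the metric cannot tell the groups apart.
    """
    pos_scores = [metric(g, r) for g, r in positives]
    neg_scores = [metric(g, r) for g, r in negatives]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def word_overlap(generated: str, reference: str) -> float:
    # Placeholder metric: Jaccard similarity of the two word sets.
    a, b = set(generated.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


# Invented pairs: related sentences are positives, unrelated ones negatives.
positives = [("encrypts the file with a key", "encrypts a file using the key"),
             ("opens a socket", "opens a network socket")]
negatives = [("encrypts the file with a key", "opens a network socket"),
             ("deletes the registry value", "opens a socket")]

print(separation(word_overlap, positives, negatives))
```

A score near 1.0 means the metric consistently ranks related pairs above unrelated ones, while a score near 0.5 means it is no better than chance. The tiny invented sample here says nothing about how real metrics behave on real summaries.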

3.3 Datasets Construction

For the two core tasks mentioned above, binary code summary and code summary model evaluation, we build corresponding datasets.

3.3.1 Dataset For Code Summary Model

In order to avoid the data shift problem, the training of the code summary model requires a large dataset of malware pseudocode summaries. Unfortunately, malware datasets are typically represented as collections of compiled binary files [48], with the binary code stripped and possibly structurally obfuscated. Consequently, creating code summary datasets for binary malware could be infeasible without resorting to labor-intensive manual summarization.

In this paper, our key insight is that the code summary model requires two capabilities: understanding of malware functionality and adaptability to decompiled pseudocode formats (including the ability to deal with annotated code). Therefore, we build source-based malware datasets and pseudocode-based benign software datasets to train the model on these two capabilities separately. Based on Sourcefinder [49], we were able to find malware source repositories on GitHub. We generate descriptive labels for the extracted malware functions using a sophisticated language model, followed by manual verification and optimization. Meanwhile, we use the Capybara dataset provided by BinT5 [19] (a benign software dataset) to train the model’s adaptability to pseudocode structure.

3.3.2 Dataset For Evaluation Model

In the evaluation phase, the evaluation method is used to measure the similarity between the model generation results and the reference results to evaluate the quality of the model generation. Given two sentences (generation results and the reference results) $S_{g}$ and $S_{r}$, most evaluation methods output a score $Score$ as the evaluation result. Therefore, if considering the use of machine learning methods, it is necessary to construct a dataset in $\{S_{g},S_{r},Score\}$ format with $Score\in[0,1]$. The challenge is that when a dataset of $\{S_{g},S_{r}\}$ is obtained, it is difficult to obtain an accurate $Score$. In our subsequent work, we propose a reasonable algorithmic flow for constructing the labeled dataset $EvaS$.

3.3.3 Dataset For Static Annotator

In the annotation generation process, the static annotator includes a core information extraction module (described in detail in Section 4.2.1). Because it is difficult to complete the required functions accurately using static methods, we use a machine learning model to perform sequence labeling on the pseudocode. We construct the dataset AnnoS to provide the data required for training and testing this model.

TABLE I: The Proposed Datasets

Sets for code summary model
Dataset	Size (functions)	Code language	Annotated?	Usage
MalS	89,609	C	No	Train (phase 1)
MalP	500	pseudocode	Yes	Test
BenignC	96,835	pseudocode	Yes	Train (phase 2)

Sets for annotation extractor model
Dataset	Size (functions)	Code language	Avg. annotations	Usage
AnnoS	95,000	pseudocode	3.87	Train & Test

Sets for evaluation model
Dataset	Size (pairs)	Pos:Neg	Avg. length	Usage
EvaS	127,510	1:1	9.6	Train & Test

To sum up, we mainly completed the construction of three sets of datasets in different application fields, as shown in Table I.

4 Code Summarization Workflow

Following our breakdown of the malware code summary task in Figure 2, our implementation first extracts the reverse function list, and then sequentially generates static and dynamic annotations for the items in the function list, and finally passes them into the code summary model.

In this process, we design and train two domain-specific large models, apply a general-purpose large model, and implement several algorithms. We cover the implementation of each step separately in this section.

4.1 Function List Extraction

As mentioned earlier in Section 3.1, in the first step of the workflow, we extract the list of reverse functions from the CFG of the malware binary.

Since existing methods [50] do not give a completely accurate CFG extraction flow, we implement a pluggable CFG extraction module. It provides a processing scheme from the binary file $B_{Mal}$ to a digraph $G_{M}$ that serves as the CFG. An inverse topological traversal algorithm then extracts the function list from the CFG in the opposite direction of the call chain, expressed as $L_{G_{M}}=[f_{1},f_{2},...,f_{n}]$ with $n=|G_{M}.vertices|$, where $f_{i}$ represents the $i$th function in the inverse topological order of $G_{M}$.

In this study, we construct the algorithm $REsort$, expressed as $L_{G_{M}}=REsort(G_{M})$ (see Appendix B for details). By applying algorithms such as Tarjan [51], Dijkstra [52], and Depth First Search (DFS), this approach addresses the obstacles caused by cyclic calls and partially connected graphs when generating reverse-order lists. The function list $L_{G_{M}}$ follows the order from outside to inside in the function call graph, so that a later-called function appears first in the function list and is processed first in subsequent steps.

The following process takes the function instances from $L_{G_{M}}$ in a forward order to achieve the order of restoration from the outer layer of the CFG diagram to the inner layer, specifically, recovering from the outer API call to the main function.
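The ordering requirement can be sketched with Tarjan's strongly connected components algorithm, which emits components callee-first when edges point from caller to callee. This is a minimal illustration and not the $REsort$ algorithm of Appendix B: the function `callee_first_order`, the dictionary-based graph format, and the example function names are all assumptions made for this sketch.

```python
from typing import Dict, List


def callee_first_order(call_graph: Dict[str, List[str]]) -> List[str]:
    """Return functions so that callees precede their callers.

    call_graph maps each caller to the functions it calls. Functions in a
    call cycle form one strongly connected component and are kept adjacent.
    """
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack = set()
    order: List[str] = []
    counter = 0

    def visit(node: str) -> None:
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for callee in call_graph.get(node, []):
            if callee not in index:
                visit(callee)
                lowlink[node] = min(lowlink[node], lowlink[callee])
            elif callee in on_stack:
                lowlink[node] = min(lowlink[node], index[callee])
        if lowlink[node] == index[node]:
            # node is the root of a component; pop the whole component.
            while True:
                member = stack.pop()
                on_stack.discard(member)
                order.append(member)
                if member == node:
                    break

    # Start from every function so disconnected parts are also covered.
    nodes = list(call_graph)
    for callees in call_graph.values():
        nodes.extend(callees)
    for node in nodes:
        if node not in index:
            visit(node)
    return order


graph = {
    "main": ["func_E", "func_G"],
    "func_E": ["func_F"],
    "func_F": [],
    "func_G": ["func_H"],
    "func_H": ["func_G"],  # cyclic call between func_G and func_H
    "orphan": ["func_F"],  # not reachable from main
}
order = callee_first_order(graph)
print(order)
```

The recursive form is kept for readability; a call graph with very deep call chains would need an iterative variant to stay within Python's recursion limit.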

4.2 Annotation Generation

In the order of traversal provided by the reverse function list, annotations are added to each function in turn to provide richer information. Annotations can be divided into static and dynamic types.

We have observed that attackers frequently employ techniques like stripping to hinder reverse engineering efforts. This process removes crucial symbolic information, such as identifier names, from the binaries, leading to significant semantic loss in the generated pseudocode. As a result, code summary models for malware face substantial challenges, and the performance of code summarization tasks is adversely affected.

To address this issue, we propose utilizing dynamic and static annotation to supplement the semantic information and enhance the features of the stripped pseudocode as Figure 4 shows. This approach aims to compensate for the poor transferability caused by the stark dissimilarity between the source code and the stripped decompiled code.

4.2.1 Static Annotation

Based on our long-term exploration of pseudocode for stripped malware, we deem that although the stripped pseudocode has a serious semantic loss, some extremely critical API calls (such as operating system APIs) and some special forms of strings that are preserved after stripping provide us with ideas for behavior analysis and semantic recovery of malware.

Consequently, we consider building a static annotation module to provide additional information for subsequent code summaries. In general, the static annotation process can be divided into three parts: sequence labeling, online retrieval and annotation generation.

Sequence labeling model: By manually labeling approximately 300,000 tokens within nearly 80,000 functions, we construct a labeled dataset $Set_{CSL}$ based on $BenignS$ to train the sequence labeling model.

Formally, the function $f_{i}$ is first sliced into an n-token code sequence by the tokenizer $T$. The n-token code sequence $s_{i}=T(f_{i})=\{t_{0},t_{1},...,t_{n-1}\}$ and the labels of the tokens $L_{i}=\{l_{0},l_{1},...,l_{n-1}\}$ are paired position by position into $d_{i}=\{(t_{j},l_{j})\,|\,0\leq j<n\}$. Equation (1) formalizes this dataset.

$$Set_{CSL}=\{d_{0},d_{1},\ldots,d_{N}\} \tag{1}$$

Subsequently, we use the CodeBERT [39] model to train the sequence labeling task on the dataset $Set_{CSL}$.

$B$ is the CodeBERT base model, and $C$ is the classifier head. When a sequence is input, the complete model outputs the predicted labels of its tokens. Equations (2) and (3) formalize this.

$$B(s_{i})=\{o_{0},o_{1},\ldots,o_{n-1}\} \tag{2}$$

$$C(o_{i})=y_{i} \tag{3}$$

We further formalize the objective function in Equation (4), where $l_{i_{c}}$ is the ground-truth label and $y_{i_{c}}$ is the Softmax probability for the $c^{th}$ class.

$$LF=-\sum\limits_{c=0}^{2}l_{i_{c}}\log{y_{i_{c}}} \tag{4}$$

By optimizing $LF$ , $B$ and $C$ are trained simultaneously. This necessitates that the model effectively classifies code-tokens to accurately label the key API calls and special strings within the pseudocode.
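As a small numeric illustration of Equation (4), the sketch below computes the per-token cross-entropy over three classes. It is not the training code of the sequence labeling model: the helper names, the logit values, and the interpretation of the three class indices are assumptions made only for this example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_loss(logits: List[float], true_class: int) -> float:
    """Cross-entropy of Equation (4) for one token with a one-hot label."""
    probs = softmax(logits)
    one_hot = [1.0 if c == true_class else 0.0 for c in range(len(probs))]
    return -sum(l * math.log(y) for l, y in zip(one_hot, probs))


# Hypothetical classifier outputs for one token over classes 0, 1, 2.
loss_uniform = token_loss([0.0, 0.0, 0.0], true_class=1)
loss_confident_right = token_loss([5.0, 0.0, 0.0], true_class=0)
loss_confident_wrong = token_loss([5.0, 0.0, 0.0], true_class=1)
print(loss_uniform, loss_confident_right, loss_confident_wrong)
```

With a one-hot label only the term of the true class survives, so the loss reduces to the negative log-probability assigned to that class: it equals $\log 3$ for a uniform prediction and grows when the model is confident in the wrong class.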

Label to annotation: Once we can get the key API calls and special strings in the stripped pseudocode, we use GitHub Code Search [53] to retrieve the relevant context in the GitHub repositories. In the implementation, we keep the first three blocks of the search results (this is because the code search has already sorted the relevance of the results [54]).

The outcome of random sampling and manual discrimination reveals that approximately 54.8% of the function contexts contain code comments closely associated with the functionality of the function. For the remaining functions, nearly 90% also offer contextual information related to the function’s operation, such as parameter names, interconnected functions, and processing logic. Only a small number of functions yield invalid search results.

The filtered and preprocessed code snippets are then fed into a prompt-based general-purpose model to generate the static annotations.

4.2.2 Dynamic Annotation

As shown in Figure 4 (the blue parts represent the steps in which the annotation is added), the summary of the callee is provided to the caller as a complement to the semantic information, which we define as dynamic annotation. This is consistent with how reverse engineers actually analyze binary malware, i.e., analyzing the call relationship from the inner layer of the CFG to the outer layer (corresponding to the function list generated in Section 4.1) and summarizing the functions from the outside in (corresponding to the passing of dynamic annotations). In this way, we can make full use of the dynamic behavior characteristics implied by the call relationships between functions in the pseudocode.
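The passing of callee summaries to callers can be sketched as a simple loop over the ordered function list. This is a hedged illustration rather than the Malsight implementation: `summarize_binary`, the comment-style annotation format, the example function names, and `toy_summarizer` (a placeholder standing in for the summarization model) are all hypothetical.

```python
from typing import Callable, Dict, List


def summarize_binary(
    ordered_functions: List[str],
    pseudocode: Dict[str, str],
    callees: Dict[str, List[str]],
    summarize: Callable[[str], str],
) -> Dict[str, str]:
    """Summarize functions callee-first, passing callee summaries to callers."""
    summaries: Dict[str, str] = {}
    for name in ordered_functions:
        notes = [
            f"// {callee}: {summaries[callee]}"
            for callee in callees.get(name, [])
            if callee in summaries  # skip callees not yet summarized (e.g. cycles)
        ]
        # Dynamic annotations are prepended to the function's pseudocode.
        annotated = "\n".join(notes + [pseudocode[name]])
        summaries[name] = summarize(annotated)
    return summaries


def toy_summarizer(annotated_code: str) -> str:
    # Placeholder for the summarization model: reports how many
    # dynamic-annotation lines it received.
    count = sum(line.startswith("// ") for line in annotated_code.splitlines())
    return f"summary using {count} callee note(s)"


result = summarize_binary(
    ordered_functions=["sub_401000", "sub_402000"],  # callee listed first
    pseudocode={"sub_401000": "return a1 ^ 0x5A;", "sub_402000": "sub_401000(v1);"},
    callees={"sub_402000": ["sub_401000"]},
    summarize=toy_summarizer,
)
print(result)
```

Because the list is processed callee-first, the summary of `sub_401000` already exists when `sub_402000` is reached, so the caller is summarized with knowledge of what its otherwise anonymous callee does.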

4.3 Building Malware Datasets

In the traditional scheme of building datasets for decompiled code, the datasets are built at the function level [55]. The source function $f_{n}^{Sou}$ is compiled and linked with other modules to generate an executable file containing $f_{n}^{Bin}$, which is then decompiled to obtain the pseudocode form $f_{n}^{pse}$ of the corresponding function. Equation (5) formalizes this dataset, where $SUM()$ is the extraction method for the code summary.

$$Set_{ideal}=\{(f_{n}^{pse},f_{n}^{sum})\,|\,f_{n}^{sum}=SUM(f_{n}^{pse})\} \tag{5}$$

Out of 2,289 GitHub repositories that were determined to be malware, we extracted close to 30K functions. We filter out duplicate functions, functions shorter than five lines, and malformed functions, resulting in a dataset of 89,609 functions. (A lack of strict filtering may lead to overlap between the train and test sets, consequently yielding inflated results.)
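A filtering pass of this kind might look like the following sketch. The concrete rules here are assumptions, not the authors' pipeline: `filter_functions` is a hypothetical helper, line counts ignore blank lines, a brace-balance test stands in for the unspecified malformed-function check, and duplicates are detected after whitespace normalization.

```python
import hashlib
from typing import Iterable, List


def filter_functions(functions: Iterable[str], min_lines: int = 5) -> List[str]:
    """Drop short, malformed, and duplicate functions."""
    seen = set()
    kept: List[str] = []
    for source in functions:
        lines = [ln for ln in source.splitlines() if ln.strip()]
        if len(lines) < min_lines:
            continue  # shorter than five non-empty lines
        if "{" not in source or source.count("{") != source.count("}"):
            continue  # crude stand-in for a "malformed function" check
        # Hash a whitespace-normalized copy so formatting-only variants collide.
        key = hashlib.sha256(" ".join(source.split()).encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(source)
    return kept


good = "int f(int a) {\n  int b = a + 1;\n  b *= 2;\n  return b;\n}"
duplicate = "int f(int a) {\n    int b = a + 1;\n    b *= 2;\n    return b;\n}"
short = "int g() {\n  return 1;\n}"
broken = "int h(int a) {\n  if (a) {\n    a++;\n  return a;\n}"

kept = filter_functions([good, duplicate, short, broken])
print(len(kept))
```

Normalizing whitespace before hashing only catches formatting-level duplicates; near-duplicates that differ in identifiers or constants would pass this check and could still leak between train and test sets.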

Unlike benign open-source projects that are well maintained, most of the functions we extract do not have comments in context that we could use as code summary labels. Fortunately, the semantic information in the source code is rich enough that we can use a well-designed prompt to generate the code summaries via GPT3.5-Turbo [56]. In this way, we obtain the dataset MalS in the format of Equation (6). Further, we build a 500-function dataset MalP, in the format of Equation (7), for testing our model. MalP is relatively small because it was obtained by manually compiling and decompiling the Git repositories. MalP is compiled with the makefiles provided by the developers, so we do not specify the configuration related to compilation optimization.

$$Set_{MalS}=\{(f_{n}^{Sou},f_{n}^{sum})\,|\,f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{6}$$

$$Set_{MalP}=\{(f_{n}^{pse},f_{n}^{sum})\,|\,f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{7}$$

As mentioned above, the malware training set we built was made up of source code, so another dataset was needed to help our model understand the stripped function features. In the transfer learning option, we used the Capybara dataset (a benign software dataset) provided by BinT5 [19] and annotated it to provide our model with adaptations to the annotated code summary task. It is provided as $Set_{Capybara}$ , and we process it as $Set_{benignC}$ , where the $ANN()$ is the static annotation generation method to form PL-NL bimodal data $f_{n}^{ann}$ .

$$Set_{Capybara}=\{(f_{n}^{pse},f_{n}^{sum})\,|\,f_{n}^{sum}=SUM(f_{n}^{Sou})\} \tag{8}$$

$$Set_{benignC}=\{(f_{n}^{ann},f_{n}^{sum})\,|\,f_{n}^{ann}=ANN(f_{n}^{pse})\} \tag{9}$$

In the process of training the code summary model, we use $Set_{MalS}$ and $Set_{benignC}$ to complete the transfer learning process, and $Set_{MalP}$ to test during the evaluation phase.

For $Set_{MalS}$, 25% of the samples were extracted for thematic analysis of function functionality, as shown in Figure 5. Our analysis results confirm that $Set_{MalS}$ and $Set_{benignC}$ have significantly different theme distributions. Among the large number of security-related functions in the former, a non-negligible share of the code concerns locks, permissions, and encryption, while the latter contains a large amount of code related to game logic and driver calls, which is irrelevant to malware.

4.4 Code Summary Model

In Malsight, we fine-tune the CodeT5+ model to enable transfer learning from source code summarization to decompiled pseudocode summarization. To achieve this, we have designed two distinct phases of fine-tuning to facilitate smooth transfers that accommodate variations in data characteristics and distribution biases across different datasets.

Phase 1. In the initial phase, we fine-tune the model on $Set_{MalS}$, the dataset of malware source code that we built. The source code of malware has a semantic structure similar to that of ordinary code, but it also has behavioral or semantic features that ordinary code lacks. These features appear in code fragments with malicious purposes, such as self-replication and propagation, illegal access to system resources, and vulnerability exploitation. Through this phase of fine-tuning, the model learns the behavioral and semantic features of such malicious code. In this phase, the adverse effects of data shift are addressed by using $Set_{MalS}$, which has the same functional distribution as real-world malware.

Phase 2. In this phase, we ask the model to learn semantic information from the pseudocode and the annotation text simultaneously, so that it can generate higher-quality code summaries. We use the PL-NL bimodal dataset BenignC to fine-tune the encoder of the model obtained in the previous phase, while the decoder's parameters are frozen. By fine-tuning only the encoder, we allow the model to adapt to the specific task at hand without updating the weights too much. This not only leads to a more robust and generalized understanding of the data but also shortens training time, since fewer parameters are trained. In particular, the raw input is processed into the standardized format of equation (10), where $t_{c_{i}}$ are the tokens of a code sequence, $t_{a_{i}}$ are the tokens of the corresponding annotation, and $t_{sep}$ is a special token in CodeT5+ that separates inputs of different modes. Inputs in this format help the encoder distinguish the two modes and learn the correlation between them.

f_{n}^{ann}\to\{t_{c_{0}},...,t_{c_{n-1}},t_{sep},t_{a_{0}},...,t_{a_{m-1}}\}

(10)
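The input layout of equation (10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the separator string `"<sep>"` and the helper name `build_bimodal_input` are hypothetical, and real inputs would be produced by the CodeT5+ tokenizer rather than by pre-split token lists.

```python
from typing import List

# Hypothetical placeholder; the actual separator token is defined by the tokenizer.
SEP_TOKEN = "<sep>"


def build_bimodal_input(code_tokens: List[str],
                        annotation_tokens: List[str],
                        sep_token: str = SEP_TOKEN) -> List[str]:
    """Concatenate code tokens, one separator, and annotation tokens (equation (10))."""
    return list(code_tokens) + [sep_token] + list(annotation_tokens)


# Toy example: n = 3 code tokens and m = 2 annotation tokens.
example = build_bimodal_input(
    ["sub_401021", "(", ")"],
    ["OpenProcessToken", "AdjustTokenPrivileges"],
)
print(example)
```

The resulting sequence has length n + 1 + m, with the separator marking the boundary between the programming-language and natural-language parts.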

After the above two phases of fine-tuning, the model can accept bimodal input composed of pseudocode and annotation text and generate a code summary. We use this model to summarize the entire malware in the reverse of the function call order.
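One plausible reading of "reverse of the function call order" is a post-order traversal of the call graph, so that callees are summarized before their callers. The sketch below illustrates that reading only; the function name `reverse_call_order` and the toy call graph are hypothetical and are not taken from the Malsight implementation.

```python
from typing import Dict, List, Set


def reverse_call_order(call_graph: Dict[str, List[str]], entry: str) -> List[str]:
    """Return functions reachable from `entry`, callees first (post-order DFS)."""
    order: List[str] = []
    visited: Set[str] = set()

    def visit(fn: str) -> None:
        if fn in visited:  # also guards against recursive call cycles
            return
        visited.add(fn)
        for callee in call_graph.get(fn, []):
            visit(callee)
        order.append(fn)  # appended only after all of its callees

    visit(entry)
    return order


# Toy call graph: main calls sub_A and sub_B, which both call sub_C.
graph = {"main": ["sub_A", "sub_B"], "sub_A": ["sub_C"], "sub_B": ["sub_C"], "sub_C": []}
order = reverse_call_order(graph, "main")
print(order)
```

With this ordering, the summary of each callee is available when its caller is processed, which is what allows summaries to be built up iteratively.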

5 Evaluation Method

As mentioned in Section 3.3.2, the evaluation algorithm for code summary tasks should accept the reference as input and separate available and unavailable generated results. In this section, we introduce our exploration of code summary dataset construction (EvaS) and evaluation model construction (BLEURT-sum) respectively.

5.1 Evaluation Dataset Construction

Using the code summaries from the MalS dataset as a foundation, we curated positive and negative sample pairs for tuning our new evaluation model, BLEURT-sum. A positive sample consists of two code summary sentences sharing the same meaning and is initialized as $\{S_{g},S_{r},1\}$, while a negative sample comprises two randomly chosen different sentences and is initialized as $\{S_{g},S_{r},0\}$. $S_{g}$ and $S_{r}$ represent the generated statement and the reference statement, respectively.

In order to build the dataset in $\{S_{g},S_{r},Score\}({Score}\in[0,1])$ format, one possible idea is to build an algorithm, $Score=GenSim(S_{g},S_{r},0or1)$, to automatically generate the label $Score$ required for model training. Therefore, the key task is to construct a reliable $GenSim()$ function. When $S_{r}=SUM(f_{n})$, the constructed $Score$ can be expressed using equation (11). Note that SUM() is a non-deterministic function: it produces different outputs when given the same input.

S_{r}=\text{SUM}(f_{n})\implies Score=\begin{cases}1&\text{if }S_{g}=\text{SUM}(f_{n})\\ 0&\text{if }S_{g}=\text{SUM}(f_{\neg n})\end{cases}

(11)

Since no sentence-structure dependencies are taken into account when the $\{S_{g},S_{r}\}$ pairs are built, the Score in equation (11) contains few word-overlap-based features (§ 2.3). A natural step is therefore to combine the characteristics of sentence structure and semantics. Our idea is to determine the proportions of semantic information and sentence-structure information in sentence similarity evaluation.

Taking the original 0/1 tag as the semantic feature, we further extract a static structural feature score, weight the two by their corresponding proportions, and add them to get $Score$. Assuming that the proportion of semantic information is $p$ and the structural feature calculation function is $Struc()$, the following equations formalize this Score construction method.

	$\displaystyle s_{f}=Struc(S_{g},S_{r})$		(12)
	$\displaystyle Score=\begin{cases}p+(1-p)s_{f}&\text{if }S_{g}=\text{SUM}(f_{n})\\ (1-p)s_{f}&\text{if }S_{g}=\text{SUM}(f_{\neg n})\end{cases}$		(13)

In the implementation, we use the arithmetic average of BLEU, ROUGE-L, and METEOR as the metric for $Struc()$ (each has been normalized to the zero-one interval using a uniform probability distribution). We then solve a least-squares problem and obtain a semantic-to-structural ratio close to 1:4, which gives $p=1/5$.
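The label construction of equations (12) and (13) can be written out as a short sketch. It assumes the three structural metrics have already been computed and normalized to [0, 1]; the helper names `struc` and `gen_sim` mirror the paper's notation but are illustrative, not the authors' code.

```python
def struc(bleu: float, rouge_l: float, meteor: float) -> float:
    """Structural feature s_f: arithmetic mean of three normalized metrics."""
    return (bleu + rouge_l + meteor) / 3.0


def gen_sim(label: int, s_f: float, p: float = 0.2) -> float:
    """Training label Score from equation (13), with p the semantic proportion."""
    if label not in (0, 1):
        raise ValueError("label must be 0 (negative pair) or 1 (positive pair)")
    if not 0.0 <= s_f <= 1.0:
        raise ValueError("s_f must lie in [0, 1]")
    # Positive pair: p + (1 - p) * s_f; negative pair: (1 - p) * s_f.
    return p * label + (1.0 - p) * s_f


s_f = struc(0.3, 0.6, 0.3)   # toy metric values
positive = gen_sim(1, s_f)   # pair with the same meaning
negative = gen_sim(0, s_f)   # randomly mismatched pair
print(positive, negative)
```

Because $s_{f}\in[0,1]$ and $p\in[0,1]$, the resulting Score always stays in $[0,1]$, and a positive pair always scores exactly $p$ higher than a negative pair with the same structural similarity.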

5.2 Evaluation Model

Based on our insights into the evaluation model, it should be able to combine both structural and semantic features and give an evaluation score for the specific task of code summary.

BLEURT is based on BERT and adds an additional pre-training step on synthetic data between pre-training and fine-tuning. The synthetic data consists of sentence pairs $<z,\widetilde{z}>$, where $z$ is a randomly selected sentence and $\widetilde{z}$ is its perturbed version. In the additional pre-training of BLEURT, a series of pre-training signals $(\tau_{1},\tau_{2},...\tau_{9})$ is used to align the model with the desired result. BLEURT uses the sentence pair scores of BLEU, ROUGE, and BERTScore as signals $\tau_{1}$ to $\tau_{3}$, and uses back-translation of the sentence pairs to generate $\tau_{4}$ to $\tau_{7}$.

The rich training signals ensure the universality of BLEURT and the comprehensiveness of the evaluation angle. We then further fine-tune the BLEURT model on the code summary sentence pair dataset. First, the generated text and the reference text are input together into the model and the vector is generated as shown in equation (14). This solution is called BLEURT-sum.

v_{[CLS]},v_{S_{g1}},...,v_{S_{gn}},...,v_{S_{rn}}=BLEURT(S_{g},S_{r})

(14)

Further, the model applies a linear layer to the [CLS] vector to obtain the predicted score, as shown in equation (15).

\hat{Score}=f(S_{g},S_{r})=W\tilde{v}_{[CLS]}+b

(15)

Finally, the model computes the loss between the constructed scores and the predicted scores and is optimized by gradient descent (equation (16)).

loss=\frac{1}{N}\sum\limits_{n=1}^{N}\left|\left|Score-\hat{Score}\right|% \right|^{2}

(16)
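Equations (15) and (16) amount to a linear regression head trained with a mean squared error. The following plain-Python sketch shows only that arithmetic; the vectors and weights are toy values, and an actual implementation would operate on the BLEURT [CLS] embedding inside a deep learning framework.

```python
from typing import Sequence


def predict_score(v_cls: Sequence[float], w: Sequence[float], b: float) -> float:
    """Equation (15): linear layer W * v_[CLS] + b producing a scalar score."""
    if len(v_cls) != len(w):
        raise ValueError("v_cls and w must have the same dimension")
    return sum(wi * vi for wi, vi in zip(w, v_cls)) + b


def mse_loss(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Equation (16): mean of squared differences over N samples."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Toy 2-dimensional [CLS] vector and head parameters.
score_hat = predict_score([1.0, 2.0], [0.5, 0.25], 0.1)
loss = mse_loss([1.0, 0.0], [0.5, 0.5])
print(score_hat, loss)
```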

In existing works, word overlap measures are commonly used in text similarity evaluation but perform poorly for code summary evaluation. Code summaries often include keywords like “return” and “initialize”, which do not significantly contribute to the overall meaning. These keywords can inflate overlap rates and lead to misleadingly high scores. Therefore, BLEURT-sum still considers word overlap measures on the one hand, and on the other hand introduces more dimensions, which greatly improves performance in code summary evaluation tasks.

6 Experiments

Our evaluation experiment was designed to answer the following four questions:

1. RQ1 (§ 6.2). How does Malsight's performance compare across the two training phases and against mainstream code summarization models?
2. RQ2 (§ 6.3). How do module combinations and different stripping scenarios affect Malsight’s performance, particularly the annotator module?
3. RQ3 (§ 6.4). How does the new BLEURT-sum evaluation method compare to existing code summarization evaluation metrics?
4. RQ4 (§ 6.5). How does Malsight perform when applied to real-world malicious software, and how do assessments from human reverse engineers validate its usability?

6.1 Experimental Setup

Our experiments run on Ubuntu 20.04, on a machine equipped with one Intel Xeon Silver 4210 CPU (2.20 GHz), two NVIDIA A40 GPUs, and 125 GB of RAM. Binary file processing tools include Radare2, IDA Pro v7.5, GCC v9.4.0, and GNU Make v4.2.1. Our programming language is Python v3.8.13, with transformers v4.16.2 and torch v2.1.2+cu121.

6.2 RQ1. Performance Test

6.2.1 Code Summary Performance

During training, the model undergoes two phases. The first phase involves fine-tuning on malware source code (MalS) to learn malicious behavioral features. The second phase (on BenignC) focuses on generating high-quality code summaries by incorporating semantic information from pseudocode and annotations. In both phases, we split the dataset into training, cross-validation, and test sets with a ratio of 7:2:1. We then evaluated the model at each stage to ensure its effectiveness. The results are shown in Table II.
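A 7:2:1 split of this kind can be sketched as follows. The shuffling, the fixed seed, and the helper name `split_7_2_1` are assumptions made for illustration; the paper does not state how the samples were ordered or seeded.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_7_2_1(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split samples into train / validation / test parts (7:2:1)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # local RNG keeps the split reproducible
    n_train = len(items) * 7 // 10
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder, roughly 10%
    return train, val, test


train, val, test = split_7_2_1(range(100))
print(len(train), len(val), len(test))
```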

Since each phase uses a different dataset, their effectiveness should be verified independently rather than by direct score comparison. Considering there is no universally accepted threshold for model evaluation scores, we provide the performance of another state-of-the-art binary code summarization model (CP-BCS [55]) as a baseline. CP-BCS was tested on its own dataset, which consists of binaries compiled with GCC 7.3.0 for x86 architecture (32-bit) and then stripped. Our results in both phases demonstrate higher usability compared to CP-BCS.

TABLE II: Scores During the Two Training Phases

	BLEURT-sum	BLEU	ROUGE-L	METEOR
Phase1	74.17	18.06	38.56	24.69
Phase2	72.74	22.61	41.19	25.37
CP-BCS(Baseline)	17.78	21.50	16.89	11.92

After completing the training step, we use MalP as the test set (to our knowledge, it is the only pseudocode summary dataset built for malware so far). We compare Malsight with existing pseudocode summarization methods and popular prompt-based general-purpose large models on MalP. In the actual experiment, we built three versions of MalP: not-stripped, demi-stripped (only the function name is stripped), and all-stripped (the function name and all identifiers inside the function are stripped).

The results on the all-stripped version, which is the version of the dataset closest to the real world, are shown in Table III.

TABLE III: Comparison with baseline work

	BLEURT-sum	BLEU	ROUGE-L	METEOR	AVG time (function)	AVG summary length	BLEURT-sum variance
BinT5	21.18	1.92	9.45	3.51	0.16	6.81	201.11
HexT5	27.67	2.71	11.23	3.22	0.13	7.11	216.81
WizardCoder-15B	53.43	7.75	23.16	13.71	2.44	32.20	303.03
Code Llama-7b	56.00	8.52	24.55	14.95	2.24	31.32	290.16
CodeT5+	17.18	1.74	4.17	2.54	0.1743	7.2869	161.34
WizardLM-2-7B	55.81	5.33	18.46	15.61	22.07	61.74	265.63
deepseek-llm-7b-chat	50.88	7.07	20.08	12.70	7.41	25.48	306.78
ChatGPT-3.5	60.09	9.96	25.19	16.54	-	25.27	296.81
Malsight (Ours)	62.14	9.80	25.11	16.87	2.51	35.27	131.65
* AVG time (function) is measured in seconds.
* AVG summary length is measured in words.

As shown in Table III, we evaluated Malsight, WizardLM [57], Code Llama [58], and other models using BLEURT-sum, BLEU, ROUGE-L, and METEOR (the usability of BLEURT-sum is substantiated in the referenced paper). We also measured summary length, variance, and processing time, as these factors can directly impact algorithm-based evaluations (see Appendix I for details).

As previously mentioned, general-purpose large models, trained on extensive corpora, possess broader knowledge, which gives them an advantage in code summarization tasks. Among these, GPT, as a commercial large model, performed the best with a BLEURT-sum score of 60.09. Malsight’s fine-tuning approach offers a significant advantage in summarizing malware code, with its annotation generation effectively bridging the knowledge gap between specialized and general-purpose models. Malsight achieved the highest scores among all methods in both BLEURT-sum and METEOR, which we consider two of the most reliable evaluation metrics.

This conclusion is corroborated by Figure 6, which shows that Malsight achieves better evaluation scores and lower variance, i.e., more stable code summary output, on the stripped test data.

6.2.2 Annotation Extraction Effect

The annotation extraction model was trained and tested on the AnnoS dataset to complete the sequence labeling (SL) task, which divides the data into normal code (N-label), important function APIs (A-label), and important strings (S-label, shown as I-label in Table IV). The model test produces the confusion matrix shown in Table IV below.

TABLE IV: Annotation extraction confusion matrix

Actual/Predicted	N-label(Predicted)	A-label(Predicted)	I-label(Predicted)
N-label(Actual)	1,847,647	9,345	11,721
A-label(Actual)	15,865	116,099	1,784
I-label(Actual)	21,639	1,688	38,933

The results show that the accuracy of the model is 96.99%, which is almost comparable to the results obtained by manual annotation.
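The accuracy can be recomputed directly from the counts in Table IV. The sketch below does so and also derives per-class recall, which is not reported in the text and is included here only as an additional view on the same counts.

```python
from typing import Dict, List

# Rows are actual labels, columns are predicted labels (counts from Table IV).
LABELS = ["N", "A", "I"]
CONFUSION = [
    [1_847_647, 9_345, 11_721],
    [15_865, 116_099, 1_784],
    [21_639, 1_688, 38_933],
]


def accuracy(matrix: List[List[int]]) -> float:
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Recall of each class: diagonal entry divided by its row sum."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}


overall = accuracy(CONFUSION)
recalls = recall_per_class(CONFUSION, LABELS)
print(f"accuracy = {overall:.4f}")
print({label: round(value, 4) for label, value in recalls.items()})
```

With these counts the overall accuracy comes out at roughly 0.97, in line with the reported figure. Since the N-label dominates the counts, the per-class recall of the two rarer labels is worth inspecting alongside the overall accuracy.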

6.3 RQ2. Ablation Experiment

To verify the usability of each module in Malsight, we conducted ablation experiments on the annotation module and the two phases of code summarization. We continued to use MalP as a test dataset, experimenting with different module combinations.

6.3.1 Module Ablation

As shown in Table V, the modules in Malsight complemented each other effectively. This ablation study was conducted on the not-stripped version of MalP to minimize result fluctuations due to the absence of dynamic annotation.

TABLE V: Ablation experiment

Model	BLEURT-sum	BLEU	ROUGE-L	METEOR
Full Model	66.29	10.95	26.97	18.80
w/o Annotation	60.05	7.65	22.76	14.93
w/o Phase 1	62.05	10.18	24.16	17.87
w/o Phase 1 and Annotation	56.48	7.35	21.99	13.99
w/o Phase 2	61.93	9.19	26.25	15.37
w/o Phase 2 and Annotation	59.13	7.42	26.72	13.32

TABLE VI: Performance on different levels of stripping

Stripping Levels	BLEURT-sum	BLEU	ROUGE-L	METEOR	AVG time (function)	AVG summary length	BLEURT-sum variance
Not-Stripped	66.29	10.95	26.97	18.80	2.28	32.85	111.39
Not-Stripped w/o Annotation	60.05	7.65	22.76	14.93	2.01	29.12	143.62
Demi-Stripped	64.01	10.42	26.03	17.62	2.44	34.29	123.30
Demi-Stripped w/o Annotation	58.04	8.27	23.36	14.25	2.12	30.92	146.90
All-Stripped	62.14	9.80	25.11	16.88	2.51	35.27	131.65
All-Stripped w/o Annotation	54.98	7.82	22.31	13.36	2.16	32.40	150.75
* AVG time (function) is measured in seconds.
* AVG summary length is measured in words.

We designed five experiments to test different combinations for the code summarization task, as follows: (1) removing the annotation module and summarizing without any annotation, labeled as “w/o Annotation”; (2) canceling Phase 1, labeled as “w/o Phase 1”; (3) canceling Phase 1 and removing the annotation module, labeled as “w/o Phase 1 and Annotation”; (4) canceling Phase 2, labeled as “w/o Phase 2”; (5) canceling Phase 2 and removing the annotation module, labeled as “w/o Phase 2 and Annotation”.

The results indicate that different degrees of ablation have varying negative impacts on system performance. “w/o Phase 1 and Annotation” suffered a data shift and thus exhibited worse performance on the malware dataset than the other combinations, with a BLEURT-sum score of 56.48. Notably, the absence of the annotation generation module (static annotation generation module) significantly affects performance, demonstrating its effectiveness.

6.3.2 Annotator vs. Stripping

We further explore the influence of different levels of stripping on the annotation module, and the results are shown in Table VI. Experiments show that a higher degree of stripping creates a tolerable performance degradation in the absence of annotation, demonstrating the robustness of Malsight. Ablating the annotation module produced a 9.41% performance degradation (measured by the BLEURT-sum score) in the not-stripped test and nearly 12% in the all-stripped test, which shows that the annotator effectively resists the adverse conditions caused by stripping.
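The degradation percentages quoted above are relative drops in BLEURT-sum and can be reproduced from Table VI. The helper name `relative_drop` is illustrative.

```python
# (with annotation, without annotation) BLEURT-sum scores from Table VI.
BLEURT_SUM = {
    "not-stripped": (66.29, 60.05),
    "demi-stripped": (64.01, 58.04),
    "all-stripped": (62.14, 54.98),
}


def relative_drop(with_annotation: float, without_annotation: float) -> float:
    """Relative degradation in percent when the annotation module is removed."""
    return (with_annotation - without_annotation) / with_annotation * 100.0


drops = {level: relative_drop(*scores) for level, scores in BLEURT_SUM.items()}
for level, drop in drops.items():
    print(f"{level}: {drop:.2f}%")
```

The drop is larger for the all-stripped version than for the not-stripped version, which is the basis for the claim that annotations matter more as more identifiers are removed.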

From the generated results, we selected relatively representative sentences, as shown in Figure 7. In the yellow box, the output from the generation module without annotations shows that the large model is highly susceptible to biases due to hallucination [59]. Based on our observations, hallucinations can cause the code summary model to make false guesses about the types of parameters and internally called functions, potentially misleading users. To mitigate this issue, we incorporate annotations, which not only enhance the model’s ability to summarize behavior, structure, and application scenarios but also reduce the nonsensical outputs caused by these hallucinations.

6.4 RQ3. Evaluation Algorithm Test

The EvaS dataset was constructed to evaluate our code summary evaluation method alongside other popular methods. As previously mentioned, we use both positive and negative samples to test the effectiveness of these evaluation methods by assessing their ability to distinguish between the two. The performance of existing methods, illustrated in Figure 8, shows varying degrees of crossover between positive and negative samples for BLEU, METEOR, word2vec, and MoverScore. These four methods represent the current approaches based on word overlap (BLEU and METEOR) and on word embeddings (word2vec and MoverScore), respectively.

In Figure 8, we have established two types of decision boundaries: one orthogonal to the X-axis (i.e., distinguishing between positive and negative samples by setting a threshold), and the other a linear function forming a slanted line. The same process is applied to BLEURT-sum in Figure 9, which shows the superiority of our method.

Specifically, we calculated an F1-score for these six methods (ROUGE-L is not shown in Figure 8) on the positive and negative sample classification task of code summary sentences to measure their ability. Our method achieved an F1-score exceeding 0.9999, significantly outperforming all other evaluation methods. Among the existing methods, METEOR performed the best with an F1-score of 0.9811, while BLEU had the lowest at only 0.85. None of the other methods achieved an F1-score above 0.95. Notably, our dataset was not meticulously curated; the negative samples were entirely random sentences with different meanings, unlike real-world scenarios where the differences in meaning might be subtler. An F1-score below 95% may therefore indicate an unacceptable tendency to misclassify. This is because, when recall is kept high, a 95% F1-score means that approximately 1 in 10 samples judged positive is incorrect. Since the metric serves as an evaluation method, such misjudgments can be further amplified in subsequent work that relies on it to assess model results.
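The threshold-based classification and the "1 in 10" reading of a 95% F1-score can be made concrete with a small sketch. The scores and labels below are toy values, not EvaS data, and the function names are illustrative. The last helper solves F1 = 2PR/(P+R) for the precision P given the recall R.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 of the rule 'predict positive when score >= threshold'."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def precision_given_f1(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, i.e. P = F1 * R / (2R - F1)."""
    return f1 * recall / (2.0 * recall - f1)


separable = f1_at_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], 0.5)
overlapping = f1_at_threshold([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0], 0.5)
implied_precision = precision_given_f1(0.95, 1.0)
print(separable, overlapping, round(implied_precision, 4))
```

With perfect recall, an F1-score of 0.95 implies a precision of 0.95/1.05, or about 0.905, so close to one in ten pairs judged positive would in fact be negative.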

6.5 RQ4. Real World Experiments

6.5.1 Human Evaluation Experiments

So far, we have evaluated Malsight’s performance at the function level. However, this approach did not allow us to fully showcase the capabilities of the dynamic annotation module. Additionally, real-world malware predominantly takes the form of executable files. Hence, we manually analyzed 10 real-world malware samples and selected 79 critical functions from them; three experienced reverse engineers then added summaries to these functions based on discussion. The complete workflow of Malsight was applied to these functions, and the outputs were evaluated using both BLEURT-sum and human evaluation metrics. We further calculated the variance of all results as a reference for the stability of our solution, as well as the time required to summarize a single malware sample.

For the human evaluation metrics, we invited ten evaluators: five highly experienced reverse engineers, three experienced ones, and two beginners. They were asked to focus on the usability of the code summarization results and to provide a score of 0, 0.5, or 1, indicating whether the summary corresponded to the original code. It is noteworthy that, due to the subjectivity of human evaluation, different evaluators may have varying opinions on the summary of the same code. Therefore, we opted to evaluate usability only and did not solicit more detailed scores beyond 0, 0.5, and 1, which correspond to unusable, partially usable, and usable, respectively.

6.5.2 Performance From Different Evaluators

After collecting the three sets of scores, we obtained the human assessment scores shown in Table VII. Experienced evaluators tend to give more conservative and stable scores, while less experienced evaluators give a wider range of scores with higher variance. Further, we take the arithmetic average of the scores given by these evaluators as the final human assessment score.

TABLE VII: Evaluation of Different Categories of Evaluators

	Rich in experience	Experienced	Beginner
Score	58.14	59.51	64.73
Variance	173.12	197.49	237.44

Table VIII shows the mean BLEURT-sum score and the mean human evaluation score on the 79 labeled functions. The human evaluation score of 59.87 means that the average function score exceeds 0.5 (on the 0 to 1 scale), i.e., the partially usable standard.
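The final human assessment score can be cross-checked against Table VII. The sketch assumes that each score in Table VII is the mean over the evaluators of that group and that the group sizes are five, three, and two as stated in Section 6.5.1; under these assumptions, averaging over all ten evaluators is a group-size-weighted mean.

```python
from typing import Dict, Tuple

# group name -> (number of evaluators, mean score of the group from Table VII)
GROUPS: Dict[str, Tuple[int, float]] = {
    "rich in experience": (5, 58.14),
    "experienced": (3, 59.51),
    "beginner": (2, 64.73),
}


def pooled_mean(groups: Dict[str, Tuple[int, float]]) -> float:
    """Mean over all evaluators, reconstructed from group sizes and group means."""
    total = sum(n for n, _ in groups.values())
    return sum(n * score for n, score in groups.values()) / total


final_score = pooled_mean(GROUPS)
print(round(final_score, 2))
```

Under these assumptions the pooled mean agrees with the 59.87 reported in Table VIII.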

TABLE VIII: Real Malware Performance

	BLEURT-sum	Human Assessment Score	AVG time (file)
Score	47.22	59.87	1.90
Variance	161.61	0.06	1.21
* AVG time (file) is measured in hours.

Further, we formed a data point by combining the BLEURT-sum score with the human evaluation score for the same function’s code summary, and plotted the distribution of the resulting 79 data points in Figure 10. It indicates that BLEURT-sum has a positive linear correlation with human evaluation metrics, which supports the rationality of BLEURT-sum. Furthermore, the data distribution indicates that, across the two distinct evaluation metrics, the majority of the evaluation scores for Malsight’s outputs are concentrated within the middle to high range, which shows that our scheme is also usable in the real world.
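A positive linear correlation of this kind is commonly quantified with the Pearson coefficient. The sketch below shows the computation on a handful of made-up points; the values are purely illustrative and are not the 79 data points of Figure 10.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Hypothetical (BLEURT-sum, human score) pairs, for illustration only.
toy_bleurt_sum = [30.0, 45.0, 50.0, 62.0, 70.0]
toy_human = [0.0, 0.5, 0.5, 1.0, 1.0]
r = pearson(toy_bleurt_sum, toy_human)
print(round(r, 3))
```

A coefficient close to 1 would indicate that the automatic metric ranks summaries much as the human evaluators do.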

7 Case Study

In Section 6.5 (RQ4), we evaluated Malsight’s performance on 10 real-world malware samples. To provide more direct evidence and to analyze why Malsight behaves differently on different functions, this section presents three functions and discusses the process behind a usable output, a partially usable output, and an unusable output.

Case 1: sub_401021.

• Malsight Summary: Set privilege, the code snippet is difficult to ascertain the function’s purpose, but it likely retrieves or obtains professional information or status.
• Human Summary: Escalating the privileges of the virus process.

Malsight’s summary of the function shown in Figure 11 is considered usable. In this case, the Malsight summary accurately identifies “setting privileges” as the key point of a lengthy function, which aligns with the task of “privilege escalation” in the human summary.

Case 2: sub_401D30.

• Malsight Summary: Convert a string to a wide character, the code defines a function that performs string operations, handles exceptional conditions, and creates new objects, potentially Get the server name, the code defines a function that opens a WinHttp connection, connects to a server, and handles error messages and string operations Get the next character from the string, the code defines a function sub_401470 that performs string operations, handles exceptional conditions, and Delete the server.
• Human Summary: Connect to a malicious domain and make a request to retrieve data using HTTP.

Malsight’s summary of the function shown in Figure 12 is considered partially usable. Malsight summary identifies that the function uses an HTTP connection to a specific server and operates on the data, which is consistent with the human summary. However, due to the lack of some background information, it fails to recognize that the specific server is malicious.

Case 3: TimerFunc.

• Malsight Summary: Get the command line, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules, Call the NtTib. ExceptionList, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Copy the setup.exe, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules Setup the setup, the code defines a function that handles or processes data related to a hexadecimal address, retrieves information about Redis commands and modules.
• Human Summary: Checking if the autorun.ini file exists; if not, create it and configure it to auto-run setup.exe as specified in the configuration file.

Malsight’s summary of the function shown in Figure 13 is considered unusable. The function is very long, contains numerous function calls, and includes many redundant functions that hinder analysis, resulting in a poor Malsight summary.

8 Discussion

8.1 Ethics

The construction of the datasets MalS and MalP complies with GitHub’s open-source license agreements. We only used repositories with GitHub open-source licenses and downloaded their code within the rate limit set by GitHub to minimize interference with GitHub’s servers. Additionally, we enlisted volunteers to review the MalS and MalP datasets, as well as the 10 malware samples mentioned in RQ4. With the volunteers’ consent, we adopted their review results as the final dataset.

8.2 Limitations

In this section, we delve into practical issues based on our analysis of the experimental results. Additionally, we explore aspects not covered in our work and propose potential solutions.

Real-World Malware vs. Malware Function Summaries: In our exploration of real-world malware, the BLEURT-sum evaluation score of the code summaries was approximately 30% lower than the experimental score obtained on MalP in Section 6.2.

The reason could be that our code summarization model lacks the ability to summarize longer functions which occasionally appear in real-world malware. Our analysis shows that about 15% of the functions in the malware contain more than 1000 tokens (while the majority of the functions in the training set stay below 300 tokens), which is likely due to different optimization configurations during compilation. These lengthy function bodies introduce a lot of information and noise, which makes it difficult for the model to extract and summarize the critical code fragments.

Therefore, to better align with real malicious code summaries, researchers should consider doing code summarization work in units of code fragments instead of functions. However, identifying and summarizing important, human-interpretable segments in them requires large amounts of labeled data, or compositional functional insights from dynamic debugging, which poses significant challenges.

Summary for assembly language rather than pseudocode: Efforts have been made to recover information from assembly code and to generate function summaries [60]. A common perspective is that decompiled assembly code suffers varying degrees of information loss during the generation of pseudocode as Intermediate Representation (IR), depending on the disassembly algorithm used. Thus, starting directly from assembly language is considered a viable solution.

In our comparison of assembly code and pseudocode representations, we found that assembly code is challenging for models pre-trained on high-level languages to understand [32]. The structural features of assembly code are completely different from those of high-level languages. Fine-tuning a model using pseudocode can confuse the model’s understanding of function-level structural features, leading to unacceptable error output, possibly due to insufficient datasets.

A possible solution is to train an embedding model specifically for assembly language and to build a large assembly-level dataset for malware. This method has the potential to outperform general-purpose large models, which typically have a poor understanding of assembly code.

9 Conclusion

In this paper, we introduced Malsight, a novel framework for binary malware summarization that explores malicious source code and benign pseudocode. The proposed Malsight involves two datasets, MalS and MalP, an LLM-based summary model, and an evaluation metric. Experimental results on three datasets show the effectiveness of the proposed framework. Future work includes applying the proposed framework to more downstream tasks.

References

[1] “Malware statistics & trends report | AV-TEST,” https://www.av-test.org/en/statistics/malware/, accessed: 2024-06-2.
[2] X. Shang, S. Cheng, G. Chen, Y. Zhang, L. Hu, X. Yu, G. Li, W. Zhang, and N. Yu, “How far have we gone in stripped binary code understanding using large language models,” arXiv preprint arXiv:2404.09836, 2024.
[3] A. Jain, S. Soner, and A. Gadwal, “Reverse engineering: Journey from code to design,” in Proc. of ICECT, 2011.
[4] F. A. Aboaoja, A. Zainal, F. A. Ghaleb, B. A. S. Al-rimy, T. A. E. Eisa, and A. A. H. Elnour, “Malware detection issues, challenges, and future directions: A survey,” Applied Sciences, vol. 12, no. 17, 2022.
[5] C. Beaman, A. Barkworth, T. D. Akande, S. Hakak, and M. K. Khan, “Ransomware: Recent advances, analysis, challenges and future research directions,” Computers & Security, vol. 111, p. 102490, 2021.
[6] M. Yao, J. Fuller, R. P. Sridhar, S. Agarwal, A. K. Sikder, and B. Saltaformaggio, “Hiding in plain sight: an empirical study of web application abuse in malware,” in Proc. of USENIX Security, 2023.
[7] D. Gibert, C. Mateu, and J. Planes, “Hydra: A multimodal deep learning framework for malware classification,” Computers & Security, vol. 95, p. 101873, 2020.
[8] R. Labaca-Castro, B. Biggio, and G. Dreo Rodosek, “Poster: Attacking malware classifiers by crafting gradient-attacks that preserve functionality,” in Proc. of CCS, 2019.
[9] A. Marcelli, M. Graziano, X. Ugarte-Pedrero, Y. Fratantonio, M. Mansouri, and D. Balzarotti, “How machine learning is solving the binary function similarity problem,” in Proc. of USENIX Security, 2022.
[10] Z. Yu, R. Cao, Q. Tang, S. Nie, J. Huang, and S. Wu, “Order matters: Semantic-aware neural networks for binary code similarity detection,” in Proc. of AAAI, 2020.
[11] Z. Liu, “Binary code similarity detection,” in Proc. of ASE, 2021.
[12] X. Deng and J. Mirkovic, “Malware analysis through high-level behavior,” in Proc. of CSET, 2018.
[13] B. Cornelissen, A. Zaidman, A. Van Deursen, L. Moonen, and R. Koschke, “A systematic survey of program comprehension through dynamic analysis,” IEEE Transactions on Software Engineering, vol. 35, no. 5, pp. 684–702, 2009.
[14] M. Kim, H. Cho, and J. H. Yi, “Large-scale analysis on anti-analysis techniques in real-world malware,” IEEE Access, pp. 75802–75815, 2022.
[15] Hex-Rays. State-of-the-art binary code analysis solutions. [Online]. Available: https://www.hex-rays.com/products/ida/
[16] NationalSecurityAgency. Ghidra is a software reverse engineering (sre) framework. [Online]. Available: https://github.com/NationalSecurityAgency/ghidra
[17] G. Sridhara, E. Hill, D. Muppaneni, L. Pollock, and K. Vijay-Shanker, “Towards automatically generating summary comments for java methods,” in Proceedings of the 25th IEEE/ACM international conference on Automated software engineering, 2010, pp. 43–52.
[18] P. W. McBurney and C. McMillan, “Automatic documentation generation via source code summarization of method context,” in Proceedings of the 22nd International Conference on Program Comprehension, 2014, pp. 279–290.
[19] A. Al-Kaswan, T. Ahmed, M. Izadi, A. A. Sawant, P. Devanbu, and A. van Deursen, “Extending source code pre-trained language models to summarise decompiled binaries,” in Proc. of SANER, 2023.
[20] J. Xiong, G. Chen, K. Chen, H. Gao, S. Cheng, and W. Zhang, “Hext5: Unified pre-training for stripped binary code information inference,” in Proc. of ASE, 2023.
[21] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” in Proc. of ICLR, 2023.
[22] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, “Wizardcoder: Empowering code large language models with evol-instruct,” arXiv preprint arXiv:2306.08568, 2023.
[23] Y. Wang, H. Le, A. Gotmare, N. Bui, J. Li, and S. Hoi, “CodeT5+: Open code large language models for code understanding and generation,” in Proc. of EMNLP, 2023.
[24] A. Mastropaolo, M. Ciniselli, M. Di Penta, and G. Bavota, “Evaluating code summarization techniques: A new metric and an empirical characterization,” in Proc. of ICSE, 2024.
[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proc. of ACL, 2002.
[26] A. Lavie and M. Denkowski, “The meteor metric for automatic evaluation of machine translation,” Machine Translation, vol. 23, pp. 105–115, 2009.
[27] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proc. of ACL, 2004.
[28] Fortinet. What is malware analysis? types and stages of malware analysis. [Online]. Available: https://www.fortinet.com/resources/cyberglossary/malware-analysis
[29] K. Yakdan, S. Dechand, E. Gerhards-Padilla, and M. Smith, “Helping johnny to analyze malware: A usability-optimized decompiler and malware analysis user study,” in Proc. of SP, 2016.
[30] Hex-Rays. Decompilation vs disassembly. [Online]. Available: https://hex-rays.com/decompiler/decompilation_vs_disassembly/
[31] R. Team. Radare2: a reverse engineering framework. [Online]. Available: https://github.com/radareorg/radare2
[32] H. Tan, Q. Luo, J. Li, and Y. Zhang, “Llm4decompile: Decompiling binary code with large language models,” arXiv preprint arXiv:2403.05286, 2024.
[33] K. Pal, A. Bajaj, P. Banerjee, A. Dutcher, M. Nakamura, Z. Basque, H. Gupta, S. Sawant, U. Anantheswaran, Y. Shoshitaishvili, A. Doupe, C. Baral, and R. Wang, “‘Len or index or count, anything but v1’: Predicting variable names in decompilation output with transfer learning,” in Proc. of SP, 2024.
[34] Microsoft Corporation. Main() and command-line arguments - C#. [Online]. Available: https://learn.microsoft.com/en-us/dotnet/csharp/fundamentals/program-structure/main-command-line
[35] E. C. R. Shin, D. Song, and R. Moazzezi, “Recognizing functions in binaries with neural networks,” in Proc. of USENIX Security, 2015.
[36] D. M. Berris, A. Veitch, N. Heintze, E. Anderson, and N. Wang, “Xray: A function call tracing system,” Technical report (a white paper on XRay, a function call tracing system developed at Google), 2016.
[37] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. of NAACL-HLT, 2019.
[38] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[39] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Proc. of EMNLP, 2020.
[40] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Proc. of EMNLP, 2021.
[41] J. Jiang, Y. Shu, J. Wang, and M. Long, “Transferability in deep learning: A survey,” 2022.
[42] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” in Proc. of ICPC, 2022.
[43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[44] W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger, “MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance,” in Proc. of EMNLP-IJCNLP, 2019.
[45] M. J. Kusner, Y. Sun, N. I. Kolkin, and K. Q. Weinberger, “From word embeddings to document distances,” in Proc. of ICML, 2015.
[46] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” CoRR, vol. abs/1802.05365, 2018.
[47] T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning robust metrics for text generation,” in Proc. of ACL, 2020.
[48] L. Yang, A. Ciptadi, I. Laziuk, A. Ahmadzadeh, and G. Wang, “Bodmas: An open dataset for learning based temporal analysis of pe malware,” in Proc. of DLS, 2021.
[49] M. O. F. Rokon, R. Islam, A. Darki, E. E. Papalexakis, and M. Faloutsos, “SourceFinder: Finding malware Source-Code from publicly available repositories in GitHub,” in Proc. of RAID, 2020.
[50] W. Zhu, Z. Feng, Z. Zhang, J. Chen, Z. Ou, M. Yang, and C. Zhang, “Callee: Recovering call graphs for binaries with transfer and contrastive learning,” in Proc. of SP, 2023.
[51] R. Tarjan, “Depth-first search and linear graph algorithms,” in Proc. of SWAT, 1971.
[52] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numer. Math., vol. 1, no. 1, pp. 269–271, 1959.
[53] GitHub. Github code search. [Online]. Available: https://docs.github.com/en/github/searching-for-information-on-github/searching-code
[54] ——. The technology behind github’s new code search. [Online]. Available: https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/
[55] T. Ye, L. Wu, T. Ma, X. Zhang, Y. Du, P. Liu, S. Ji, and W. Wang, “CP-BCS: Binary code summarization guided by control flow graph and pseudo code,” in Proc. of EMNLP, 2023.
[56] OpenAI. ChatGPT: Optimizing Language Models for Dialogue. [Online]. Available: https://chat.openai.com/
[57] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, Q. Lin, and D. Jiang, “WizardLM: Empowering large pre-trained language models to follow complex instructions,” in Proc. of ICLR, 2024.
[58] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2024.
[59] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.
[60] X. Li, Y. Qu, and H. Yin, “Palmtree: Learning an assembly language model for instruction embedding,” in Proc. of CCS, 2021.

Appendix A Analysis of Traditional Similarity Evaluation Metric

The criteria for evaluating sentence similarity should focus on two key aspects: similar sentences should correspond to higher scores, while dissimilar sentences should correspond to lower scores. Below, we introduce some of the shortcomings of BLEU and other traditional metrics (ROUGE, METEOR) in these aspects.

A.1 Limitations of Overlap Based Methods

Problems caused by the BLEU algorithm: BLEU computes sentence similarity as shown in Equation 17:

$$BLEU=BP\cdot\exp\left(\sum\limits_{n=1}^{N}w_{n}\ln(Precision_{n})\right) \tag{17}$$

where $BP$ is the brevity penalty factor, $w_{n}$ is the weight of the $n$-gram, and $Precision_{n}$ is the $n$-gram precision of the generated candidate sentence. Specifically, Equation 18 shows the construction of $Precision_{n}$, where $len(candidate\&reference)$ is the number of overlapping occurrences between the candidate sentence and the reference sentence.

$$Precision_{n}=\frac{len(candidate\&reference)+1}{len(candidate)+1} \tag{18}$$

In this context, the Add-One Smoothing method is used to avoid zero-count problems when the n-gram size is large. However, this introduces another issue: even if the candidate sentence and reference sentence are completely unrelated, this smoothing method still produces a certain score. This effect is particularly noticeable in the case of short sentences.

We conducted experimental tests for this scenario, testing each reference-candidate pair within the sentence length range of [1,30]. Each sentence pair had zero word overlap, and our results are shown in Figure 14.

Even when sentence pairs are completely mismatched, BLEU scores greater than 0.3 can occur for shorter sentences. This significant deviation from reality indicates that BLEU’s scoring is distorted for short sentences in some cases.
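The effect can be reproduced with a minimal sketch of Equations 17 and 18. The settings below are assumptions rather than the configuration behind Figure 14: uniform weights $w_{n}=1/4$ with $N=4$, the standard brevity penalty, equal-length sentence pairs, and the hypothetical helper names `smoothed_precision`, `bleu`, and `zero_overlap_scores`.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def smoothed_precision(candidate, reference, n):
    """Add-one smoothed n-gram precision, following Equation 18."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum((cand_counts & ref_counts).values())  # clipped matches
    total = sum(cand_counts.values())
    return (overlap + 1) / (total + 1)


def bleu(candidate, reference, max_n=4):
    """BLEU as in Equation 17, assuming uniform weights w_n = 1 / max_n."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # standard brevity penalty
    log_sum = sum(
        math.log(smoothed_precision(candidate, reference, n)) / max_n
        for n in range(1, max_n + 1)
    )
    return bp * math.exp(log_sum)


def zero_overlap_scores(max_len=30):
    """Score equal-length sentence pairs that share no word at all."""
    scores = {}
    for length in range(1, max_len + 1):
        candidate = [f"c{i}" for i in range(length)]
        reference = [f"r{i}" for i in range(length)]
        scores[length] = bleu(candidate, reference)
    return scores


if __name__ == "__main__":
    for length, score in zero_overlap_scores().items():
        print(length, round(score, 4))
```

Under these assumptions every precision term of a zero-overlap pair reduces to $1/(len(candidate)+1)$, so the score is never zero, and in this sketch pairs of up to four words still score above 0.3.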

ROUGE and METEOR, the flaw of computing similarity with words as the basic unit: ROUGE and METEOR construct their sentence similarity scores in a similar way. ROUGE-L follows Equation 19:

$$ROUGE\text{-}L=F_{LCS}=\frac{(1+\beta^{2})R_{LCS}P_{LCS}}{R_{LCS}+\beta^{2}P_{LCS}} \tag{19}$$

where $R_{LCS}=\frac{LCS(C,S)}{len(S)}$, $P_{LCS}=\frac{LCS(C,S)}{len(C)}$, and $\beta$ weights recall relative to precision. $LCS(C,S)$ is the length of the longest common subsequence of the two word sequences $C$ and $S$. Intuitively, the more words two sentences share, the higher their ROUGE score.
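As an illustration of Equation 19, the following sketch computes ROUGE-L over whitespace-separated tokens. The function names `lcs_length` and `rouge_l` and the default $\beta=1$ are assumptions made for this example; the two sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score of Equation 19 for two token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)


if __name__ == "__main__":
    cand = "the function compares two strings".split()
    ref = "the function sends data to a server".split()
    print(rouge_l(cand, ref))
```

The two example summaries describe unrelated behaviors, yet they share the prefix "the function" and therefore receive a non-zero score. This is the kind of word-level coincidence the section refers to.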

METEOR builds on ROUGE and follows Equations 20, 21, and 22:

$$F_{mean}=\frac{(1+\beta^{2})PR}{R+\beta P} \tag{20}$$

$$Penalty=\gamma\left(\frac{chunks}{unigrams\_matched}\right)^{\theta} \tag{21}$$

$$METEOR=F_{mean}(1-Penalty) \tag{22}$$

where $P=\frac{n}{len(candidate)}$, $R=\frac{n}{len(reference)}$, and $n$ is the number of words that overlap between the candidate sentence and the reference sentence. METEOR employs exact matching, stem matching, and WordNet-based synonym matching to address the issue of words with identical meanings not being recognized as overlaps. Consequently, METEOR outperformed ROUGE in our experiments, ranking second only to BLEURT-sum. However, algorithms that rely solely on word overlap can still misjudge similarity in two ways. Unrelated sentences may share structural terms; in code summarization, for instance, sentences often contain words like “function”, “aims”, or “code”. Conversely, sentences may convey the same idea using different wording, such as “compare two sentences” versus “bitwise and return true/false.”

Interestingly, we found that expanding the stopword list appropriately can enhance the performance of word overlap-based methods, especially METEOR.

A.2 Limitations of Embedding Based Methods

In recent years, using word embedding-based methods such as word2vec to determine sentence similarity has become quite popular. For instance, the pseudocode for the word2vec process is illustrated in Algorithm 1.

Algorithm 1 word2vec

Input: the reference sentence $ref$ and the candidate sentence $cad$
Output: the similarity $score$ between $ref$ and $cad$

function Main($cad$, $ref$)
    cad.vector = getVector($cad$)
    ref.vector = getVector($ref$)
    $\text{score}=\frac{\text{cad}\cdot\text{ref}}{||\text{cad}||\cdot||\text{ref}||}$
    return score
end function

function getVector($sentence$)
    words = tokenize(sentence)
    sentence.vector = [0,0,..]
    for $\text{word}\in\text{words}$ do
        sentence.vector += word.vector
    end for
    return sentence.vector
end function

When calculating the similarity between two sentences, word2vec adds the vectors of each word based on the tokenized results and uses the summed vector as the sentence representation. Finally, cosine similarity is employed to determine the similarity between the sentences. A direct drawback of this approach is that it loses all word order information and heavily depends on the accuracy of the embedding algorithm.
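The loss of word order can be seen in a small sketch of Algorithm 1. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration and do not come from a trained word2vec model; the names `get_vector`, `cosine`, and `similarity` are likewise assumptions.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for this illustration only.
TOY_VECTORS = {
    "server": (0.9, 0.1, 0.3),
    "sends": (0.2, 0.8, 0.5),
    "key": (0.4, 0.3, 0.9),
    "to": (0.1, 0.1, 0.1),
    "client": (0.7, 0.2, 0.6),
}


def get_vector(sentence):
    """Sum the word vectors of a sentence, as getVector() does in Algorithm 1."""
    total = [0.0, 0.0, 0.0]
    for word in sentence.lower().split():
        for i, value in enumerate(TOY_VECTORS[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(cad, ref):
    """Sentence similarity as computed by Main() in Algorithm 1."""
    return cosine(get_vector(cad), get_vector(ref))


if __name__ == "__main__":
    print(similarity("server sends key to client", "client sends key to server"))
```

Because summation is order-independent, the two example sentences map to the same sentence vector (up to floating-point rounding) and are scored as essentially identical, although the direction of the data flow is reversed.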

Appendix B Reverse CFG Sorting Algorithm

Algorithm 2 REsort

Input: the CFG graph $G_{M}$ for a malware binary file.
Output: the reverse topsort list $L_{G_{M}}$ of the function call graph.

function REsort($G_{M}$)
    for each vertex $v\in G_{M}.vertices$ do
        if $!v.seen$ then
            new $G_{M}^{\prime}$
            $G_{M}^{\prime}.vertices\leftarrow$ call Tarjan($v$, $G_{M}$)
            // Tarjan() traverses all strongly connected components in $G_{M}$ and contracts vertices as $G_{M}^{\prime}.vertices$
        end if
    end for
    $G_{M}^{\prime}\leftarrow$ call BuildTarGraph($G_{M}^{\prime}.vertices$, $G_{M}$)
    // BuildTarGraph() builds a new directed acyclic graph $G_{M}^{\prime}$ using $G_{M}^{\prime}.vertices$ and $G_{M}.edges$
    $L_{G_{M}^{\prime}}\leftarrow$ call RetopSort($G_{M}^{\prime}$)
    // RetopSort() obtains the reverse topological sorting sequence of $G_{M}^{\prime}$
    $dist[]\leftarrow$ call Dijkstra($G_{M}$)
    // Dijkstra() computes the multi-source shortest paths in graph $G_{M}$ from vertices with an in-degree of 0 to each vertex
    $L_{G_{M}}\leftarrow[]$
    for each vertex $v^{\prime}\in L_{G_{M}^{\prime}}$ do
        $mindist\leftarrow\infty$, $idx\leftarrow-1$
        // $idx$ determines the starting vertex for DFS in each strongly connected component in $G_{M}$
        for each $ver\in v^{\prime}.subvertex$ do
            if $dist[ver]<mindist$ then
                $mindist\leftarrow dist[ver]$, $idx\leftarrow ver$
            end if
        end for
        $L_{G_{M}}$.append(DFS($idx$, $v^{\prime}$))
        // DFS() derives a traversal order in each strongly connected component in $G_{M}$ as part of the final reverse topological sorting
    end for
    return $L_{G_{M}}$
end function

TABLE IX: Performance under different training data proportions

Proportion	BLEURT-sum	BLEU	ROUGE-L	METEOR	AVG summary length	BLEURT-sum variance
1:1	59.12	8.98	23.04	16.14	39.83	181.01
1:2	59.81	9.15	23.50	16.42	40.75	165.51
3:4	59.53	8.68	22.95	16.39	42.21	173.39
4:3	62.14	9.80	25.11	16.88	35.27	131.65
* Due to the limited size of MalS, the model cannot be fine-tuned with a training data proportion of 2:1.
* AVG summary length is measured in words.

Algorithm 2 presents our function topological sorting algorithm, with the objective of constructing the reverse topological sorting of the function call graph $G_{M}$ . Due to the presence of multi-cycles (i.e., strongly connected components) in $G_{M}$ , it is not feasible to perform reverse topological sorting directly. To address this, we utilize the Tarjan algorithm to contract $G_{M}$ into a new directed acyclic graph $G_{M}^{\prime}$ , where reverse topological sorting can be applied to obtain an initial function traversal list $L_{G_{M}^{\prime}}$ , with each node representing a strongly connected component in $G_{M}$ .

Further, to extend the ordering to $G_{M}$ itself, we use the DFS algorithm to order the nodes within each strongly connected component. Since our aim is to reverse-sort the entire graph $G_{M}$, with functions closer to the start function placed towards the end of $L_{G_{M}}$, we use the Dijkstra algorithm to compute the multi-source shortest paths $dist[]$ from all nodes with zero in-degree to the other nodes (“multi-source” because the function call graph $G_{M}$ may not be connected). We then traverse $L_{G_{M}^{\prime}}$ sequentially and, for each strongly connected component, select the node with the minimum $dist$ value as the initial node $idx$ for DFS. Because DFS finishes its starting node last (the inherent stack property of DFS), $idx$ is placed at the end of the segment that the component appends to $L_{G_{M}}$. Finally, $L_{G_{M}}$ represents the desired function traversal order.
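The procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 2 rather than the Malsight implementation, and it makes three assumptions: the call graph is a dictionary mapping every function name to its list of callees; every edge has unit weight, so a multi-source breadth-first search stands in for Dijkstra; and the explicit BuildTarGraph and RetopSort steps are folded into Tarjan's algorithm, which already emits strongly connected components in reverse topological order of the contracted graph. All function names are hypothetical.

```python
from collections import deque


def tarjan_sccs(graph):
    """Return the SCCs of `graph`, emitted callees-first (reverse topological)."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def entry_distances(graph):
    """Multi-source BFS from all zero in-degree functions (unit edge weights)."""
    has_caller = {w for callees in graph.values() for w in callees}
    dist = {v: float("inf") for v in graph}
    queue = deque(v for v in graph if v not in has_caller)
    for v in queue:
        dist[v] = 0
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if dist[w] == float("inf"):
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def post_order(graph, start, members):
    """DFS restricted to one component; `start` is emitted last."""
    seen, order = {start}, []

    def dfs(v):
        for w in graph[v]:
            if w in members and w not in seen:
                seen.add(w)
                dfs(w)
        order.append(v)

    dfs(start)
    return order


def resort(graph):
    """Return a callee-first ordering of all functions in `graph`."""
    dist = entry_distances(graph)
    result = []
    for component in tarjan_sccs(graph):
        start = min(component, key=lambda v: dist[v])  # plays the role of idx
        result.extend(post_order(graph, start, set(component)))
    return result


if __name__ == "__main__":
    call_graph = {
        "main": ["parse", "loop_a"],
        "parse": ["log"],
        "loop_a": ["loop_b"],
        "loop_b": ["loop_a", "log"],
        "log": [],
    }
    print(resort(call_graph))
```

In the example graph, `loop_a` and `loop_b` call each other and form one strongly connected component. `loop_a` is closer to the entry point `main`, so it is chosen as the DFS start and lands after `loop_b` in the final order, while `main` comes last overall.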

Appendix C Performance under Different Proportions of Two-Phase Fine-Tuning Training Data

We sampled a total of 140,000 data items in different proportions from the two datasets (MalS and BenignC) for the two phases of fine-tuning. The evaluation results for each training data proportion are shown in Table IX. For the experiments in the main text, we use the 4:3 ratio, which performs best, as the actual training set ratio of Malsight.
