License: CC BY 4.0
arXiv:2308.00288v2 [cs.CR] 18 Jan 2024

VulMatch: Binary-level Vulnerability Detection Through Signature

Zian Liu1,4, Lei Pan2, Chao Chen3, Ejaz Ahmed4, Shigang Liu1, Jun Zhang1, Dongxi Liu4



1Swinburne University of Technology
{zianliu, shigangliu, junzhang}@swin.edu.au
2Deakin University
{l.pan}@deakin.edu.au
3Royal Melbourne Institute of Technology
{chao.chen}@rmit.edu.au
4Data61, CSIRO
{ejaz.ahmed, dongxi.liu}@data61.csiro.au
Abstract

Similar vulnerabilities recur across real-world software products because of code reuse, especially in widely reused third-party code and libraries. Detecting recurring vulnerabilities, such as 1-day and N-day vulnerabilities, is an important cyber security task. Unfortunately, state-of-the-art methods suffer from poor performance because they detect patch existence instead of vulnerability existence and infer the vulnerability signature directly from binary code. In this paper, we propose VulMatch, which extracts precise vulnerability-related binary instructions to generate vulnerability-related signatures. VulMatch detects vulnerability existence based on binary signatures. Unlike previous approaches, VulMatch accurately locates vulnerability-related instructions by utilizing both source and binary code. Our experiments were conducted using over 1,000 vulnerable instances across seven open-source projects. VulMatch significantly outperformed the baseline tools Asm2vec and Palmtree. Besides its performance advantages, VulMatch offers a better feature by providing explainable evidence during vulnerability detection. Our empirical studies demonstrate that VulMatch detects fine-grained vulnerabilities that state-of-the-art tools struggle with. Our experiment on commercial firmware demonstrates that VulMatch is able to find vulnerabilities in real-world scenarios.

Index Terms:
vulnerability detection, software patch, source code, binary code, code signature

I Introduction

Finding vulnerabilities or bugs in software is vital to improving its quality. Vulnerabilities and bugs tend to be inherited by new software products because up-to-date patches are applied sluggishly. One vulnerability detection paradigm is to learn from existing vulnerable code to find similar vulnerabilities and bugs. Human security analysts can study hundreds or thousands of existing vulnerabilities to gain experience and improve their security awareness, and then manually find vulnerabilities or bugs by reviewing code [1]. However, with an extremely large amount of code to review, there is an urgent call for automated methods that identify vulnerable code directly or filter out potentially vulnerable code for human experts to review later. Moreover, automatic vulnerability detection is in high demand because replicated vulnerabilities spread through code reuse, a common practice in the software industry [2, 3]. Detecting 1-day or N-day vulnerabilities in binary code is vital because source code is unavailable in many real-world scenarios. This paper’s research question is how to effectively and efficiently find similar vulnerabilities or bugs from existing ones.

Automated detection methods have great advantages over manual analysis because binary code is notoriously difficult for humans to read and understand. Mainstream research consists of three genres: binary code similarity detection, patch existence detection, and vulnerability signature detection.

  • Binary Code Similarity Detection. Given a set of query binary samples, code similarity detection tools [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39] identify the best matching code snippets stored in a database of known vulnerable binary code. Code-similarity-based vulnerability detection finds vulnerable binary code but introduces excessive false positives because patched binary code usually also has high similarity scores. Furthermore, the similarity-based method is coarse-grained: it only outputs similar binary snippets at a large scale (e.g., function level). Since function-level binary code is usually large while a vulnerability commonly relates to only a few instructions, these tools cannot explain specifically which instructions indicate the vulnerability.

  • Patch Existence Detection. This genre of work [40, 41, 4, 42] determines whether a patch exists in a query binary function by extracting patch code signatures and detecting their existence in the query function. However, it usually targets kernel binaries with debugging symbols, such as function names, which are used to filter the query function before detecting patch signatures. Moreover, this genre of work fails to address vulnerability existence because the absence of a patch does not imply the presence of the vulnerability. In the National Vulnerability Database (NVD), some Common Vulnerabilities and Exposures (CVEs) affect only a consecutive range of versions, so the versions preceding the vulnerable range contain neither the vulnerability nor the patch. For instance, if a project has ten versions and only the third through sixth are vulnerable, its first two versions lack the patch yet are not vulnerable.

  • Vulnerability Signature Detection. This genre of work [43, 44, 45] detects fine-grained vulnerability-related signatures in binary code. Existing works extract the differing instructions between two binary reference versions (i.e., a vulnerable version and a patched version), then normalize the instructions and generate traces (i.e., blocks of normalized instructions) to form the vulnerability or patch signature. However, extracting signatures directly at the binary level can introduce instructions irrelevant to the vulnerability, because the compiler replaces instructions with semantically equivalent ones and inlines functions. For example, the source code line bool fromfile=FALSE; could be compiled to mov [rsp+48h+var_39], 0 in one version and xor r14d, r14d in another, even when using the same optimization flags (options): the same variable is stored on the stack in the former version and in the register r14d in the latter. Likewise, if both versions inline another non-vulnerable function and that inlined function changed in the patched version, directly diffing the binaries will include the changed instructions of the inlined function. We reproduced these methods to understand such inaccurate cases and manually analyzed the corresponding output binary signatures; we found that they introduce approximately 40% vulnerability-irrelevant instructions into the signatures.

We propose a novel approach to generate accurate and fine-grained vulnerability-related signatures to address those research gaps. Firstly, we spent significant manual effort pre-processing the data to include all the CVEs’ information: each vulnerable function, the source code file it lies in, the affected versions, and the corresponding source code versions. To generate accurate and fine-grained binary code signatures, we generate source-code-level signatures and align them to binary-level signatures with the help of debugging information. Unlike existing work [43, 44, 45], which directly diffs different binary versions to extract vulnerability signatures, we utilize source code to guide us in locating vulnerable binary code more accurately; hence, we exclude much vulnerability-irrelevant binary content. To utilize non-trivial source-code information, VulMatch processes the source code to prevent vulnerable functions from being inlined, and it locates binaries from source code by handling the different situations described in subsection III-B. Note that the source code is only required for generating the signature, not for matching a given binary: VulMatch aims to find vulnerabilities in the query binary, which need not contain any debugging information or source code. We combine the source code, binary code, and debugging information to learn the signature accurately. To match the binary-level signatures, we propose three signature types (i.e., add, delete, and change). To match the existence of fine-grained binary signatures rather than whole-function-level similarity as the similarity-based genre does, we enrich the binary signature with local control-flow information, i.e., the context instructions, which gives the signature unique features. To assist humans in understanding VulMatch’s decisions, a user interface interprets the matched binary signatures, showing the matched signatures and the match score in the binary.

To evaluate the utility of VulMatch, we prepared seven popular open-source projects with well-documented vulnerability information, totaling 906 CVEs and 1281 vulnerable functions. Our results demonstrate that VulMatch outperforms two state-of-the-art vulnerability detection tools, Asm2vec and Palmtree, with approximately 9% and 6% higher top-1 scores and 80% and 79% lower mismatch scores, respectively. We also demonstrate how VulMatch assists humans in understanding its detection results in terms of interpretability. We experiment with commercial firmware to demonstrate that VulMatch is practical for finding real-world vulnerabilities. Finally, we perform in-depth research on the vulnerability and signature types and their distribution in the dataset.

This paper makes the following contributions:

  • We propose a novel approach to extract, store, and match vulnerability-related signatures. We have implemented the approach in a tool called VulMatch that is open-source and publicly accessible on GitHub (https://github.com/Vulmatch/Vulmatch.git).

  • To help humans understand VulMatch’s results and the reasons behind its decision on whether the query binary contains a vulnerability, we provide interpretability functionality in VulMatch.

  • We perform in-depth analysis on vulnerability and signature types and their distribution across all datasets. We inspect each dataset’s top three vulnerability types and different signature types with the average signature size.

II A Motivating Example

Terms definition: Block refers to a set of consecutive binary (assembly) instructions delimited by control-flow-related instructions (e.g., jump instructions). An assembly function consists of various blocks connected to each other; together, the connected blocks represent the assembly code’s control-flow graph (CFG), as shown in Figure 2. Note that in this paper, we use the terms ‘binary code’ and ‘assembly code’ interchangeably. CVE refers to the vulnerability in the function; one CVE may correspond to multiple vulnerability-related instructions and multiple signatures.

Figure 1: An example vulnerable function tftp_connect selected from CVE-2019-5482. (a) lists pre-patch source code, and (b) lists post-patch source code. Green lines are the patched source lines. Other lines remain intact across the two versions.
Figure 2: Corresponding binary code CFG of function tftp_connect presented in Figure 1. (a) refers to pre-patch version, and (b) refers to post-patch version. Block 1’ is a modified block and blocks 3’, 4’, 5’, and 6’ are added blocks. Other blocks remain intact.

Figure 1 shows the source code snippets of the vulnerable function tftp_connect from CVE-2019-5482 before and after the patch, where green lines are the patched instructions. Figure 2 shows the corresponding binary code structure: Figure 2(a) is the vulnerable version (before the patch), and Figure 2(b) is the patched version. The binary code samples were built from the source code snippets using an identical compilation configuration with additional debugging information. Since the patch in this example comprises two kinds of changes, added and modified instructions, they are shown in different colors. Specifically, block 1’ is a modified block of instructions, and blocks 3’, 4’, 5’, and 6’ are added blocks of instructions. Other blocks remain intact.

Similarity-based lines of work compare whole-function similarity and flag a function as potentially vulnerable if it is similar to a known vulnerable function. They focus on whole-function similarity rather than vulnerability-related instructions, resulting in poor granularity. Furthermore, they fail to distinguish the vulnerable from the patched function since the two are regarded as similar. Patch-detection lines of work first use similarity techniques to filter potentially similar functions; they assume a similar function can be selected by name to detect the existence of the patch, where the binaries are Linux kernel binaries. However, as mentioned in section I, the absence of a patch does not necessarily mean that the binary is vulnerable. Existing binary signature-based methods directly diff the vulnerable and the patched binary versions and assume the differing binary instructions are all vulnerability-relevant. However, we reproduced their methods with a manual analysis of the results and found that up to 40% of the included instructions were vulnerability-irrelevant.

After manually inspecting the source code snippets and the corresponding binary samples, we found that only blocks 3’ and 6’ are actual patched blocks corresponding to the green lines in Figure 1. The other changed blocks (i.e., blocks 1’, 4’, and 5’) do not align with any changed source lines; they map to unchanged source code lines. The changes in blocks 1’, 4’, and 5’ were due to the compiler replacing instructions with semantically equivalent ones, an indirect effect of the patched instructions. Existing work [43, 45, 44] fails to recognize these blocks as semantically unchanged code. To rectify this issue, VulMatch generates and matches the vulnerable signature under the guidance of the source code. We introduce the three steps of VulMatch as follows:

Step 1: Locating Signature Instructions: We use the diff tool to measure source-code-level differences. Diff detects and outputs a list of changed sites, added sites, and deleted sites. In the example shown in Figure 2, diff scans the source code in Figure 1 and outputs one added site. Subsequently, we use the debugging information in the binary code (i.e., the source-to-binary line mapping) to locate the patched binary lines, as sketched below. In diff’s output, a changed site contains the source code lines of both the pre-patch and post-patch versions. However, the diff output for an added site only contains the added source lines in the post-patch version (e.g., green lines in Figure 1(b)); since the added instructions do not exist in the pre-patch version, diff outputs nothing for it. Therefore, for the add type signature, an additional process takes place in step 2 to find vulnerability-related instructions in the pre-patch version.
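The following sketch illustrates the source-to-binary line mapping using the pyelftools library; it is not VulMatch’s implementation, the file names are hypothetical, and it assumes the binary carries DWARF debugging information (compiled with -g):

```python
# Sketch: build a {(source_file, line): [addresses]} map from the DWARF
# line table of a binary compiled with -g (pyelftools; illustrative only).
from collections import defaultdict
from elftools.elf.elffile import ELFFile

def line_to_addresses(binary_path):
    mapping = defaultdict(list)
    with open(binary_path, 'rb') as f:
        dwarf = ELFFile(f).get_dwarf_info()
        for cu in dwarf.iter_CUs():
            lineprog = dwarf.line_program_for_CU(cu)
            files = lineprog.header['file_entry']
            for entry in lineprog.get_entries():
                state = entry.state
                if state is None or state.end_sequence:
                    continue  # rows without a program state carry no address
                # file indices are 1-based in DWARF v4 and earlier
                fname = files[state.file - 1].name.decode()
                mapping[(fname, state.line)].append(state.address)
    return mapping

# e.g., addresses emitted for line 1230 of tftp.c (hypothetical binary):
# addrs = line_to_addresses('tftp_connect.bin')[('tftp.c', 1230)]
```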

Figure 3: VulMatch consists of four steps: Data Preparation, Locating Signature Instructions, Constructing Context-aware Binary-level Signatures, and Signature Matching. Src is short for source code. Bin is short for binary code. Insn is short for instruction.

Step 2: Constructing Binary-level Signatures: In this step, we use the binary instructions located in step 1 to construct the binary signatures, and we store both vulnerable and patched signatures in the database. For the add type, we still need to generate a vulnerable binary signature even though diff outputs nothing at the source-code level for the pre-patch version. We cannot directly treat the absence of the added (patched) instructions as implying vulnerability, because an arbitrary other function may also lack the added instructions without being vulnerable.

Therefore, we need to use the vulnerability-related instructions in Figure 2(a) to construct a vulnerability signature. Note that the added instructions are inserted between block 2 and block 7. Therefore, in the vulnerable version Figure 2(a), the control flow from block 2 to block 7 implies the vulnerability. In the patched version Figure 2(b), the control flow from block 2’ to block 3’ and from block 3’ to block 6’ implies patch existence. Therefore, we store the control flow from block 2 to block 7 as the vulnerable signature. Additionally, we store the control flow from block 2’ to block 3’ as the patch signature.

Step 3: Matching Signatures: For a query (unknown) binary, we check whether it contains the vulnerability of CVE-2019-5482 using the signatures stored in the database. If multiple signatures are stored for one CVE, we check each signature and aggregate an overall score. For a changed or deleted signature, we compute the percentage of matched vulnerability instructions in the query binary. For example, if the changed signature block contains 5 instructions and 3 of them exist in some block of the query function, then the score of the changed signature is 3/5 = 0.6. For an added signature, we check the existence of the stored control flow (e.g., from block 2 to block 7 in Figure 2(a)) and count how many matched instructions exist in the query function for each control flow. If there are 10 instructions in blocks 2 and 7, and we find a similar control flow in the query function with 8 instructions matched, then the score is 8/10 = 0.8. However, if we detect the existence of the patch signature in the query function, we directly consider the query function patched and output that signature’s score as 0. Finally, we average all the signature scores according to their weights (instruction sizes) to derive the overall score.
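A minimal sketch (ours, for illustration) of this aggregation: each signature scores matched/total instructions, a matched patch signature forces that signature’s score to 0, and the overall score is the instruction-size-weighted average:

```python
# Minimal sketch of the score aggregation described above (illustrative).
def signature_score(matched, total, patch_found=False):
    # a detected patch signature forces this signature's score to 0
    return 0.0 if patch_found else matched / total

def overall_score(sigs):
    """sigs: list of (matched_insns, total_insns, patch_found) tuples,
    one per stored signature; the weights are the instruction sizes."""
    weight_sum = sum(total for _, total, _ in sigs)
    return sum(signature_score(m, t, p) * t for m, t, p in sigs) / weight_sum

# Walkthrough numbers: changed signature 3/5 = 0.6, add signature 8/10 = 0.8
print(overall_score([(3, 5, False), (8, 10, False)]))  # (3+8)/15 ≈ 0.733
```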

To summarise, the inputs of our proposed method for producing the vulnerable binary signatures are: 1) CVE information, including the last vulnerable version, the first patched version, and the vulnerable function name; and 2) source code of the different versions. In the query phase, the input can be an unknown binary without debugging information or source code, and the output is a list of potentially matched CVEs with similarity scores. Compared to existing methods, VulMatch yields more accurate binary signatures with fewer vulnerability-irrelevant instructions. VulMatch accurately pinpoints the vulnerable sites in the query binary rather than only reporting similar code, and it accurately matches the truly vulnerable binary code with fewer false positives among several similar binary code snippets.

III Methodology

This section presents the design of VulMatch. VulMatch’s four components are shown in Figure 3.

III-A Data Preparation

We collect many well-studied vulnerabilities from several publicly available open-source projects to build the vulnerability database. According to [44], vulnerabilities tend to be fixed in new software releases; thus, the vulnerability-related versions consist of the last pre-patch version and the first post-patch version, which are later used to extract the signatures. We download all the vulnerability-related versions for each project and record each CVE’s information. Specifically, for each CVE, we record its vulnerable source code file names and the vulnerability-related functions within them. We also record each CVE’s affected versions for later preparing testing binaries for evaluation.

Challenges: Not all vulnerability-related functions exist in the compiled binary code due to automatic function inlining, i.e., merging a function FuncA into another function FuncB that calls FuncA. If vulnerable functions are inlined, locating them in the binary code is challenging; this can happen even if we manually turn off the function-inlining option during compilation. Hence, generating binary signatures becomes difficult.

Solution: We need to ensure that no vulnerable function is inlined in the compiled binaries. VulMatch automatically analyzes the source code files and edits the functions in them to inform the compiler not to inline them. Technically, VulMatch inserts a no-inline tag __attribute__((noinline)) before each vulnerable function in all related versions to preserve the tagged functions in the compiled binary code. For each CVE, VulMatch loads the CVE’s information to retrieve its vulnerable source code files along with the corresponding vulnerable functions. Then, for each related version (i.e., the last pre-patch version and the first post-patch version), VulMatch analyzes the vulnerable source code file to locate the vulnerable functions and automatically inserts no-inline tags. Finally, we compile these versions into binaries with the same default compilation options.
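As an illustration, a simplified sketch of the tag insertion follows; the regex is a deliberate simplification (real C definitions involving macros or function pointers would need a proper parser), and the snippet reuses the function name from the motivating example:

```python
import re

def insert_noinline(source: str, func_name: str) -> str:
    """Prepend __attribute__((noinline)) to the definition of func_name
    (simplified sketch; assumes the definition starts at a line beginning)."""
    pattern = re.compile(r'^(?=[\w \t\*]+\b%s\s*\()' % re.escape(func_name),
                         re.MULTILINE)
    return pattern.sub('__attribute__((noinline)) ', source, count=1)

src = 'static int tftp_connect(struct connectdata *conn, bool *done)\n{\n'
print(insert_noinline(src, 'tftp_connect'))
# -> __attribute__((noinline)) static int tftp_connect(...)
```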

III-B Locating Signature Instructions and Challenges

We generate signatures related to vulnerabilities and patches using the source code and the compiled binary code. For each vulnerable function, we generate its signatures in two steps: 1) generate source-level vulnerability-related instructions, and 2) locate vulnerability-related binary instructions through mapping.

III-B1 Generating Source-level Vulnerability-related Instructions

We prepare the last pre-patch and the first post-patch versions using the information retrieved in subsection III-A. Subsequently, we generate vulnerability- and patch-related signatures at the source code level. We use the diff tool (https://man7.org/linux/man-pages/man1/diff.1.html) to extract source-code-level patched instructions. There are three types of source-code-level patches in the diff outputs: 1) added instructions that appear in the patched version and are absent in the vulnerable version, as shown in Figure 5(a); 2) deleted instructions that are removed from the vulnerable version and absent in the patched version, as shown in Figure 5(b); and 3) changed instructions that are updated from the vulnerable version to the patched version, as shown in Figure 5(c). Changed instructions usually share the same context instructions across the two versions.
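The three patch types correspond one-to-one to the opcodes a line diff produces. The paper uses GNU diff; the following sketch uses Python’s difflib only to show the same classification programmatically:

```python
import difflib

def source_patch_sites(vuln_lines, patched_lines):
    """Classify source-level differences into the three signature types:
    'insert' -> add, 'delete' -> delete, 'replace' -> change (sketch)."""
    sites = []
    matcher = difflib.SequenceMatcher(None, vuln_lines, patched_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'insert':       # lines only in the patched version
            sites.append(('add', [], patched_lines[j1:j2]))
        elif tag == 'delete':     # lines only in the vulnerable version
            sites.append(('delete', vuln_lines[i1:i2], []))
        elif tag == 'replace':    # lines updated between the two versions
            sites.append(('change', vuln_lines[i1:i2], patched_lines[j1:j2]))
    return sites
```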

III-B2 Locating Vulnerability-related Binary Instructions through Mapping

We use the source-to-binary mapping in the binary’s debugging information to locate the assembly instructions corresponding to the source code. Although the idea is simple, there are two practical challenges:

  1. Asymmetric source-binary mapping: it is challenging to map source line changes in the source code files (e.g., .cpp or .c files) to the corresponding binary file.

  2. Identification of vulnerability-specific source lines.

Refer to caption
Figure 4: An example of missing mappings between source code and binary code. Lines 1226 and 1228 do not have any mapped instructions in the binary code because the assembly code does not need to specify type information for functions and variables. Line 1230 maps to two different basic blocks; line 1231 maps to one basic block. This example is extracted from openjpeg version 1.5.0.

The two challenges in mapping source code to binary are addressed as follows:

  • Challenge 1: Asymmetric source-binary mapping. Not all source code lines have a matching binary instruction. For example, Figure 4 shows missing mappings between source code and binary code: lines 1226 and 1228 declare new variables but do not map to any binary instructions, because at the binary level variables are used directly without explicit type declarations.

  • Solution 1: Generally, source code lines declaring new variables (e.g., lines 1226 and 1228 in Figure 4) have no mapped binary code because of this binary code convention. However, this does not affect finding the binary signatures. We elaborate on two cases: 1) If a new variable declaration is added, the variable must be used later in other source code lines, so the correlated source lines still appear when diffing the patched and vulnerable functions’ source code. 2) If a variable’s name is changed, the source code referring to that variable must change, which is detected by diffing the source code. For a variable with a type change (e.g., from a defined structure structA to an updated structure structA’), the source code lines using that variable tend to change because of the different type usage (e.g., accessing different fields of the different structure types).

  • Challenge 2: Identification of vulnerability-specific source lines. The add type signature is challenging to represent. Because the add type signature only exists in the patched version, the added instructions imply the existence of a patch rather than of the vulnerability itself; there are no direct vulnerable instructions in the vulnerable version. For example, Figure 5(a) shows an add type signature at the source-code level. Green lines (lines B2 to B6 on the right-hand side) are the added lines in the patched version, and grey lines are unchanged across the two versions. The absence of the green lines in the vulnerable version implies a vulnerability. However, another arbitrary function may lack the added instructions without having the vulnerability, so the lack of added instructions cannot be directly used as the vulnerable signature. We need to infer the vulnerability signature in the vulnerable version to detect vulnerability existence.

  • Solution 2: To represent the add type vulnerability signature, our solution is to focus on the context. For example, lines A1 and B1 in Figure 5(a) are unchanged across the two versions. A1 flows to A2 in the vulnerable version, while B1 flows to B2 in the patched version. VulMatch regards the control flow from the unchanged instruction A1 to the following instruction A2 as the vulnerability signature; conversely, the control flow from B1 to B2 is regarded as a patch signature. Since the added instructions are inserted at some point within the function, the two versions share identical context instructions (e.g., A1 and B1) with different subsequent instructions (e.g., A2 and B2). For simplicity, we explain this concept at the source code level, but we extract add type signatures at the binary level; for more details, refer to subsection III-C.

(a) Add Type
(b) Delete Type
(c) Change Type
Figure 5: Examples of add, delete and change types. Green lines are the newly added or changed instructions in the patched version. Red lines are the deleted or changed lines in the vulnerable version. Grey lines are the intact lines.

III-C Constructing Context-aware Binary-level Signatures

We construct the binary-code-level signatures before storing them in the database for signature matching. Simply storing sets of instructions as vulnerable signatures and detecting their existence in the query binary is insufficient. As mentioned in subsection III-B, added instructions in the patched binary cannot directly form a vulnerability signature because they only indicate the patch. The term context refers to the instructions of the blocks adjacent to the vulnerable binary instructions. The vulnerable binary instructions are usually few; if we generated signatures by simply concatenating them into a sequence, the signature might carry too little information to prevent false positives. Therefore, we form new structures by combining the context with the vulnerable instructions, giving the signature adequate uniqueness to boost the performance of signature matching. For instance, if the extracted signature is small (e.g., only 3 instructions), checking its existence without context information makes the signature insufficiently unique, leading to excessive mismatches (false positives).

Since the added instructions in the patched version have blocks directly preceding them, the counterpart preceding blocks in the vulnerable version should have different instructions following them. Therefore, we capture local control flows around the preceding blocks in the vulnerable version to represent the vulnerability signature.

Refer to caption
Figure 6: An example of binary-code-level add signature and the steps to extract the corresponding binary signature.

We propose to generate the binary-level signature with control-flow information. Firstly, we define several terms.

  • Add Batch. When newly added source code snippets are mapped to binary code blocks, the newly added blocks can either be directly connected to each other (e.g., blocks 4 and 5 in Figure 6) or separate from each other (e.g., blocks 4 and 9 in Figure 6). An add batch is made up of added blocks that are strongly connected to each other. As shown in Figure 6, blocks (4, 5, 6) and blocks (9, 10) form two add batches.

  • Leading Blocks. The leading block is the unchanged block immediately preceding an added batch. As shown in Figure 6, blocks 1 and 7 are two leading basic blocks because they immediately precede two add batches.

  • Parents-children Structure. We define a parents-children structure to store the control-flow and literal information for add and change signatures. Specifically, a parents-children structure takes an initial block of the function as the parent and includes that block’s children blocks. Conversely, we can select a child block and include its parent blocks to establish a parents-children structure.

  • Block List Structure. We define a block list structure to store only the literal information when control-flow information is unavailable or unnecessary. In one block list structure, we store all the vulnerable binary instructions grouped by blocks. (A sketch of both structures follows this list.)
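The two storage structures can be pictured with a minimal sketch (field and class names are ours, chosen for illustration):

```python
from dataclasses import dataclass

@dataclass
class Block:
    instructions: list[str]   # normalized binary instructions of one block

@dataclass
class ParentsChildren:
    """Control-flow plus literal information: a parent block and its child
    blocks (or, conversely, a child block and its parent blocks)."""
    parent: Block
    children: list[Block]

@dataclass
class BlockList:
    """Literal information only: vulnerable instructions grouped by block."""
    blocks: list[Block]
```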

We store both vulnerability and patch signatures. Vulnerability signatures are generated from the instructions in the vulnerable version and consist of parents-children structures or block list structures. The patch signature consists of the instructions that only exist in the patched version and consists only of block list structures; it is used to further reduce false positives, since, regardless of the vulnerability match score, a matched patch signature directly implies a patch. The vulnerability signatures come in three types: 1) add, 2) delete, and 3) change. These signatures have different structures because each signature type has a different nature, so we capture different information for each type to enrich the signature.

To locate the add type binary signatures, we: A) retrieve the added binary instructions in the patched version (i.e., the output of the operations in subsection III-B); B) find the leading basic blocks in the patched binary; C) find their counterpart leading basic blocks in the vulnerable binary; and D) include the vulnerable binary’s leading basic blocks and their children blocks as a parents-children structure in the signature (a sketch follows).
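Under the simplifying assumptions that the CFG is given as a {block: successors} map and that a correspondence between unchanged blocks of the two versions is available (both assumptions ours, for illustration), steps A) to D) can be sketched as:

```python
def add_type_signatures(added_blocks, patched_cfg, vuln_cfg, counterpart):
    """Steps A-D for the add type (sketch).

    added_blocks: blocks in the patched binary mapped from added source lines.
    patched_cfg / vuln_cfg: {block: [successor blocks]}.
    counterpart: maps an unchanged patched-version block to its
    vulnerable-version equivalent (assumed given, e.g. via block matching).
    """
    signatures = []
    for block, succs in patched_cfg.items():
        # B) leading block: unchanged block immediately preceding an add batch
        if block not in added_blocks and any(s in added_blocks for s in succs):
            # C) its counterpart leading block in the vulnerable binary
            vuln_leader = counterpart[block]
            # D) parents-children structure: the leader plus its children
            signatures.append((vuln_leader, vuln_cfg[vuln_leader]))
    return signatures
```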

For the delete type signature, we directly locate the mapped binary instructions and store them in block list structures, since the deleted instructions usually map to multiple blocks. Because the mapped blocks are usually numerous, their lexical information alone already makes the signature unique enough for matching; recording their control-flow information would require excessive parents-children structures. We generate no patch signature for this type because the patched version has no unique instructions absent from the vulnerable version.

The change type signature has two categories: one-block-change and many-block-change. Many-block-change means the changed instructions are distributed across multiple blocks (i.e., neighboring blocks or blocks that are not directly connected). One-block-change is the case where all the changed instructions are accommodated in one binary block.

TABLE I: Information of the seven selected open-source projects.
Project Domain Versions(#) Binary Files(#) .c Files(#) .h Files(#) CVEs(#) Vulnerable Functions(#) Avg Size
Tcpdump Packet Analyzer 20 152 167 78 192 213 20.45
Curl Data Transferring 67 315 419 197 111 231 44
OpenSSL Protocols 51 755 903 243 114 220 205
Openjpeg Image Processing 15 104 205 139 94 187 24.50
LibPNG Image Processing 63 39 36 14 52 50 6.90
Libtiff Image Processing 30 69 102 24 142 169 12.30
FFmpeg Multimedia Processing 104 1206 1591 629 201 211 584
Total various 350 2640 3423 1324 906 1281 897.15

Many-block-change: If the change is a many-block-change, we record both control-flow and lexical information in the database, since the change sites are usually small. This category of signature provides rich information, as it contains adequate lexical information (i.e., binary instructions) from multiple blocks plus control-flow information between those blocks. For each block in a many-block-change site, if a neighbor (i.e., a predecessor or successor block) is also a changed block, we include this neighbor to form a parents-children structure; if none of its neighbor blocks is changed, the changed instructions are grouped as a block in the signature. Note that if the many-block-change spans more than two levels (the depth of a parents-children structure), we use multiple parents-children structures to cover all the connected blocks. For example, if block A flows to block B, and block B flows to block C, we use two parents-children structures to cover the flow from A to B and from B to C, respectively.
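A sketch of this grouping rule, under the assumption (ours) that the CFG is given as predecessor/successor maps:

```python
def many_block_change_structures(changed, preds, succs):
    """Group changed blocks (sketch): pair each changed block that has a
    changed successor into a parents-children structure; changed blocks
    with no changed neighbor at all go into a plain block list."""
    pc_structures, isolated = [], []
    for block in changed:
        changed_succ = [s for s in succs[block] if s in changed]
        if changed_succ:
            pc_structures.append((block, changed_succ))  # parent + children
        elif not any(p in changed for p in preds[block]):
            isolated.append(block)  # no changed neighbor: lexical info only
    return pc_structures, isolated
```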

One-block-change: Conversely, if the change is a one-block-change, the information is limited because we only have lexical information without control flow. Thus, we add control-flow information to enhance the signature and reduce potential mismatches: we include the changed block’s parent blocks in a parents-children structure, or its children blocks if it has no parent block.

Patch signatures: We also generate signatures for patches. After generating the vulnerability signatures as above, we diff the vulnerability-related sites in both versions, identify the instructions that only exist in the patched version, and store them in a block list structure as the patch signature.

III-D Signature Matching

We detect the vulnerability’s existence using both vulnerability and patch signatures. For the add signature, we search for each vulnerable parents-children structure in the query binary code and then check for the existence of the patch signature; if the patch is found, the function is directly considered patched. For the delete signature, we search for the existence of the blocks from the block list structures; we do not match patch signatures for the delete type because this type has no unique instructions in the patched version. For the change signature, we search for the existence of each parents-children structure or block list structure in the query binary code and subsequently check for the patch signature. If the patch is found in the query, the function is considered patched (denoted by P = 1); otherwise, P = 0.

We propose a measurement of the vulnerability existence score (Sim) to quantify the probability that the query function contains a given vulnerable signature. Specifically, the final score of vulnerability existence is calculated as follows:

$$Sim = \begin{cases} \dfrac{\sum_{i=1}^{len(S)} Matched(S[i])}{\sum_{i=1}^{len(S)} Total(S[i])} & \text{if } P = 0 \\ 0 & \text{if } P = 1 \end{cases}$$

where Sim represents the resulting similarity score against the vulnerable signature, and S represents one vulnerable signature. A signature consists of one or multiple structures (each structure is either a parents-children structure or a block list structure); len(S) is the number of structures in the signature, and S[i] is one structure. Matched() counts the instructions matched between the input structure and the given query binary function: if the structure is a parents-children structure PS, Matched() searches the query binary for the parents-children structure PS' with the maximum similarity and counts the instructions shared between PS and PS'; if the structure is a block list structure, Matched() finds, for each block in the list, the query block with the maximum similarity before counting the shared instructions. Total() returns the total instruction count of the input structure.
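A direct transcription of this computation might look as follows; structures are modeled simply as lists of blocks (each block a list of instruction strings), and matched_insns is a simplified set-overlap stand-in for the paper’s maximum-similarity structure search:

```python
def sim(signature, query_blocks, patch_found):
    """Vulnerability existence score for one signature: matched over total
    instructions, summed across structures; forced to 0 when the patch
    signature was found in the query function (P = 1)."""
    if patch_found:  # P = 1
        return 0.0
    matched = sum(matched_insns(s, query_blocks) for s in signature)
    total = sum(len(block) for s in signature for block in s)
    return matched / total if total else 0.0

def matched_insns(structure, query_blocks):
    # Simplified stand-in: count signature instructions appearing anywhere in
    # the query function; the paper instead locates the most similar
    # parents-children (or block) counterpart before counting shared ones.
    query = {insn for block in query_blocks for insn in block}
    return sum(1 for block in structure for insn in block if insn in query)
```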

IV Evaluation

IV-A Experimental Setup

Data Collection: We collected source code for seven open-source projects, including OpenSSL, OpenJPEG, FFmpeg, TCPDUMP, LibTIFF, cURL, and LibPNG. These projects are selected from diverse domains like communication protocols, image processing tools, and network traffic analyzers. After manual analysis, we extracted 906 CVEs corresponding to 1,281 vulnerable functions. Table I lists the versions, application domains, CVE information, vulnerability, and code-related information.

Baseline Tool Selection and Testbed: We prepared two state-of-the-art baseline tools Asm2Vec [28] and PalmTree [46], because of their popularity and excellent performance in vulnerability detection. We ran VulMatch and Asm2vec on an Intel NUC kit (NUC8i5BEH) with an i5-8259U processor and 16 GB memory. Since Palmtree is a deep learning-based approach and requires intensive GPU power, we ran it on an accelerator cluster of high-performance computer (HPC) systems with 456 NVidia Tesla P100, 114 Dual Xeon 14-core E5-2690, and 256 GB memory.

Project Compilation: As mentioned in subsection III-A, we compile all versions related to each vulnerability (i.e., the last version before patching and the first version after patching) of each project to generate binary code instances. Depending on the project, we use the project’s default compilation flags, either -O2 or -O3, and we use identical flags across builds of the same project. Thus, when we diff the compiled binaries to generate vulnerability and patch signatures, the compilation options are the same, which minimizes irrelevant differences in the binary code and is common practice [45, 44] for finding vulnerable instructions. At compile time, we enable the debugging symbol option to acquire the source-to-binary instruction mapping that serves as ground truth.

Figure 7: Interpreting a many-block-change signature matching. The left-hand side is the generated vulnerable signature, and the right-hand side is the matched instructions in the query binary.

Research Questions: In the first experiment, we compare VulMatch with two state-of-the-art baseline tools to evaluate how well they find known binary code vulnerabilities. In the second experiment, we test how VulMatch interprets the vulnerabilities it finds and how it assists humans in understanding why it considers the query binary vulnerable. In the third experiment, we match vulnerabilities in real-world firmware binaries to test how VulMatch works in a real-world application. In the fourth experiment, we investigate how the proposed vulnerability signature types (i.e., add, delete, and change) are distributed.

IV-B Performance Metrics

Top-1 Score: Each vulnerable function was patched after a certain version, and all versions (or a range of versions) of the function before that version are vulnerable. Therefore, we select a vulnerable binary function f from binary code B to test how the tools discover similar vulnerabilities. We construct the vulnerable and patch signatures of f from the last pre-patch version and the first post-patch version and store them in the database. To test how well the stored signature can be matched, we prepare a binary version (denoted by B_v) containing the vulnerable version of f (denoted as f_v), and a patched binary version (denoted by B_p) containing the patched version of f (denoted as f_p) for testing purposes.

B_v and B_p should differ from the versions used to generate the binary signature. B_v and B_p contain many functions, including the vulnerable or patched version of f as well as other functions. For vulnerable function f, we inspect each function f_i in both B_v and B_p to derive a match score indicating how similar f_i is to f’s vulnerable signatures. f_v in B_v should have the highest score among all functions; conversely, f_p and all other functions in B_p should have low match scores. It is reasonable for f_p to score higher than the other functions in B_p since f_p is patched from f; nevertheless, f_p’s score should be lower than f_v’s. We use the top-1 score to measure the rate at which the ground-truth vulnerable function is ranked first.

We provide a simple example of how the top-1 score works in VulMatch. Suppose there are ten vulnerable functions, each with a vulnerable binary version B_v containing f_v and a patched binary version B_p containing f_p; both B_v and B_p also contain many other functions. We match the vulnerable signature of f in the database against each function in both B_v and B_p. If the vulnerable function f_v has the highest score, we rank f_v in the top-1 place. If 8 out of 10 vulnerable functions rank their testing vulnerable version f_v in the top-1 place, then the top-1 score is 0.8.
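In code, the top-1 score reduces to an argmax check per tested vulnerable function (a sketch under our own data layout):

```python
def top1_score(cases):
    """cases: one (scores, ground_truth) pair per tested vulnerable function,
    where scores maps every function in B_v and B_p to its match score."""
    hits = sum(max(scores, key=scores.get) == truth for scores, truth in cases)
    return hits / len(cases)

# e.g., 8 of 10 tested functions rank f_v first -> top-1 score of 0.8
```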

Mismatch Score: The top-1 score only partly reflects how accurately the tools distinguish ground-truth vulnerable functions from other functions; it cannot demonstrate how well the tools recognize non-vulnerable functions as non-vulnerable. A tool that identifies vulnerable functions with a high top-1 score may still assign non-vulnerable functions match scores extremely close to those of vulnerable functions, which leads to a high mismatch score.

For instance, a tool may output a similarity score of 0.98 for the ground-truth vulnerable function while the ground-truth patched function or another random function scores 0.97. In this case, even though the ground-truth function is ranked first, the two scores are too close to support a confident verdict: the ground-truth vulnerable function should score significantly higher than any other function. If any non-ground-truth function has a score close to or higher than the ground-truth vulnerable function’s, the case receives a non-zero mismatch score.

The mismatch score indicates the reliability of the top-1 score. To track the mismatch score of each vulnerable function, the parameter α serves as a threshold for activating the mismatch score. We consider any non-ground-truth function with a score above S_GV − α a mismatch, where S_GV denotes the ground-truth vulnerable function’s score. If S_GV is extremely low (e.g., near zero), a non-vulnerable function scoring close to or above S_GV is not counted as a mismatch, because the root failure lies in detecting the vulnerable function rather than in the non-vulnerable functions; we treat S_GV < 0.6 as an extremely low score.
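The mismatch criterion can be stated compactly (a sketch; the 0.6 cut-off is the paper’s):

```python
def is_mismatch(s_gv, other_scores, alpha):
    """A non-ground-truth function scoring above S_GV - alpha is a mismatch,
    unless S_GV itself is extremely low (< 0.6), in which case the failure
    is attributed to vulnerable-function detection instead."""
    if s_gv < 0.6:
        return False
    return any(score > s_gv - alpha for score in other_scores)
```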

IV-C Vulnerability Detection

Since VulMatch aims to find replicated known vulnerabilities, and the two baseline tools Asm2vec and Palmtree find vulnerabilities based on binary code similarity, we compare VulMatch with these two baselines on the first objective: finding replicated vulnerabilities. Moreover, the query binary is non-deterministic in real-world scenarios, since it could be either a vulnerable or a patched version; thus the second objective, differentiating vulnerable functions from non-vulnerable ones, is also vital. We therefore design this experiment to test both goals concurrently.

Vulnerable Function Detection Accuracy: Table II lists the top-1 scores of VulMatch, Asm2vec, and Palmtree on the seven selected projects. VulMatch outperforms both baseline methods in six projects and is marginally lower than Palmtree on OpenSSL. VulMatch ranks the testing vulnerable functions at the top because it extracts accurate vulnerability signatures and matches both vulnerable and patched signatures, i.e., it matches the fine-grained vulnerability-related instructions (signatures) rather than coarsely matching the whole function. Since the vulnerability signature tends to be a small snippet of instructions, whole-function similarity matching fails to capture such fine-grained information.

TABLE II: Top-1 scores of seven open-source projects. A higher score indicates a better performance.
Project    Openjpeg  FFmpeg  Tcpdump  Libtiff  curl   LibPNG  OpenSSL
Asm2vec    0.673     0.643   0.702    0.675    0.821  0.815   0.536
Palmtree   0.573     0.643   0.702    0.779    0.840  0.837   0.691
VulMatch   0.791     0.714   0.786    0.825    0.872  0.859   0.673

Non-Vulnerable Function Detection Accuracy: Table III lists the mismatch scores of VulMatch, Asm2vec, and Palmtree for different α values. From the mismatch score perspective, VulMatch achieved the best (lowest) result, indicating that it differentiates vulnerable from patched versions with the highest confidence. Conversely, Asm2vec and Palmtree had high mismatch scores, indicating that many decisions between vulnerable and patched versions were made with low confidence. Since α denotes the threshold distance to the ground-truth vulnerable function’s score and S_GV < 0.6 marks extremely low scores, we vary α between 0.1 and 0.4 to obtain positive mismatch scores. Our evaluation results empirically suggest that α = 0.1 yields the best result.

For FFmpeg, VulMatch achieved mismatch scores of 0, compared with the baseline methods’ mismatch scores of approximately 1. For the other projects, there are also stark contrasts between VulMatch and the baseline tools. VulMatch outperforms the two baselines because it derives vulnerability signatures from both the vulnerable and patched versions, and because it matches the fine-grained vulnerability signature rather than coarse whole-function similarity. Therefore, subtle vulnerability-related differences are accurately identified, which is superior to whole-function-level similarity matching.

IV-D Interpretability

When finding vulnerable functions, a tool’s interpretability is as important as its high accuracy. In practice, vulnerability-detecting tools assist human experts in making a final verdict. Therefore, a good tool should clearly explain why a query result is considered vulnerable. Unfortunately, the state-of-the-art baselines fail to provide good interpretation functionality. Palmtree outputs only the overall similarity score between the query function and the functions stored in its database. In addition to the overall similarity score, Asm2vec lists similar instructions for the query. Asm2vec fails to highlight the vulnerability-related instructions; instead, it highlights the whole function as different or similar.

TABLE III: Mismatch scores of seven open-source projects. A stands for Asm2vec, P for Palmtree, and V for VulMatch. A lower score indicates a better performance.
α    Tool  Openjpeg  FFmpeg  Tcpdump  Libtiff  curl   LibPNG  OpenSSL
0.1  A     0.700     0.929   0.881    0.968    0.949  0.924   0.945
     P     0.909     0.821   0.905    0.955    0.750  0.946   0.936
     V     0.091     0.000   0.190    0.162    0.038  0.098   0.118
0.2  A     0.836     0.964   0.988    1.000    1.000  1.000   1.000
     P     0.982     1.000   0.940    1.000    0.885  1.000   0.955
     V     0.127     0.000   0.286    0.260    0.103  0.152   0.127
0.3  A     0.900     1.000   1.000    1.000    1.000  1.000   1.000
     P     0.991     1.000   0.940    1.000    0.974  1.000   0.955
     V     0.155     0.000   0.429    0.312    0.147  0.163   0.182
0.4  A     0.962     1.000   1.000    1.000    1.000  1.000   1.000
     P     1.000     1.000   0.976    1.000    1.000  1.000   0.964
     V     0.173     0.000   0.500    0.383    0.224  0.196   0.218

Figure 7 demonstrates an example of VulMatch’s interpretability. This example is a many-block-change vulnerability signature match selected from CVE-2016-9117. The signature (left-hand side) was extracted from the imagetopnm function in versions 2.1.2 and 2.2.0; the matched instructions (right-hand side) in the query binary are from version 2.1.1. For the selected signature, there are 23 instructions across all structures, and 19 of them are matched. The unmatched instructions mov rcx, qword ptr [rip + 0xfc246] and mov rcx, qword ptr [rip + 0xfc22b] on the left-hand side and mov rcx, qword ptr [rip + 0xfc236] and mov rcx, qword ptr [rip + 0xfc21b] on the right-hand side have different offsets because structure fields changed. Note that this vulnerable function has multiple signatures; we omit the others for clarity. The overall match score combining all signatures exceeds 0.867, indicating VulMatch’s high confidence in the verdict.

IV-E Real-world Vulnerability Detection

Since IoT device firmware reuses open-source projects, it often contains 1-day vulnerabilities. In this experiment, we evaluate how effectively VulMatch detects real-world 1-day vulnerabilities in IoT firmware. We select four IoT firmware instances (i.e., DCS-3511, DCS-6517, DCS-7517, and DCS-6915) collected in the wild. We manually analyze the firmware and prepare 36 ground-truth 1-day vulnerabilities covering 52 vulnerable functions. We generate the vulnerability binary code signatures and store them in the database. For each vulnerability signature in the database, we match it against each function F_i in the firmware and assign F_i a matching score; if the F_i with the top score is the ground-truth vulnerable function, the vulnerable function is correctly detected. VulMatch correctly detects 40 out of 52 (77%) vulnerable functions. Again, the high accuracy in finding real-world replicated vulnerabilities is due to VulMatch’s focus on the fine-grained vulnerable instructions along with local control-flow information. We manually analyzed the failed cases and found two main causes: 1) the binary code contains other functions with high similarity to the vulnerable one; and 2) the testing binary code uses structures with different fields, so at the binary level the structure offsets differ from the signature in the database. For example, [esi+0x40] changed to [esi+0x48], where esi holds the memory address of the structure: the same field moved from offset 0x40 to 0x48 because other fields were added to or deleted from the structure.

IV-F Statistics of Signature Distributions

TABLE IV: Vulnerability CWE types of the seven open source projects.

Type              openssl  openjpeg  libtiff  libpng  ffmpeg  curl   tcpdump
NVD-CWE-Other     0.15     0.01      0.04     0.15    0.03    0.00   0.00
CWE-399           0.12     0.00      0.01     0.12    0.08    0.00   0.00
CWE-310           0.12     0.00      0.00     0.00    0.00    0.00   0.00
CWE-787           0.02     0.15      0.10     0.00    0.02    0.00   0.01
CWE-119           0.11     0.33      0.41     0.31    0.37    0.00   0.27
CWE-190           0.01     0.10      0.03     0.02    0.01    0.00   0.01
CWE-125           0.03     0.06      0.13     0.02    0.03    0.04   0.60
CWE-189           0.04     0.02      0.06     0.19    0.13    0.00   0.02
NVD-CWE-noinfo    0.01     0.01      0.00     0.08    0.14    0.00   0.01
CWE-126           0.00     0.00      0.00     0.00    0.00    0.13   0.01
CWE-122           0.00     0.05      0.00     0.00    0.00    0.09   0.00
CWE-305           0.00     0.00      0.00     0.00    0.00    0.09   0.00

In this experiment, we investigate the distribution of the vulnerabilities according to 1) their Common Weakness Enumeration (CWE) types and 2) our three defined signature types (i.e., add, delete, change). Table IV lists the vulnerability distribution across CWE types. Specifically, we select the three most popular CWE types for each project and concatenate them into the table. We observe that Improper Restriction of Operations within the Bounds of a Memory Buffer (CWE-119) is the most common vulnerability type in our experiment (top-three in 5 of the 7 projects). Curl contains the most CWE vulnerability types (43 types), while LibPNG contains the fewest (11 types).

Table V shows the distribution of the four types of vulnerability signatures (i.e., add, delete, one-block-change, and many-block-change). Originally, there were three types (add, delete, change); we further split the change type into one-block-change and many-block-change for clarity. Sig (#) refers to the number of signatures of that type in the project. Avg. size refers to the average number of instructions per CVE for that signature type in the project. Generally, many-block-change is the most common type across the datasets, and delete is the least common. The add type has the largest average size because an add signature comprises at least two complete basic blocks. Conversely, the delete type has the smallest average size because delete signatures are made up of separate blocks and carry no control-flow information connecting multiple blocks. The change types may consist of either parent-children structures or separate blocks.

TABLE V: Statistics of signatures in the 7 open source projects. OBC for one-block-change, and MBC for many-block-change.

Signature type         openssl  openjpeg  libtiff  libpng  ffmpeg  curl  tcpdump
add     sig (#)            120        75       48      10      69    74      248
        Avg. size           50        77      257      46     185   103       61
delete  sig (#)             37        17        9       1       9    20       64
        Avg. size            6         3        6       9      10     4        6
OBC     sig (#)            109       172       61      27      61    85      114
        Avg. size            6        11       10       7      14     6        7
MBC     sig (#)            146       258       91      27      61   109      269
        Avg. size           14        23       24       9      20    14       12

V Discussion

Require Source Code: Compared to three state-of-the-art works [43, 45, 44], we require both source code and binary code to extract the signature. All three tools [43, 45, 44] claim to require only binary code, but they require all the vulnerability-related versions of the binary, and those binaries must be compiled with the same optimization flag. This assumption is strong because one cannot guarantee that binary versions collected in the wild were compiled with the same options. Therefore, in their actual implementations, they still need the source code to generate the different binary versions, compiled with consistent optimization options, from which a signature is extracted.

Cross Architecture: VulMatch only investigates vulnerable and patched code on the same architecture. However, the same source code can be compiled for different hardware architectures (e.g., ARM, x32, PowerPC). How to match cross-architecture vulnerable signatures remains an open research problem. Possible solutions include 1) translating different architectures’ instructions into an intermediate language, and 2) extracting vulnerable binary signatures separately for each architecture. However, this issue is beyond this paper’s scope.

Differences Introduced by Compilation: An important challenge is mitigating the instruction differences introduced by different compiler optimization settings, different compilers, and different compiler versions. This paper only considered each project’s default optimization options and our testing system’s default compiler, yet binaries compiled with other optimization levels or compilers may be observed in the wild. One plausible solution is to utilize symbolic execution to mitigate the impact of different optimization levels, as in [40]; however, symbolic execution is time-consuming. Another possible solution, which requires no change to our current methodology, is to increase our training data, i.e., the binaries we extract signatures from. Since we only extract vulnerability signatures from vulnerable and patched versions compiled with their default optimization level and the default compiler, the current training data are limited. To detect cross-optimization-level or cross-compiler signatures, one could compile each project under multiple optimization levels and compilers and extract the corresponding signatures.
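As a concrete illustration of this augmentation idea, the following sketch rebuilds one vulnerable source file under several compilers and optimization levels; the file name and the commented extract_signature hook are assumptions for illustration, not VulMatch’s actual interface:

```python
import subprocess

# Rebuild one vulnerable source file under several compilers and optimization
# levels; a signature would then be extracted from every resulting object file.
SOURCE = "vulnerable.c"                  # assumed input file
COMPILERS = ["gcc", "clang"]             # assumed to be installed
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

for cc in COMPILERS:
    for opt in OPT_LEVELS:
        out = f"vulnerable.{cc}{opt}.o"
        subprocess.run([cc, opt, "-c", SOURCE, "-o", out], check=True)
        # extract_signature(out)  # hypothetical hook into signature extraction
```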

The patch and vulnerability detection genres of work directly extract assembly instructions to form signatures, whereas state-of-the-art whole-function similarity matching adopts data-driven methods. Asm2vec [28] and Palmtree [46] convert assembly instructions into vectors, which mitigates, to some extent, the subtle assembly differences introduced by compilation; data-driven methods are also usually faster than other methods. Merging the two approaches by generating vectorized fine-grained signatures would detect fine-grained signatures while mitigating assembly differences at a lower time cost. Graph-attribute-based vectors are generated in [22, 23]. Therefore, it is possible to extend VulMatch by incorporating fine-grained graph-based embeddings as the signature, as sketched below.
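A minimal sketch of such a vectorized matching step follows; embed is a hypothetical stand-in for an instruction encoder such as Asm2vec or Palmtree (faked here with cached random vectors purely so the pipeline is runnable), and the instruction strings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
_cache = {}

def embed(instruction):
    # Stand-in encoder: a real system would query a trained model here.
    if instruction not in _cache:
        _cache[instruction] = rng.standard_normal(128)
    return _cache[instruction]

def signature_vector(instructions):
    # Average the per-instruction embeddings into one signature vector.
    return np.mean([embed(i) for i in instructions], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sig = ["mov eax, dword ptr [esi+0x40]", "test eax, eax", "jle 0x401000"]
qry = ["mov eax, dword ptr [esi+0x48]", "test eax, eax", "jle 0x402000"]
print(cosine(signature_vector(sig), signature_vector(qry)))
```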

VI Related Work

We present the related work in the following three closely related areas: 1) code similarity detection, 2) patch analysis, and 3) vulnerability detection.

VI-A Code Similarity Detection

VI-A1 Binary-code-level similarity detection

Binary-code-level similarity works fall into two categories according to their methods.

Learning-based methods: Binary code instructions are encoded into embeddings to compare similarity. Gemini [22], Vulseeker [23], and Genius [14] use graph feature embeddings to determine vector similarity. Safe [25], InnerEye [26], αDiff [27], Kam1n0 [7], and Asm2Vec [28] learn instruction embeddings and generate block or function embeddings.

Program-analysis-based methods: Instructions or blocks are treated as sequences in Binsequence [11] and Tracy [12], which use sequence-alignment methods to compare similarity. Similarly, SIGMA [20], FOSSIL [19], and Beagle [18] rely on instruction semantic categorizations such as data transfer, logic, or arithmetic. Bingo [32] and IMF-SIM [33] use input-output relations to measure binary code similarity. Expose [10], Binhash [30], Binhunt [34], CoP [21], ESH [35], GITZ [36], and XMATCH [39] symbolically execute the binary code and compare similarity based on the resulting symbolic formulas.

Limitations: Similarity-based methods match similarity at the whole-function level, whereas vulnerable instructions typically involve only a few lines of code within the function. Therefore, similarity-based methods can filter similar functions but cannot determine whether a function is actually vulnerable.

VI-B Patch Identification and Analysis

FIBER [40] detects patch existence in Linux kernel binaries based on symbolic execution. Using symbolic execution and memory status, PDiff [41] detects patch existence in Linux kernel binaries even when the binaries differ due to patch customization, different build configurations, or other reasons. Spain [4] uses binary-level semantic information to identify patches before summarizing patch and vulnerability patterns. Patchscope [42] identifies patch existence based on a memory-object-centric method and dynamic execution.

Limitations: This category of prior work assumes that the function names are provided or that some similar candidate functions have already been selected by the code-similarity-based method. Moreover, they focus on patch detection rather than vulnerability detection. The lack of a patch does not necessarily imply that the function is vulnerable.

VI-C Vulnerability Detection

VMPBL [43] builds a database storing vulnerable and patched functions to distinguish pre-patch from post-patch functions. VIVA [44] collects binaries of the versions before and after the patch and directly diffs the pre-patch and post-patch functions to retrieve binary-level vulnerability signatures; it further detects vulnerability existence based on pre-filtering and instruction clustering. BINXRAY [45] requires pre-patch and post-patch binaries to analyze the vulnerability-related instructions in both versions before storing the instructions in the database as vulnerability and patch signatures. BINXRAY then checks a query function for the vulnerability based on its closest signature version.

Limitations: This genre of work is the most similar to our method. However, these tools assume that all binary-code differences between versions are related to the vulnerability; this assumption often introduces many vulnerability-irrelevant instructions into the signatures.

VII Conclusion

In this paper, we proposed a novel approach called VulMatch to extract and match binary-level vulnerability-related signatures. VulMatch consists of four steps: 1) data preparation, 2) locating signature instructions, 3) constructing context-aware binary-level signatures, and 4) signature matching. Compared to previous work, VulMatch accurately locates vulnerability-related instructions and detects vulnerabilities within functions. In our empirical studies, VulMatch outperformed two state-of-the-art similarity-based vulnerability detection tools, Asm2vec [28] and Palmtree [46]. Specifically, VulMatch achieved the most accurate results on six out of seven projects with the least ambiguity while providing reasons why functions are flagged as vulnerable; hence, VulMatch effectively facilitates human understanding of its decision process during vulnerability detection. Our experiment on real-world firmware vulnerability detection indicates that VulMatch is practical for finding vulnerabilities in real-world scenarios. Our analysis of vulnerability distributions confirmed that VulMatch is a versatile detector with good potential for future extension.

References

  • [1] J. Heffley and P. Meunier, “Can source code auditing software identify common vulnerabilities and be used to evaluate software security?” in Proceedings of the 37th Annual Hawaii International Conference on System Sciences.   IEEE, 2004, pp. 21:1–10.
  • [2] S. Haefliger, G. Von Krogh, and S. Spaeth, “Code reuse in open source software,” Management Science, vol. 54, no. 1, pp. 180–193, 2008.
  • [3] ——, “Code reuse in open source software,” Management Science, vol. 54, no. 1, pp. 180–193, 2008.
  • [4] Z. Xu, B. Chen, M. Chandramohan, Y. Liu, and F. Song, “Spain: Security patch analysis for binaries towards understanding the pain and pills,” in Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE).   IEEE, 2017, pp. 462–472.
  • [5] M. R. Farhadi, B. C. Fung, P. Charland, and M. Debbabi, “Binclone: Detecting code clones in malware,” in Proceedings of the 2014 Eighth International Conference on Software Security and Reliability (SERE), 2014, pp. 78–87.
  • [6] X. Hu, T.-c. Chiueh, and K. G. Shin, “Large-scale malware indexing using function-call graphs,” in Proceedings of the 16th ACM Conference on Computer and Communications Security, 2009, pp. 611–620.
  • [7] S. H. H. Ding, B. Fung, and P. Charland, “Kam1n0: Mapreduce-based assembly clone search for reverse engineering,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
  • [8] B. Kang, T. Kim, H. Kwon, Y. Choi, and E. G. Im, “Malware classification method via binary content comparison,” in Proceedings of the 2012 ACM Research in Applied Computation Symposium, 2012, pp. 316–321.
  • [9] I. Santos, F. Brezo, J. Nieves, Y. K. Penya, B. Sanz, C. Laorden, and P. G. Bringas, “Idea: Opcode-sequence-based malware detection,” in Proceedings of the International Symposium on Engineering Secure Software and Systems, 2010, pp. 35–43.
  • [10] B. H. Ng and A. Prakash, “Expose: Discovering potential binary code re-use,” in Proceedings of the IEEE Annual Computer Software and Applications Conference, 2013, pp. 492–501.
  • [11] H. Huang, A. M. Youssef, and M. Debbabi, “Binsequence: Fast, accurate and scalable binary code reuse detection,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 155–166.
  • [12] Y. David and E. Yahav, “Tracelet-based code search in executables,” ACM SIGPLAN Notices, vol. 49, no. 6, pp. 349–360, 2014.
  • [13] B. S. Baker, U. Manber, and R. Muth, “Compressing differences of executable code,” in Proceedings of the ACM SIGPLAN Workshop on Compiler Support for System Software, 1999, pp. 1–10.
  • [14] Q. Feng, R. Zhou, C. Xu, Y. Cheng, B. Testa, and H. Yin, “Scalable graph-based bug search for firmware images,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, pp. 480–491.
  • [15] M. Bourquin, A. King, and E. Robbins, “Binslayer: Accurate comparison of binary executables,” in Proceedings of the ACM SIGPLAN Program Protection and Reverse Engineering Workshop, 2013.
  • [16] S. Cesare, Y. Xiang, and W. Zhou, “Control flow-based malware variant detection,” IEEE Transactions on Dependable and Secure Computing, vol. 11, pp. 307–317, 2014.
  • [17] W. M. Khoo, A. Mycroft, and R. Anderson, “Rendezvous: A search engine for binary code,” in Proceedings of the 2013 Working Conference on Mining Software Repositories, 2013, pp. 329–338.
  • [18] M. Lindorfer, A. Di Federico, F. Maggi, P. M. Comparetti, and S. Zanero, “Lines of malicious code: Insights into the malicious software industry,” in Proceedings of the ACM Annual Computer Security Applications Conference, 2012, pp. 349–358.
  • [19] S. Alrabaee, P. Shirani, L. Wang, and M. Debbabi, “Fossil: A resilient and efficient system for identifying foss functions in malware binaries,” ACM Transactions on Privacy and Security, 2018.
  • [20] ——, “Sigma: A semantic integrated graph matching approach for identifying reused functions in binary code,” Digital Investigation, pp. S61–S71, 2015.
  • [21] L. Luo, J. Ming, D. Wu, P. Liu, and S. Zhu, “Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 389–400.
  • [22] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 363–376.
  • [23] J. Gao, X. Yang, Y. Fu, Y. Jiang, and J. Sun, “Vulseeker: A semantic learning based vulnerability seeker for cross-platform binary,” in Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018, pp. 896–899.
  • [24] E. Mengin and F. Rossi, “Binary diffing as a network alignment problem via belief propagation,” in Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE, 2021, pp. 967–978.
  • [25] L. Massarelli, G. A. D. Luna, F. Petroni, R. Baldoni, and L. Querzoni, “Safe: Self-attentive function embeddings for binary similarity,” in Proceedings of the 2019 International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment.   Springer, 2019, pp. 309–329.
  • [26] F. Zuo, X. Li, Z. Zhang, P. Young, L. Luo, and Q. Zeng, “Neural machine translation inspired binary code similarity comparison beyond function pairs,” in Proceedings of the 2022 Network and Distributed System Security Symposium (NDSS), 2022.
  • [27] B. Liu, W. Huo, C. Zhang, W. Li, F. Li, A. Piao, and W. Zou, “αdiff: Cross-version binary code similarity detection with dnn,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 667–678.
  • [28] S. H. Ding, B. C. Fung, and P. Charland, “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization,” in Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP).   IEEE, 2019, pp. 472–489.
  • [29] J. Pewny, B. Garmany, R. Gawlik, C. Rossow, and T. Holz, “Cross-architecture bug search in binary executables,” in Proceedings of the 2015 IEEE Symposium on Security and Privacy, 2015, pp. 709–724.
  • [30] W. Jin, S. Chaki, C. Cohen, A. Gurfinkel, J. Havrilla, C. Hines, and P. Narasimhan, “Binary function clustering using semantic hashes,” in Proceedings of the 2012 International Conference on Machine Learning and Applications, 2012, pp. 386–391.
  • [31] U. Kargén and N. Shahmehri, “Towards robust instruction-level trace alignment of binary code,” in Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2017, pp. 342–352.
  • [32] M. Chandramohan, Y. Xue, Z. Xu, Y. Liu, C. Y. Cho, and H. B. K. Tan, “Bingo: Cross-architecture cross-os binary search,” in Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2016.
  • [33] S. Wang and D. Wu, “In-memory fuzzing for binary code similarity analysis,” in Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE, 2017, pp. 319–330.
  • [34] S.-C. Wang, C.-L. Liu, Y. Li, and W.-Y. Xu, “Semdiff: Finding semantic differences in binary programs based on angr,” in Proceedings of the 2017 ITM Web of Conferences, 2017.
  • [35] Y. David, N. Partush, and E. Yahav, “Statistical similarity of binaries,” ACM SIGPLAN Notices, vol. 51, no. 6, pp. 266–280, 2016.
  • [36] ——, “Similarity of binaries through re-optimization,” in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2017.
  • [37] A. Lakhotia, M. D. Preda, and R. Giacobazzi, “Fast location of similar code fragments using semantic ‘juice’,” in Proceedings of the 2013 ACM SIGPLAN Program Protection and Reverse Engineering Workshop, 2013.
  • [38] J. Pewny, F. Schuster, L. Bernhard, T. Holz, and C. Rossow, “Leveraging semantic signatures for bug search in binary programs,” in Proceedings of the Annual Computer Security Applications Conference, 2014, pp. 406–415.
  • [39] Q. Feng, M. Wang, M. Zhang, R. Zhou, A. Henderson, and H. Yin, “Extracting conditional formulas for cross-platform bug search,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 346–359.
  • [40] H. Zhang and Z. Qian, “Precise and accurate patch presence test for binaries,” in Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 887–902.
  • [41] Z. Jiang, Y. Zhang, J. Xu, Q. Wen, Z. Wang, X. Zhang, X. Xing, M. Yang, and Z. Yang, “Pdiff: Semantic-based patch presence testing for downstream kernels,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1149–1163.
  • [42] L. Zhao, Y. Zhu, J. Ming, Y. Zhang, H. Zhang, and H. Yin, “Patchscope: Memory object centric patch diffing,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 149–165.
  • [43] D. Liu, Y. Li, Y. Tang, B. Wang, and W. Xie, “Vmpbl: Identifying vulnerable functions based on machine learning combining patched information and binary comparison technique by lcs,” in Proceedings of the 2018 17th IEEE International Conference On Trust, Security And Privacy (TrustCom/BigDataSE).   IEEE, 2018, pp. 800–807.
  • [44] Y. Xiao, Z. Xu, W. Zhang, C. Yu, L. Liu, W. Zou, Z. Yuan, Y. Liu, A. Piao, and W. Huo, “Viva: Binary level vulnerability identification via partial signature,” in Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER).   IEEE, 2021, pp. 213–224.
  • [45] Y. Xu, Z. Xu, B. Chen, F. Song, Y. Liu, and T. Liu, “Patch based vulnerability matching for binary programs,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2020, pp. 376–387.
  • [46] X. Li, Y. Qu, and H. Yin, “Palmtree: Learning an assembly language model for instruction embedding,” in Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021, pp. 3236–3251.

Appendix A Appendix

A-A Considerations of Header Files

In Section III-B (Locating Signature Instructions and Challenges), when we locate the vulnerable binary code from the source code, certain CVEs include source-code changes in the corresponding header (.h) files. However, such CVEs are infrequent (i.e., 26 CVEs out of nearly 1000). After manually analyzing the header-file changes introduced by patching code, we identified three types of changes: 1) changes in MACRO values, 2) changes in structure member variables (i.e., changing, adding, or deleting structure members), and 3) changes in function definitions.

Change of MACRO value: Diffing the source-code versions cannot detect MACRO value changes in the .h files (e.g., #define HAVE_DIRENT_H 1 to #define HAVE_DIRENT_H 0). However, we omit this concern because MACRO-change-related CVEs are rare (i.e., 8 out of nearly 1000 CVEs).
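Were this concern ever worth addressing, a supplementary pass over the headers themselves would suffice. The sketch below (our hypothetical illustration, not part of VulMatch) compares the #define values of two header versions; the file contents are invented examples:

```python
import re

# Hypothetical example contents of the vulnerable and patched header versions.
old_h = "#define HAVE_DIRENT_H 1\n#define BUF_SIZE 256\n"
new_h = "#define HAVE_DIRENT_H 0\n#define BUF_SIZE 256\n"

DEFINE = re.compile(r"#define\s+(\w+)\s+(.+)")

def macros(text):
    """Map MACRO name -> value for every #define line in a header."""
    return dict(m.groups() for m in map(DEFINE.match, text.splitlines()) if m)

old, new = macros(old_h), macros(new_h)
for name in sorted(old.keys() & new.keys()):
    if old[name] != new[name]:
        print(f"MACRO {name}: {old[name]} -> {new[name]}")
# prints: MACRO HAVE_DIRENT_H: 1 -> 0
```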

Change of structure member: Structure members can be modified in a few ways: changing, adding, or deleting. Figure 8 demonstrates examples of each type. Specifically, changing a structure member includes renaming the member and changing the type of an existing member; Figure 8 B and C show examples of renaming a member and changing a member’s type, respectively. Red font marks source code to be deleted, and green font marks source code added in the patched version. If a structure member is deleted, the source-code lines in the .c files mentioning that member must also be deleted, and those line changes can be detected by diffing the .c files. For added or changed structure members, it is less obvious that diffing the .c files suffices: for example, when a structure member is added, as shown in Figure 8 E, one may suspect that the added member new_member has no corresponding update in the .c files referencing the structure. To determine the frequency of this case, we manually inspected all 26 CVEs and found that whenever members are added or changed in a structure, there are corresponding new source-code lines in the .c files referencing them; across approximately 1000 vulnerable functions, the changed or added members had corresponding references in the .c files in 100% of cases. A plausible explanation is that newly added or changed structure members are specifically introduced to be used in the .c files to fix the vulnerabilities. Figure 8 D and E show examples of deleting and adding a structure member, respectively. In summary, across the approximately 1000 vulnerabilities, all structure-member changes, additions, and deletions can be detected with the diff tool.

Figure 8: Examples of structure member changes. A is the original structure in the vulnerable version; B to E are examples from the patched version. Red font marks code to be deleted, and green font marks new code added in the patched version. The example is from openjpeg version 1.5.0.

Change of function definition: A change of function definition refers to changing a function’s calling parameters (e.g., changing the definition static void j2k_write_sot(opj_j2k *j2k) to static void j2k_write_sot(opj_j2k *j2k, int lenp)). This category of change is reflected in the source code: parameter changes can be detected in the .c files referencing that function because the call sites must be updated to pass the different parameters. Since we extract vulnerable function signatures between the vulnerable and patched versions, newly added and deleted functions are out of scope because they exist only in the vulnerable version or only in the patched version.

Zian Liu received the Bachelor of Information Technology degree from Deakin University, Australia, in 2018. He is currently working towards the PhD degree at the Swinburne University of Technology. His research interests include binary code analysis, especially vulnerability detection.
Lei Pan received the Ph.D. degree in computer forensics from Deakin University, Australia, in 2008. He is currently a Senior Lecturer with the Centre for Cyber Resilience and Trust (CREST), School of Information Technology, Deakin University, where he leads the research theme ‘Securing Data and Infrastructure’. His research interests cover broad topics in cybersecurity and privacy. He has authored 100 research papers in refereed international journals and conferences, such as IEEE Transactions on Information Forensics and Security, IEEE Transactions on Dependable and Secure Computing, IEEE Transactions on Industrial Informatics, and many more.
Chao Chen received his PhD degree in Information Technology from Deakin University in 2017. He focuses on using AI and advanced analytics to solve real-world problems, such as network traffic analysis for abnormal behaviour, social spam detection, insider threat detection, and software vulnerability discovery. He also conducts research on responsible AI, such as the transparency and trustworthiness of AI applications in enterprises. He has published more than 40 research papers in refereed international journals and conferences (including 16 Q1 journal papers), such as ACM Computing Surveys (CSUR), IEEE Transactions on Information Forensics and Security (TIFS), Privacy Enhancing Technologies Symposium (PETS), and ACM Asia Conference on Computer & Communications Security (ASIACCS). One of his papers was the featured article of its issue (IT Professional, Mar.–Apr. 2016).
Muhammad Ejaz Ahmed is a Senior Research Scientist at Data61, CSIRO, Australia’s national science agency. His research interests include malware (ransomware) detection, threat hunting, digital forensics, program analysis, natural language processing, and machine learning. He received the B.Sc. degree in computer sciences from Peshawar University and the M.S. degree from the National University of Sciences and Technology (NUST), Pakistan, in 2006 and 2011, respectively. He completed the Ph.D. degree in the Electronics and Radio Engineering Department at Kyung Hee University, South Korea, in February 2014. He was a postdoc at POSTECH, South Korea, from June 2014 to May 2015, and a research professor at Sungkyunkwan University, South Korea, from June 2015 to May 2018.
Shigang Liu (M’15) received his PhD in Computer Science from Deakin University, Australia, in 2017. He is currently a research fellow at the School of Science, Computing and Engineering Technologies, Swinburne University of Technology. His research primarily focuses on data-driven software security, network security, applied machine learning, and fuzzy information processing. In 2019, his research won first place in the World Change Maker Prize at the Swinburne Research Conference. He has served as Program Chair for various international conferences, such as CSS 2017/2020/2022, NSS 2020/2022, ML4CS 2019, SocialSec 2022, and IEEE Blockchain 2022.
Jun Zhang (M’12-SM’18) received the Ph.D. degree in Computer Science from the University of Wollongong, NSW, Australia, in 2011. He is currently a full Professor and the Director of the cybersecurity lab at Swinburne University of Technology, Australia. He was recognized in The Australian’s top researchers special edition as the leading researcher in the field of Computer Security & Cryptography in 2020. He has led Swinburne’s cybersecurity research, producing many high-impact research papers and multi-million-dollar research projects; Swinburne was named the top research institution in the field of Computer Security & Cryptography in The Australian’s 2021 Research magazine. Since 2019, he has served as a steering committee member of the Australian Government-funded P-TECH program in Melbourne, which promotes STEM education. He devotes himself to communication and community engagement, boosting awareness of cybersecurity.
Dongxi Liu is a Principal Research Scientist in CSIRO’s Data61, having joined CSIRO in 2008. His research interests include applied cryptography, post-quantum cryptography, and distributed system security. His work aims to design and build secure systems that are scalable, simple to use, highly trustworthy, and resilient to attacks.