
Towards Practical Binary Code Similarity Detection: Vulnerability Verification via Patch Semantic Analysis

Published: 30 September 2023

Abstract

Vulnerability is a major threat to software security. It has been proven that binary code similarity detection approaches are efficient at searching for recurring vulnerabilities introduced by code sharing in binary software. However, these approaches suffer from high false-positive rates (FPRs) since they usually take patched functions as vulnerable, and they usually do not work well when binaries are compiled with different compilation settings.
To this end, we propose an approach, named Robin, to confirm recurring vulnerabilities by filtering out patched functions. Robin is powered by a lightweight symbolic execution that solves the set of function inputs that can lead to the vulnerability-related code. It then executes the target functions with the same inputs to capture the vulnerable or patched behaviors for patched function filtration. Experimental results show that Robin achieves high accuracy for patch detection across different compilers and compiler optimization levels on 287 real-world vulnerabilities from 10 different software projects. Based on accurate patch detection, Robin significantly reduces the false-positive rate of state-of-the-art vulnerability detection tools (by 94.3% on average), making them more practical. Robin additionally detects 12 new potentially vulnerable functions.

1 Introduction

Recurring vulnerabilities, also known as 1-day vulnerabilities [3], in open source libraries have spread widely due to code reuse and sharing [53] and have become one of the most significant threats in cyber-security. For example, the HeartBleed bug (CVE-2014-0160) discovered in OpenSSL, as a 1-day vulnerability, has influenced 24% to 55% of popular HTTPS websites worldwide [25]. There are two major ways to detect vulnerabilities: dynamic and static approaches. Among dynamic approaches, fuzzing [49, 51] is the traditional and most commonly used way to detect vulnerabilities within software. It executes the program with mutated inputs and monitors abnormal behaviors, which often suggest potential vulnerabilities. As a result, fuzzing can only test the code that program execution actually covers. As programs grow large, fuzzing can only exercise a small portion of the code to find vulnerabilities. Due to this limited code coverage, fuzzing cannot confirm whether each given function is vulnerable.
Static methods are superior to dynamic ones for detecting 1-day vulnerabilities because they yield fewer false negatives. Considering the widespread existence of 1-day vulnerabilities, minimal false negatives (i.e., fewer overlooked vulnerabilities) are becoming a significant factor. These works leverage the binary code similarity detection (i.e., function matching) technique [20, 23, 28, 30, 35, 39, 55] by extracting various kinds of signatures from the vulnerable functions to find similar functions and take them as potentially vulnerable. The method is effective for covering all functions in the entire program. However, existing approaches focus more on improving the accuracy of function matching results to precisely detect the candidate functions that include the vulnerability. For example, DiscoveRE [28] proposes multiple syntactic features and conducts matching between function control flow graphs (CFGs). Bingo [20] combines both syntactic and semantic information and applies a selective inlining strategy to achieve a more accurate matching. Genius [30] and Gemini [55] propose a new representation of a function, the Attributed CFG (ACFG), encode ACFGs into vectors using machine learning and deep learning technology, and achieve a faster and more accurate match. BinDiff\(_{NN}\) [50] proposes a Siamese classification embedding network to highlight function changes and measure the semantic similarity between functions. Although these approaches have high accuracy in matching functions with small changes, they tend to have a high false-positive rate (FPR) in finding vulnerabilities. The false-positive cases are due to program patches. Specifically, the vulnerable function may be patched so that the vulnerability is no longer present. Patches are usually small code changes compared to the size of the function [57]. The function matching algorithm is designed to be tolerant of subtle changes, so it will very likely match the patched function to the vulnerable function signature [56]. Thus, the vulnerability matching results will be a mix of patched and vulnerable functions, which are difficult to differentiate and need more careful and laborious confirmation by experts.
Syntactic-based patch presence detection algorithms have been proposed to filter out the patches and improve the detection accuracy. For instance, in Fiber [63], binary-level patch signatures are generated from source-level patches, and a signature match is conducted for a given binary function. The matching process in Fiber relies on syntactic features like CFGs and basic blocks, which are used to align instructions between target binaries and reference binaries. After alignment, Fiber generates symbolic constraints from patch-related code as patch signatures. PDiff [34] employs a distinct approach by identifying the patch-related anchor blocks, slicing the function CFG based on them, and then extracting symbolic formulas from these paths to generate a patch digest. By quantifying the similarity of patch digests among patched, vulnerable, and target functions, PDiff determines the presence of a patch in the target functions. BinXray [56] makes progress by performing detection using binary-level patches only. Specifically, it locates the patches in binary functions and extracts execution traces through patches as signatures. Then, it matches the signature in the given target function to confirm the patch presence. These approaches heavily rely on syntactic information, disregarding the potential semantic differences introduced by patches. Using only syntactic information can be precise in capturing subtle changes. However, when the source code is compiled into binary programs, the syntactic information can be easily changed by choosing different compilation settings [20, 30, 42]. The patch detection accuracy drops when matching signatures compiled under one setting against target binaries compiled under other settings. For instance, PDiff [34] fails to detect patches across binaries compiled with different optimizations because it relies on locating anchor blocks, which may be removed by optimization. Moreover, the results produced by function matching tools usually contain unrelated functions that are neither vulnerable nor patched [54]. Existing approaches cannot distinguish vulnerable and patched functions from unrelated ones. Therefore, these approaches may not be practical for real-world programs.
Based on the limitations of existing function matching tools that may tolerate subtle patched code changes, and the shortcomings of current patch detection methods, we propose four key capabilities that a patch detection method should possess for effective verification of binary vulnerabilities:
C1.
The ability to detect binary functions even in the absence of source-level debug information.
C2.
Scalability to handle large binary programs efficiently.
C3.
Precise identification of patched functions and vulnerable functions.
C4.
Accurate detection of patches across different compilation optimizations by considering semantic differences rather than just syntactic differences.
C1 is necessary as patch detection is intended to be performed after function matching tools, which do not require access to source code, and it enhances the results of function matching by effectively eliminating patched functions. C2 ensures that the patch detection method is scalable and can handle real-world applications with a large number of functions effectively. C3 allows for the precise identification of vulnerable and patched functions, which can refine the results obtained from function matching. C4 is essential for the patch detection method to be capable of handling differences caused by compilation settings, particularly compilation optimizations. Considering that real-world binary similarity detection techniques often yield functions compiled with different compilation optimizations, and optimization information is usually unavailable, it is imperative to have patch detection techniques that can accurately detect patches across different compiler optimizations [30, 55, 59].
To fulfill these four capabilities, we propose a semantic-based vulnerability confirmation tool called Robin. For C1, Robin identifies the patched code that fixes a vulnerability by comparing the vulnerable and patched functions at the binary level using a diffing technique. Additionally, Robin employs symbolic execution on binary functions to extract semantic features of the patch, enabling detection of patches or vulnerabilities in target binary functions. For C2, drawing inspiration from [16, 19], Robin uses lightweight symbolic execution to efficiently generate a malicious function input (MFI), a function input that drives execution from the function entry point to the patch code in patched functions or the vulnerability code in vulnerable functions. For C3 and C4, Robin incorporates a run-time monitor that checks whether the input triggers the same vulnerable behaviors (e.g., null pointer dereference (NPD)) in the target function or the same patched behaviors to confirm the presence of a vulnerability or patch. Furthermore, Robin captures semantic behaviors from execution traces and uses behavior summaries (i.e., semantic features) to determine whether the target function is vulnerable or patched. Compared to Fiber’s reliance on source code and syntactic feature matching, Robin generates patch signatures solely from binaries and utilizes MFI to extract possible vulnerability semantics without requiring syntactic information. This allows Robin to achieve better performance in patch presence detection across different optimization levels.
We have implemented a prototype of Robin and made it open source [10]. Our evaluation of Robin on 287 real-world vulnerabilities in 10 software projects from different application domains shows that our method achieves an average accuracy of 80.0% and 80.5% for patch detection and filtering, respectively, across different optimization levels and compilers. These results outperform the state-of-the-art patch detection tools BinXray, PMatch, and Fiber by large margins. Furthermore, we have used Robin to filter out patches in the results produced by other function matching tools, namely Gemini [55], SAFE [42], and Bingo [20], and the results demonstrate a significant reduction in FPRs by 95.13%, 92.31%, and 95.36%, respectively, while also improving recall. Additionally, Robin has detected 12 new potentially vulnerable functions in our experiments.
In summary, our work makes the following contributions:
We conduct a study (Section 2.1) on function-matching-based vulnerability detection tools to demonstrate their inability to distinguish patches and vulnerabilities.
We propose MFI, a carefully crafted function input that steers function execution to patched or vulnerable code. In addition, building MFI and using it for patch detection does not necessitate debug information from source code, making our tool more scalable in binary patch detection since patch source code is not always available.
We implement a prototype, Robin, by employing MFI to detect the patch presence and verify the vulnerabilities across different optimizations, and open-source it [10].
We evaluate Robin on 287 real-world vulnerabilities to show that Robin can identify the target patched functions with 80.0% accuracy in 0.47 seconds per function.
We conduct patch detection on candidate functions output by function matching. The results show that after detection, the FPRs of function matching tools are significantly reduced by an average of 94.27% in the top 10 results. Besides, Robin detects 12 new potentially vulnerable functions.

2 Background Information

2.1 Motivation Study on Function Matching

To expose the limitations of existing binary-level function matching tools in distinguishing between patched and vulnerable functions, we conducted a study using real-world applications. These function matching tools utilize known vulnerable functions as signatures to identify similar functions in the target binary program. Therefore, we followed the same approach and re-implemented three state-of-the-art binary matching tools: Gemini [55], SAFE [42] (a remarkable self-attentive embedding-based method), and Bingo [20] (a scalable and robust cross-optimization method) for evaluation purposes.
We selected three well-known projects, namely OpenSSL, Freetype, and OpenSSH, for our study. These projects have been commonly used for evaluation in previous works [20, 42, 55]. We collected a total of 54 CVEs for OpenSSL, 38 CVEs for Freetype, and 5 CVEs for OpenSSH (details in Section 4.1.1). To ensure that the number of vulnerable and patched functions is comparable, we selected a specific version for each project. As a result, we chose OpenSSL-1.0.1l (with 28 vulnerable functions and 26 patched functions), Freetype-2.4.10 (with 20 vulnerable functions and 18 patched functions), and OpenSSH-6.9p1 (with 3 vulnerable functions and 2 patched functions). We then ran the function matching tools to search for vulnerabilities in these projects, which measured the similarity score and ranked the target functions accordingly.
We utilized two metrics, Recall and FPR, to evaluate the performance of function matching tools in our study. Recall is calculated as \(Recall = \frac{TP_{v}}{N_{v}}\), where \(TP_{v}\) represents the number of instances in which a tool accurately detects a vulnerability, and \(N_{v}\) represents the total number of vulnerable functions in the target binary. FPR is calculated as \(FPR = \frac{FP_p}{N_p}\), where \(N_p\) denotes the number of patched functions in the target binary, and \(FP_p\) represents the number of cases where a tool reports a patched function as vulnerable. Note that our evaluation setting and criteria are different from those used in the original articles of the function matching tools we compared against (Gemini [55], SAFE [42], and Bingo [20]). The original articles typically used datasets containing homologous and non-homologous functions, where non-homologous functions are diverse and easy to recognize. However, in our study, we focused on challenging scenarios where vulnerable and patched functions are difficult for function matching tools to identify. Therefore, we adopted stricter assessment criteria, considering a match between a patched function and a vulnerable function as a false positive.
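To make these two metrics concrete, the short sketch below computes Recall@top-K and FPR@top-K from ranked candidate lists; the input layout (a dictionary mapping each vulnerability query to the labels of its ranked candidates) is our own assumption for illustration, not the tools' actual output format.

```python
# Hedged sketch: Recall@top-K and FPR@top-K as defined above.
# `results` maps each queried CVE to the labels of its ranked candidates
# ("vuln", "patched", or "other"); this layout is assumed for illustration.

def recall_at_k(results, n_vuln, k):
    # TP_v: queries whose true vulnerable function appears in the top-K.
    tp_v = sum(1 for ranked in results.values() if "vuln" in ranked[:k])
    return tp_v / n_vuln

def fpr_at_k(results, n_patched, k):
    # FP_p: queries that report a patched function inside the top-K.
    fp_p = sum(1 for ranked in results.values() if "patched" in ranked[:k])
    return fp_p / n_patched

demo = {"CVE-2015-0288": ["patched", "vuln", "other"]}
print(recall_at_k(demo, n_vuln=1, k=2), fpr_at_k(demo, n_patched=1, k=1))  # 1.0 1.0
```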
Table 1 demonstrates that the three BCSD tools achieve high recall rates for vulnerability detection, with recall@top-1 exceeding 82% and recall@top-5 exceeding 89%. However, it can also be observed that all the tools have relatively high FPRs, with over 67% at top-1, and the FPRs continue to increase as the value of K (top-K) increases. Upon manual confirmation, it was found that Bingo reports 43 patched functions as vulnerable by placing them in the top five most vulnerable candidates out of 46 test cases. However, false positive cases are not evenly distributed in the top 10, as none are ranked between positions 5 and 10. As a result, the number of false positives in the top 5 and top 10 are the same, i.e., 43, because the remaining three patched functions are ranked outside of the top 10. This suggests that these approaches may not be practical for the vulnerability detection task in real-world scenarios. The high FPRs indicate that the time cost of further verification outweighs the time saved from function matching. Therefore, it is critical to have an automatic vulnerability verification method to reduce the FPR so that function matching approaches can be practical in real-world vulnerability detection tasks.
Table 1.

BCSD Tools | Recall@Top-1 | Recall@Top-5 | Recall@Top-10 | FPR@Top-1 | FPR@Top-5 | FPR@Top-10
Gemini     | 82.14%       | 89.28%       | 89.28%        | 67.39%    | 80.43%    | 89.13%
Safe       | 85.71%       | 92.85%       | 92.85%        | 71.74%    | 82.61%    | 84.78%
Bingo      | 89.28%       | 92.85%       | 92.85%        | 91.30%    | 93.48%    | 93.48%

Table 1. Recall and FPR of BCSD Tools in Study

2.2 Preliminaries

2.2.1 Terms and Definitions.

In this section, we summarize the frequently used terms in this work.
Under-constrained Symbolic Execution. Symbolic execution is a means of static program analysis that is able to execute a program without concrete inputs. Under-constrained symbolic execution [27] is a variant of symbolic execution and is able to execute code from any place in the program.
Path Constraint. A path constraint is the condition of input values that leads the program execution to follow the corresponding path.
Vulnerability Point. A vulnerability point is the location where the vulnerability occurs (e.g., program crashes) in the binary function.
Mitigation Point. A mitigation point is the basic block where the vulnerability is fixed in a patched function. There may be more than one changed block due to patching. We choose the block that handles the error (e.g., code that stops the execution, raises an exception, or jumps to other general error-handling code) [41] as the mitigation point.
Feasible Path. A feasible path is a path whose corresponding constraints do not conflict with each other. For example, if an execution path contains conflicting constraints, such as “\(argument_1 \gt 0\)” and “\(argument_1 \lt 0\)”, the path is infeasible.
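As a minimal illustration, path feasibility can be checked with the Z3 solver (the theorem prover Robin's implementation builds on, see Section 4); the variable name `argument_1` mirrors the example above and is purely illustrative.

```python
# Hedged sketch: deciding path feasibility with Z3.
from z3 import Int, Solver, sat

def path_is_feasible(path_constraints):
    s = Solver()
    s.add(*path_constraints)
    return s.check() == sat

argument_1 = Int("argument_1")
print(path_is_feasible([argument_1 > 0]))                   # True: feasible path
print(path_is_feasible([argument_1 > 0, argument_1 < 0]))   # False: conflicting constraints
```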
Listing 1.
Listing 1. Assembly Code of Patched Function in Figure 1.
Fig. 1.
Fig. 1. Different function execution traces (green dotted lines) when fed the same special function input.

2.3 Key Concept Demonstration

We use CVE-2015-0288 [2] to explain the idea of our work. CVE-2015-0288 is an NPD vulnerability in function \(X509\_to\_X509\_REQ\) of OpenSSL versions before 1.0.1m, which means the vulnerability exists in OpenSSL 1.0.1l and is patched in OpenSSL 1.0.1m. From the study in Section 2.1, we know function matching tools tend to match the patched function with the vulnerable function, which results in false positives when detecting the vulnerability in OpenSSL 1.0.1m.
To address this problem, we leverage the subtle semantic differences caused by the patches to precisely identify patched and vulnerable functions. We use the function CFGs in Figure 1 to explain our key concept. These two partial CFGs are constructed from a vulnerable function (CFG on the left) and a patched function (CFG on the right), respectively. The block V1 in the vulnerable CFG is the vulnerability point (i.e., where the vulnerability exists). The blocks P1, P2, and P3 in the patched CFG are patch blocks, where P2 is the mitigation point. The assembly code for these three basic blocks is shown in Listing 1. Block P1 contains a check for the pointer not being null (line 4). If the check detects a malicious value (i.e., 0 in the target memory in line 4), the execution steps into the error handling block P2. Otherwise, the execution steps through block P3 and resumes the normal operation.
To summarize the semantic differences, we first determine a function input \(I\), which can lead the execution of the patched function to the mitigation point. The same function input \(I\) can also lead the execution to the vulnerability point in the vulnerable function. We call such function input MFI. By executing the vulnerable and patched functions with MFI, we can monitor and collect the run-time semantic differences. Since the semantic information will not change across binaries compiled from different optimization levels, Robin can identify the patched and vulnerable functions across different compilation settings.

2.4 Assumptions

The work is predicated on the assumption that there is at least one distinct function input that can drive execution to both the mitigation point in the patched function and the vulnerability point in the vulnerable function. We take the function execution on the CFGs in Figure 1 as an example. To begin with, we assume a function input \(I\) exists that drives execution to the mitigation point (P2) in the patched function (CFG on the right). This function input fails the security check (P1) and leads the execution to the error handling block (P2). Clearly, the code in the patched function ahead of the patch block (P1) and in the vulnerable function ahead of the vulnerable block (V1) is identical (i.e., blocks C1, C2, and C3). Thus, input \(I\) also drives execution in the vulnerable function to C3 and then leads execution to the vulnerability point (V1). We chose block P2 as the mitigation point rather than P3 since the execution through block P2 is the most dissimilar to the execution trace in the vulnerable function.

3 Design of Robin

3.1 Workflow Overview

Figure 2 shows the overview of Robin. The core of Robin is to construct the MFI via under-constrained symbolic execution. Robin generates semantic features with MFI to prove the presence of vulnerable code or patched code. In offline mode, Robin takes the vulnerable function and its patched version as inputs for a given vulnerability. It creates MFI and patch signatures based on these two functions. For online detection, Robin utilizes the generated MFI and signature to report whether a target binary function is patched, vulnerable, or irrelevant to vulnerabilities.
Fig. 2.
Fig. 2. Overview of Robin. MFI is short for malicious function input. Func is short for function. V&P stands for Vulnerability&Patch.
First ➀, Robin compares the differences between vulnerable and patched functions to locate all of the patch blocks. Second ➁, for each patch block, Robin finds a feasible path by building function inputs that can drive the function execution to the patch block. Third ➂, it produces signatures for every function input by executing the vulnerable and patched functions, and the most distinctive function input is selected as MFI based on these function signatures. Last ➃, Robin feeds the MFI to the target binary function for execution and extracts the semantic signature to confirm the vulnerability or patch.

3.2 Patch Localization

To generate MFI, Robin first needs to determine the mitigation points in the patched functions that fix the vulnerability (e.g., the red block on the right CFG of Figure 1). Since a mitigation point is a patch block or a subsequent block of patch blocks, Robin first localizes all of the patch blocks by comparing the differences between the vulnerable and patched functions. Moreover, since the compiler will introduce unexpected changes while compiling, which affect the comparison result, we apply normalization before comparison to improve the accuracy.

3.2.1 CFG Normalization.

The CFG normalization aims at reducing the impact of side effects introduced by the compiler. The side effects, such as the changes in instruction address offsets and jump target addresses, make the instructions different at the syntax level, even if they are compiled from the same source code. When diffing the functions, these compiler-introduced changes are regarded as noise. In this work, Robin applies four different instruction normalization rules: (i) register replacement, (ii) memory reference replacement, (iii) address replacement, and (iv) constant replacement. The examples are listed below:
(i)
Register replacement replaces all register names with “REG”: “push ebp” -> “push REG”.
(ii)
Memory reference replacement replaces all memory references with “MEMACC”: “mov eax, [ebp-0x10]” -> “mov eax, MEMACC”.
(iii)
Address replacement replaces the immediate addresses with “ADDR”: “call 0x400567” -> “call ADDR”.
(iv)
Constant replacement replaces constants with “CONSTANT”: “sub esp, 0x98” -> “sub esp, CONSTANT”. Note that since the constant values in the sanity check instruction carry important information related to the vulnerability, we do not normalize this kind of constant.
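The sketch below shows one possible regex-based implementation of these four rules on textual x86 instructions; the exact patterns (the register set, what counts as an address versus a constant) are simplifying assumptions rather than Robin's actual normalizer.

```python
# Hedged sketch of the four normalization rules on textual x86 instructions.
import re

REG = r"\b(e?[abcd]x|e?[sd]i|e?[bs]p|[abcd][lh])\b"

def normalize(ins):
    mnemonic = ins.split()[0]
    ins = re.sub(r"\[[^\]]+\]", "MEMACC", ins)                # (ii) memory references
    ins = re.sub(r"\b0x[0-9a-fA-F]{6,}\b", "ADDR", ins)       # (iii) immediate addresses
    if mnemonic not in ("cmp", "test"):                       # keep sanity-check constants
        ins = re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "CONSTANT", ins)  # (iv) constants
    return re.sub(REG, "REG", ins)                            # (i) registers

print(normalize("push ebp"))              # push REG
print(normalize("mov eax, [ebp-0x10]"))   # mov REG, MEMACC
print(normalize("call 0x400567"))         # call ADDR
print(normalize("sub esp, 0x98"))         # sub REG, CONSTANT
print(normalize("cmp eax, 0"))            # cmp REG, 0
```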

3.2.2 Patch Block Selection (PBS).

After normalization, Robin identifies the patch blocks by diffing the normalized CFGs. To get a more precise diff result, we apply the basket algorithm [56] in Robin. This algorithm begins by computing the hash values of all blocks in the CFGs. The blocks with equal hash values are then placed in the same basket. It chooses patch blocks by inspecting each basket. If the number of basic blocks in a basket is even, the algorithm considers them matched blocks and skips further inspection of that basket. For the remaining baskets, it matches block pairs based on their context (i.e., predecessors and successors). The unmatched ones are regarded as patch blocks.
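The following is a deliberately simplified, hash-only sketch of this basket-style diff; the context matching of leftover blocks by predecessors and successors is only indicated in a comment, and the data layout is an assumption.

```python
# Hedged sketch: flag patched-function blocks whose normalized content never
# appears in the vulnerable function (a simplified basket-style diff).
from collections import Counter
from hashlib import md5

def block_hash(instructions):              # instructions are assumed normalized
    return md5("\n".join(instructions).encode()).hexdigest()

def select_patch_block_candidates(vuln_blocks, patched_blocks):
    """Both arguments: {block_addr: [normalized instruction strings]} (assumed)."""
    vuln_hashes = Counter(block_hash(ins) for ins in vuln_blocks.values())
    candidates = []
    for addr, ins in patched_blocks.items():
        h = block_hash(ins)
        if vuln_hashes[h]:
            vuln_hashes[h] -= 1            # identical counterpart exists: matched
        else:
            # The full algorithm would first try to match leftovers by their
            # predecessors/successors before declaring them patch blocks.
            candidates.append(addr)
    return candidates
```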
There are two scenarios in which Robin cannot generate MFI from patch blocks. First, the patch is too simple because it only changes conditions without adding error handling logic, which results in no mitigation point. Second, when the patch block is located at the function entry, Robin cannot collect any constraints to solve a function input since the constraints are built from branch execution. For example, in Listing 1, the branch constraint “[ebp+pktmp] == 0” is built when execution moves from block P1 to block P3. Thus, if the patch block is located at the function entry, no branch execution occurs, and Robin cannot get branch constraints to build a function input. To address these two scenarios, we design two extension strategies for the patch block set:
S1.
If a patch block contains a conditional check and one of its successor blocks is an error-handling block, Robin adds the error-handling block to the patch block set.
S2.
If the patch block set contains only one patch block and that block is located at the function entry, Robin adds one of its successor blocks to the patch block set.
Note that if the CFG of the patched function consists of only one block, that block is considered to be a patch block.

3.3 Function Input Building

Robin hunts feasible paths (as defined in Section 2.2.1) leading to patch blocks and builds a function input from each path. Compared to the fuzzing technique [51], symbolic execution is more efficient and effective in reaching a target code position. This is because fuzzing generates random inputs and hopes that some of them happen to lead execution to the patch blocks, whereas symbolic execution can directly traverse paths that end at patch blocks without providing a concrete input, requiring far fewer execution attempts. Thus, Robin utilizes the symbolic execution technique to build function inputs.

3.3.1 Feasible Path Hunting (FPH).

Given a patch block, Robin uses a lightweight symbolic execution technique to search for a feasible path from the function entry block to it. For instance, in Figure 1, the sequences of basic blocks connected by the green polylines are feasible paths found by Robin. Since we only perform symbolic execution within a function, we adopt under-constrained symbolic execution. Under-constrained symbolic execution executes an arbitrary function within the program, effectively skipping the costly path prefix from the program’s main entry to the target function. However, even with under-constrained symbolic execution, traversing all possible execution paths is still inefficient and ineffective for this task due to path explosion and time-consuming constraint solving [27]. To minimize the time cost, we adopt drill-down lightweight symbolic execution to boost path hunting.
First, we conduct a static analysis to choose the shortest paths between the patch block and the function entry as candidate paths. We execute them using the symbolic execution technique to solve the path constraints by finding a concrete value for each symbolic variable. For example, if the path constraint contains “\(x \gt 1\), \(x \lt 5\)” and the variable \(x\) is an integer, the solution to \(x\) can be any of 2, 3, or 4. If the path constraints contain conflicting conditions (e.g., “\(x\gt 1\)” and “\(x\lt 0\)”), then the symbolic variable \(x\) is unsolvable and the path is infeasible. If this happens, we select another shortest path and continue constraint solving on the new path.
Second, if all the shortest paths are infeasible, we utilize the backward slicing technique to prune the function CFG to get more valid paths. To conduct slicing, we choose all variables in the check blocks as the slicing start points and find all relevant variables (either register values or memory values) by backward data flow analysis. Then, we remove the basic blocks that do not contain relevant variables from the CFG, and the remaining blocks form a partial CFG. We repeatedly search for the possible paths in the partial CFG and apply the same path feasibility verification method to check whether they are feasible.
Last, if the first two steps do not yield a feasible path, we perform a time-consuming blind symbolic execution to find a path from the function entry to the patch block. A blind symbolic execution is an exploration of all possible paths in the target function without any concrete values specified in registers or memory spaces. To speed up the execution, a loop detector is adopted to prevent the symbolic execution from getting stuck in a loop by limiting the maximum number of times a loop can be executed. Note that Robin avoids stepping into function calls during path hunting, which will be explained in Section 3.4. With this drill-down approach, we can conduct an efficient feasible path search for the selected patch blocks. For example, in Figure 1, we can find a feasible path for each of the patch blocks P1, P2, and P3. Thus, these three blocks are all selected for the next step.
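The drill-down search can be sketched as follows; the CFG is assumed to be a networkx graph, `constraints_of` stands for the (elided) per-path symbolic execution that collects Z3 constraints, and the slicing and blind-execution stages are only indicated in comments.

```python
# Hedged sketch of drill-down feasible path hunting over a CFG.
import networkx as nx
from z3 import Solver, sat

def feasible(path_constraints):
    s = Solver()
    s.add(*path_constraints)
    return s.check() == sat

def hunt_feasible_path(cfg, entry, patch_block, constraints_of, max_tries=50):
    """cfg: nx.DiGraph of basic blocks; constraints_of(path) returns the Z3
    constraints collected by symbolically executing that path (assumed helper)."""
    # Stage 1: shortest entry-to-patch-block paths first (in increasing length).
    for i, path in enumerate(nx.shortest_simple_paths(cfg, entry, patch_block)):
        if i >= max_tries:
            break
        cons = constraints_of(path)
        if feasible(cons):
            return path, cons
    # Stage 2 (elided): repeat on a CFG sliced to blocks touching check variables.
    # Stage 3 (elided): fall back to blind symbolic exploration with a loop limit.
    return None, None
```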

3.3.2 Input Resolution (IR).

After finding a feasible path for the given patch block, Robin builds a function input by solving the symbolic variables from the path constraints. The function input components are listed below:
(1)
Function Parameters (\(FP\)). Function parameters are divided into two types: Numeric and pointer type. A numeric-type parameter is a constant value. A pointer type parameter refers to a memory region where a structure is located. The format of \(FP\) is \([[value, length],]\), where value is the numeric or the pointer, and length is the value size.
(2)
Global Variables (\(GV\)). Global variables are values defined outside of functions. The format of \(GV\) is \([address: [value, length]]\), where address is the memory address of global variables, and length is the byte size of global variables.
(3)
Partial Function Call Return Values (\(RV\)). Return Values of function calls are values that callee functions return along the feasible path. Since Robin only resolves the return values of function calls along the feasible path, not all return values are analyzed. Thus, we call these return values Partial Function Call Return Values. Similarly, the format of \(RV\) is [[call_flag, return_value, length],], where \(call\_flag\) is an identifier for the function call and is denoted with callee function names.
Input Initialization with Symbols. Inspired by [44], Robin sets all function input variables to symbolic values prior to or during symbolic execution. Robin initially maps the heap and stack memory areas to fixed memory areas in order to locate the function input variables in memory. Robin initializes the function input as described below:
(1)
Function Parameters (\(FP\)). We assume that the target functions have a maximum of \(N_p\) function parameters. Robin sets the first \(N_p\) function parameters to symbolic values “\(arg_1\)”, “\(arg_2\)”, \(\ldots\), “\(arg_{N_p}\)”.
(2)
Global Variables (\(GV\)). Because the binary code accesses global variables via a base address and an offset (e.g., “mov ds:offset, 0”), Robin allocates all global variables to a separate memory area by setting the data segment register to a constant value of \(DS\).
(3)
Partial Function Call Return Values (\(RV\)). For “libc” library function calls, Robin executes the hooked versions of “libc” library function calls, which are implemented by the symbolic execution engine Angr [1]. For non-libc function calls, Robin directly steps over the function call and sets the return value (i.e., the return register) to a symbolic value \(call\_flag\). For example, recall the motivation example in Listing 1: supposing the instruction “\(call\ X509\_get\_pubkey\)” in block \(P2\) is located at address “0x8164B33”, Robin will set the return register “\(eax\)” to a symbolic value with name “\(ret\_X509\_get\_pubkey\_8164B33\)”.
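A minimal angr-based sketch of this initialization is shown below; the fixed stack base, the assumed maximum parameter count, and the 32-bit CDECL-style stack layout are illustrative assumptions rather than Robin's exact setup.

```python
# Hedged sketch: symbolic initialization of function inputs with angr/claripy.
import angr
import claripy

N_P = 6                     # assumed maximum number of function parameters
STACK_BASE = 0x7fff0000     # assumed fixed stack base (the M_b of the running example)

def make_entry_state(proj, func_addr):
    state = proj.factory.blank_state(addr=func_addr)
    state.regs.esp = STACK_BASE
    state.regs.ebp = STACK_BASE
    # Lay out symbolic parameters arg_1..arg_N above the return address
    # (a 32-bit CDECL-style layout is assumed here).
    for i in range(N_P):
        arg = claripy.BVS("arg_%d" % (i + 1), 32)
        state.memory.store(STACK_BASE + 4 * (i + 1), arg,
                           endness=state.arch.memory_endness)
    return state

class SymbolicReturn(angr.SimProcedure):
    """Step over a non-libc callee and return a fresh symbolic call_flag value."""
    def run(self):
        name = "ret_%s_%x" % (self.display_name, self.state.addr)
        return claripy.BVS(name, self.state.arch.bits)
```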
Robin conducts path hunting and generates feasible paths for patch blocks. Then Robin collects the feasible path constraints and utilizes them to solve the variables of the function input components. Given that feasible paths are frequently only fragments of the whole function execution, path constraints alone are insufficient to direct the full function execution to the function exit. Considering that input-driven execution (Section 3.4.1) continues after traversing the feasible path, Robin must ensure that it does not introduce new execution faults, which may be caused by assigning sensitive values to function input variables (e.g., assigning NULL to a pointer). To this end, Robin follows several principles to solve a safe function input:
P1.
If a symbolic input variable has many possible values, Robin attempts to avoid taking 0 (or NULL).
P2.
Robin attempts to interpret function parameter variables as pointers to a structure.
P3.
If the patched function consists of only one block and no path constraint is built from execution, Robin will assign concrete values to the function input at random.
The reason for P1 is that when a symbolic input variable is solved to 0 rather than other feasible values, Robin may dereference it and report a false alarm of NPD by the technique in Section 3.5.2. The reason for P2 is that if a function parameter variable is not constrained to be a small constant value, Robin sets the parameter to a heap address so that execution can continue even if the parameter is dereferenced. With these three principles, Robin solves all input variables to build the function input. The structures pointed to by parameters can be constructed from the path constraints.
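Principle P1 can be approximated by preferring a non-zero model when one exists, for example with Z3 as sketched below; this is only one way to realize the principle, not Robin's solver code.

```python
# Hedged sketch of P1: prefer non-zero models for symbolic input variables.
from z3 import BitVec, Solver, sat

def solve_safe_input(path_constraints, variables):
    s = Solver()
    s.add(*path_constraints)
    values = {}
    for v in variables:
        s.push()
        s.add(v != 0)                      # P1: try to avoid 0 / NULL first
        if s.check() != sat:
            s.pop()
            s.push()                       # fall back: accept any feasible value
        assert s.check() == sat
        values[v] = s.model().eval(v, model_completion=True)
        s.pop()
        s.add(v == values[v])              # pin the choice for later variables
    return values

ret = BitVec("ret_X509_get_pubkey", 32)
print(solve_safe_input([ret >= 0], [ret]))
```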
Solution for External Objects. For the case where a pointer points to an object outside the scope of the current function, Robin models the object by solving the constraints involving object field values used in the function computation. We use the function depicted in Figure 3 to illustrate our solution for external objects, specifically pointer dereferences (i.e., p->offset_1). In the first step, Robin allocates a symbolic memory region for the external object “Outside_obj” and makes the pointer “p” point to the allocated memory region. Each field in the object is allocated a sufficient amount of memory (in our practice, 1024 bytes). The symbolic value “Symbolic Value 1” is loaded from the external object in the second step. In the third step, the symbolic value is incorporated into a constraint by executing the prepared path from Section 3.3.1. Robin determines a concrete value for the object field “offset_1” in the fourth step. Note that the field value may not be utilized in a predicate statement, resulting in it not being involved in path constraints. For example, the function may load a field and return its value directly. In such cases, Robin assigns a random value to the field because it does not affect the execution of the code (i.e., it is not used in a predicate to guide the code execution). After the external object's fields are assigned the solved concrete values, the solved external object is recorded in MFI and can be used for function execution in patch detection.
Fig. 3.
Fig. 3. Workflow of external objects solving.
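The object-modeling step can be sketched with claripy alone: give every field of the outside object a fresh symbol, then concretize the fields that actually appear in the collected path constraints and fill the rest with arbitrary values. Field offsets, sizes, and the filler value are illustrative assumptions.

```python
# Hedged sketch: modeling an external object reached through a pointer parameter.
import claripy

FIELD_BITS = 1024 * 8       # each field gets a generous 1024-byte symbolic slot

def model_external_object(field_offsets):
    # One fresh symbol per field of "Outside_obj".
    return {off: claripy.BVS("Outside_obj_off_%x" % off, FIELD_BITS)
            for off in field_offsets}

def concretize_fields(obj, path_constraints):
    solver = claripy.Solver()
    solver.add(path_constraints)
    concrete = {}
    for off, sym in obj.items():
        used = any(sym.args[0] in c.variables for c in path_constraints)
        # Fields used in a predicate are solved from the constraints; unused
        # fields get an arbitrary filler since they cannot steer execution.
        concrete[off] = solver.eval(sym, 1)[0] if used else 0x41414141
    return concrete
```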
Running Example. To illustrate the techniques and the generated function input, we take the partial function CFG in Listing 1 as an example. We assume that the feasible path goes through block P1 to block P3. In Input Initialization with Symbols, Robin sets the register \(ebp\) to a fixed value \(M_b\) to locate the stack memory. When executing instruction “\(push\, [ebp+x]\)” in block P1, “\(ebp+x\)” refers to a specific memory address “\(M_b+x\)” rather than a symbolic memory address. For function call return value initialization, Robin assigns the symbolic value “\(ret\_X509\_get\_pubkey\_8164B33\)” to \(eax\), which holds the return value after executing instruction “\(call\, X509\_get\_pubkey\)”. Next, the condition statement “\(cmp\, [ebp+pktmp],0\)” compares the return value with the constant 0. The execution steps into block P2 of the true branch with a new constraint, “\(ret\_X509\_get\_pubkey\_8164B33 == 0\)”. After the path constraints are solved, we get the concrete return value of the call, i.e., “\(r_{X509\_get\_pubkey\_8164B33} = 0\)”. The final function input \(I\) for the feasible path is shown as follows:
\begin{equation} \begin{aligned}I := \lbrace &FP := [arg_1 = 0, arg_2 = None, arg_3 = None, \ldots ],\\ &GV := [\text{``0xf0000001''}: [0xf0010000, 4], \ldots ],\\ &RV := [[\text{``}r_{X509\_REQ\_new\_8164A93}\text{''}, 0xf0000001],\ldots ,\\ &[\text{``}r_{X509\_get\_pubkey\_8164B33}\text{''}, 0x0],\ldots ]\rbrace \end{aligned} \end{equation}
(1)

3.4 MFI Selection

Robin selects a function input that reaches the mitigation point as MFI. Recalling the assumption in Section 2.4, the MFI drives patched and vulnerable functions to execute different paths, and results in the greatest semantic difference between these two functions. For example, execution paths are totally different after the vulnerable or mitigation points, as shown in Figure 1. Therefore, Robin selects the mitigation point based on the semantic difference of function execution, with its function input as MFI. To this end, we design several semantic features to quantify the semantic difference. Given a function input, Robin performs input-driven execution to extract semantic features. Then, Robin selects the MFI based on the degree of execution semantic difference.

3.4.1 Input-driven Execution.

Given a function input \(I\), Robin initializes the run-time environments based on \(I\) and executes the function. Specifically, Robin performs three different operations:
O1.
Robin initializes function parameters according to \(FP\) in function input \(I\) before execution.
O2.
Robin initializes global variables according to \(GV\) in function input \(I\) before execution.
O3.
Robin pops a return value from \(RV\) instead of executing the callee function when handling a function call.
These three operations ensure that the patched or vulnerable function reaches at least the mitigation point or vulnerable point.
Function Call Handling. Since patched and vulnerable functions share the same execution trail preceding the patched or vulnerable point (e.g., C1, C2, and C3 in Figure 1), the function calls in this trail will be invoked with the same parameters and will result in the same semantics. Besides, analyzing these functions would incur additional time costs. Robin therefore steps over these function calls during the execution before the mitigation point or vulnerable point. When Robin encounters a function call, it checks the return value queue \(RV\) and pops a return value as the function call return. For example, in Listing 1, when the instruction “call X509_get_pubkey” is executed, Robin steps over it and sets “eax” to a concrete value from input \(I\). If Robin finds the queue \(RV\) empty, which means the execution has reached the vulnerable or mitigation point, it steps into the call.
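In angr, this step-over behavior can be approximated by hooking the relevant call targets with a SimProcedure that serves the pre-solved return values; the rv_queue kept in state.globals is assumed bookkeeping, and the real tool steps into the callee (rather than returning a fresh symbol) once the queue is exhausted.

```python
# Hedged sketch: serving pre-solved return values from the RV component of MFI.
import angr
import claripy

class PopReturnValue(angr.SimProcedure):
    """Replace a callee with the next queued (call_flag, value, length) entry."""
    def run(self):
        rv_queue = self.state.globals["rv_queue"] if "rv_queue" in self.state.globals else []
        if rv_queue:
            call_flag, value, length = rv_queue.pop(0)
            return claripy.BVV(value, length * 8)
        # Queue exhausted: execution has passed the mitigation/vulnerable point.
        # Robin steps into the callee here; this sketch returns a fresh symbol.
        return claripy.BVS("ret_unknown_%x" % self.state.addr, self.state.arch.bits)

def hook_callees(proj, callee_addrs):
    for addr in callee_addrs:
        proj.hook(addr, PopReturnValue())
```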
Robin continues to execute the target functions until one of the following conditions occurs: (1) The target function returns. (2) A vulnerability is detected. (3) At least one symbolic variable is contained in a branch condition predicate. (4) The target address of a jump is unknown (i.e., symbolic). Robin is expected to terminate the execution under condition (1) in the patched function or under condition (2) in the vulnerable function. However, Robin does not always terminate the execution as expected because the function input \(I\) may be insufficient to drive the execution after the mitigation point or vulnerable point. Since Robin resolves the function input from only one feasible path, certain function parameters or global variables may be absent from it. Thus, we introduce conditions (3) and (4) to terminate the execution.

3.4.2 Function Signature Extraction.

Robin extracts function signatures along the input-driven execution. We design four semantic features as a function signature to characterize the execution paths, as shown in Table 2. The function signature is denoted as \(\Delta = \lbrace MA, PC, AR, CA\rbrace\), where \(MA\) denotes the sequence of memory accesses, \(PC\) denotes the constants in comparison instructions (e.g., 0 in “\(cmp\ eax,\ 0\)”), \(AR\) denotes the sequence of arithmetic instructions such as “\(sub\)” and “\(add\)”. The last element, \(CA\), denotes the sequence of the arguments of all function calls. Robin incorporates a memory access monitor, which records the memory read and write activities as well as their associated memory addresses. Robin also recognizes the target instructions (i.e., comparison, arithmetic, and call instructions) to record the features.
Table 2.

Features            | Denoted | Format            | Note
Mem Acc Seq         | \(MA\)  | [(W/R, address),] | Write or read operations with memory addresses, e.g., [(R, 0xf00000)] from “\(mov\ eax,\ [0xf00000]\)”
Predicate Const Seq | \(PC\)  | [N,]              | The constant values in predicates of cmp instructions, e.g., [0,] from “\(cmp\ eax,\ 0\)”
Arith Seq           | \(AR\)  | [mnemonic,]       | Mnemonic sequence of arithmetic instructions, e.g., [sub,] from “\(sub\ eax,\ 2\)”
Func Call Arg Seq   | \(CA\)  | [(arg,),]         | The argument sequence of function calls, e.g., [(0x100,),] from “\(malloc(0x100)\)”

Table 2. Semantic Features in CVE Signature
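The MA feature can be collected with angr's inspection breakpoints as sketched below; recording PC, AR, and CA from comparison, arithmetic, and call instructions is analogous and omitted here.

```python
# Hedged sketch: recording the memory access sequence (MA) during execution.
import angr

def attach_memory_monitor(state, ma_sequence):
    def on_read(st):
        ma_sequence.append(("R", st.solver.eval(st.inspect.mem_read_address)))
    def on_write(st):
        ma_sequence.append(("W", st.solver.eval(st.inspect.mem_write_address)))
    state.inspect.b("mem_read", when=angr.BP_BEFORE, action=on_read)
    state.inspect.b("mem_write", when=angr.BP_BEFORE, action=on_write)
    return ma_sequence
```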

3.4.3 Input Selection (IS).

If patch blocks contain error-handling block(s), Robin only builds function input(s) for these error-handling block(s). As mentioned in Section 2.2.1, all error-handling blocks can be treated as mitigation points. When there is no error-handling block in patch blocks, Robin builds function inputs for all patch blocks and then selects MFI based on the degree of semantic difference generated by the function inputs. Specifically, given a function input \(I\), Robin performs input-driven execution on patched and vulnerable functions, respectively. After that, Robin extracts two function signatures \(\Delta _p\) and \(\Delta _v\) for patched function and vulnerable function. Robin calculates a semantic difference score \(S_{I}\) based on \(\Delta _p\) and \(\Delta _v\) for function input \(I\) according to Equation (2).
\begin{equation} S_{I} = 1-\frac{LCS(Concat(\Delta _p),Concat(\Delta _v))}{max(len(Concat(\Delta _p)), len(Concat(\Delta _v)))} , \end{equation}
(2)
where the function \(LCS(\cdot , \cdot)\) calculates the length of the longest common subsequence (LCS) [36] between two arrays, \(Concat(\cdot)\) concatenates all feature sequences in a function signature into an array, and \(len(\cdot)\) returns the length of an array. This equation utilizes the LCS of two signature arrays to measure the overlap between two executions. The less the overlap, the greater the execution difference caused by the function input, and the larger \(S_I\) is. Robin chooses the function input \(I\) with the maximum score \(S_I\) as MFI.
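Equation (2) translates directly into code once the feature sequences are concatenated; the LCS here is the standard longest common subsequence, and the signature layout is assumed to be a dictionary of the four feature lists.

```python
# Hedged sketch: semantic difference score S_I of Equation (2).
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def concat(sig):
    # sig = {"MA": [...], "PC": [...], "AR": [...], "CA": [...]} (assumed layout)
    return sig["MA"] + sig["PC"] + sig["AR"] + sig["CA"]

def semantic_diff_score(sig_p, sig_v):
    a, b = concat(sig_p), concat(sig_v)
    return 1 - lcs_len(a, b) / max(len(a), len(b), 1)   # guard against empty signatures
```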
Vulnerability Signature. After selecting MFI for a vulnerability, Robin creates the vulnerability signature by combining the patched function signature, the vulnerable function signature, and the MFI: \(Sv = \lbrace \Delta _p, \Delta _v, MFI\rbrace\).

3.5 Vulnerability&Patch Confirmation

In this step, we aim at verifying the vulnerability and patch existence based on vulnerability signature \(Sv\). First, Robin prepares the running environments by identifying the calling convention of target functions, allowing Robin to perform three operations in Section 3.4.1 precisely. Then Robin extracts the signatures during the execution of target functions and determines whether they are vulnerable or patched. Additionally, Robin incorporates a sanity checker for the purpose of detecting NPD vulnerabilities.

3.5.1 Calling Convention Identification.

The calling convention is a low-level scheme for how functions receive parameters from their caller functions. For example, CDECL is a type of convention in which subroutine parameters are all passed on the stack [12]. Whereas, in the convention FASTCALL, the first two parameters are passed into registers (i.e., ecx, edx). The calling convention changes according to different compilers and optimization levels used to build the binaries. To identify the calling convention, we perform a data flow analysis to find variables which are used before being assigned or initialized [17]. According to the way that the parameter variables are passed into the callee functions, we can tell the calling convention the target function follows. With the right calling convention identified, we can put the solved function parameters in the right memory positions or general registers.
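A lightweight version of this used-before-assigned analysis can be written with Capstone, treating a read of ecx or edx before any write as evidence of FASTCALL; this binary distinction is a simplification of the full calling-convention recovery.

```python
# Hedged sketch: guessing CDECL vs. FASTCALL from read-before-write registers.
from capstone import Cs, CS_ARCH_X86, CS_MODE_32
from capstone.x86 import X86_REG_ECX, X86_REG_EDX

def guess_calling_convention(code_bytes, addr):
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    md.detail = True                         # required for regs_access()
    written, read_first = set(), set()
    for ins in md.disasm(code_bytes, addr):
        regs_read, regs_written = ins.regs_access()
        read_first.update(r for r in regs_read if r not in written)
        written.update(regs_written)
    if {X86_REG_ECX, X86_REG_EDX} & read_first:
        return "FASTCALL"                    # parameters arrive in ecx/edx
    return "CDECL"                           # parameters assumed on the stack
```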

3.5.2 Confirmation.

Our primary goal is to verify the vulnerability or patch existence by running the target function to the vulnerable point or mitigation point with the MFI. Specifically, Robin performs input-driven execution on target functions with the given MFI as described in Section 3.4.1. During the execution, Robin extracts the function signature \(\Delta _t\) as described in Section 3.4.2. Besides, Robin incorporates a sanity checker to detect NPD vulnerabilities. This design is intended to allow Robin to behave consistently whether the code is built with different optimizations or is obfuscated.
Sanity Checker for NPD. The checker solves the target address of memory dereference instructions. For example, for the instruction “\(mov\ eax,\ [eax+0Ch]\)”, it will determine the concrete address value of “\(eax+0Ch\)”. Then, the checker evaluates the value of “\(eax\)” by retrieving the value stored in the register. If it equals 0, the target address points to an invalid memory space, which incurs an NPD vulnerability. The sanity checker reports such vulnerabilities by checking the value of target addresses. Since Robin only symbolically executes the target function and not the entire program, the memory layout constructed by other functions is unknown. Thus, Robin cannot detect memory overflow or information leak vulnerabilities, which require precise memory boundaries.
To address this issue, if Robin does not detect the NPD vulnerability during input-driven execution, it will determine the vulnerability probability \(Pv\) to show the likelihood of the function being affected by other types of vulnerabilities. Robin utilizes function signatures differences among \(\Delta _t\) (target function), \(\Delta _p\) (patched function), and \(\Delta _v\) (vulnerable function) to calculate the vulnerability probability according to Equation (3).
\begin{equation} \begin{aligned}Pv(\Delta _t, \Delta _v,& \Delta _p) = F_s(MA_t, MA_p, MA_v) \times \alpha + F_s(PC_t, PC_p, PC_v) \times \beta \\ &+ F_s(AR_t, AR_p, AR_v) \times \gamma + F_s(CA_t, CA_p, CA_v) \times \delta \end{aligned} , \end{equation}
(3)
\begin{equation} F_s(S_t, S_p, S_v) = \tanh \left(\frac{L_s(S_t, S_p)}{L_s(S_t, S_v)} -1 \right) , \end{equation}
(4)
\begin{equation} L_s(S_1, S_2) = \frac{LCS(S_1, S_2) + 1}{ max(len(S_1), len(S_2)) + 1} . \end{equation}
(5)
In Equation (3), we have \(\alpha + \beta + \gamma + \delta = 1\). The function \(F_s(\cdot , \cdot , \cdot)\) calculates the similarity of a certain semantic feature among the target function, patched function, and vulnerable function. Since related functions (both vulnerable or both patched) share a similar execution trace, the feature sequences derived from the similar execution traces are consistent. We use the normalized longest common subsequence value shown in Equation (5) to fit the execution trace similarity. To scale the scores to \([-1,1]\), Robin uses the hyperbolic tangent function (\(tanh(\cdot)\)) in Equation (4) and assigns different weights to each feature. The final similarity score also has a range of \([-1,1]\). The larger the value, the more similar the target function is to the patched function. A score of 0 means that the target function is no more similar to the patched function than to the vulnerable one. Robin determines whether the target function is vulnerable or patched based on this similarity score.
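Equations (3)-(5) translate into a few lines of code, reusing the lcs_len helper from the Section 3.4.3 sketch; the weights are the values later learned in Section 4.1.2, and the signature layout is the same assumed dictionary of four feature lists.

```python
# Hedged sketch: vulnerability probability Pv of Equations (3)-(5).
from math import tanh

WEIGHTS = {"MA": 0.57, "PC": 0.11, "AR": 0.18, "CA": 0.14}   # alpha..delta (Section 4.1.2)

def l_s(s1, s2):                  # Equation (5): normalized LCS similarity
    return (lcs_len(s1, s2) + 1) / (max(len(s1), len(s2)) + 1)

def f_s(s_t, s_p, s_v):           # Equation (4): closer to patched => positive
    return tanh(l_s(s_t, s_p) / l_s(s_t, s_v) - 1)

def pv(sig_t, sig_p, sig_v):      # Equation (3): > 0 leans patched, < 0 vulnerable
    return sum(w * f_s(sig_t[k], sig_p[k], sig_v[k]) for k, w in WEIGHTS.items())
```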

4 Evaluation

We aim at answering the following research questions (RQs):
RQ1: How accurate is Robin for patch detection across different compilation optimization levels, different compilers, and different architectures, compared to state-of-the-art related works?
RQ2: What is the performance of Robin, compared to state-of-the-art related works?
RQ3: How much can Robin improve the accuracy of state-of-the-art function matching-based vulnerability detection tools?
We implement Robin in Python with 8,592 lines of code, supporting Intel X86 32-bit, 64-bit, and ARM. We utilize IDA Pro [8] and IDAPython to disassemble the binary functions and construct the CFGs, and we use the symbolic execution engine Angr (9.0.5171) [1] and the theorem prover Z3Prover [7] to solve the PoC constraints. All programs run on an Ubuntu server with a 56-core Intel Xeon E5-2697 CPU @ 2.6 GHz and 256 GB of memory.

4.1 Experiment Setup

4.1.1 Dataset.

To test the accuracy and the performance of Robin, we select 10 real-world, well-known projects from various application domains (e.g., cryptography and image processing) and collect the corresponding vulnerability descriptions (e.g., software name, vulnerable and patched versions, and function names) from NVD [9]. For each vulnerability collected, we compile the vulnerable and patched versions of the project source code with GCC 7.5.0 to extract the vulnerable and patched functions. For patches applied to multiple functions, we manually select the function(s) that contain(s) the vulnerability for analysis. Then, we select functions from versions before the patch as the vulnerable target functions and functions after the patch as the patched target functions to form the ground truth data for the experiment. Specifically, we choose the first and last versions from both the vulnerable version range and the patched version range. For example, CVE-2015-0288 [2] is a vulnerability in OpenSSL versions 1.0.1a~1.0.1l. The versions after 1.0.1l (1.0.1m onward) are patched. We extract the vulnerable function from version 1.0.1l and the patched function from 1.0.1m. We regard versions 1.0.1a~1.0.1k as the vulnerable target versions, and we take versions 1.0.1n and 1.0.1u from the patched version range as the patched target versions, since 1.0.1u is the latest version of OpenSSL 1.0.1.
Table 3 shows the dataset used to evaluate Robin. In total, we compiled 209 different versions of programs from 10 different projects, which include 287 CVEs. The types of vulnerabilities consist of NPD, buffer overflow, integer overflow, double free, and use after free. Except for the compiler optimization level, we choose the default compilation configuration to build the binaries so that they are closer to real-world cases. For each version of a program, we compile it with four different optimization levels (O0-O3). To test the detection accuracy for different compilers and architectures, we choose two projects (OpenSSL and Tcpdump) to compile and test since these two projects have the largest number of test cases. We utilize ICC (version 2021.1) and Clang (version 6.0) to compile them for the cross-compiler dataset with the O0 optimization level, and we utilize the arm-linux-gcc (v7.5.0) compiler to compile them for the ARM architecture dataset with the O0 optimization level.
Table 3.
Project         | OpenSSL | Binutils | Tcpdump | Freetype | Ffmpeg | OpenSSH | Libexif | Libpng | Expat | Libxml2 | Total
# of CVEs       | 60      | 60       | 68      | 40       | 35     | 11      | 6       | 4      | 2     | 1       | 287
Versions of Bin | 41      | 31       | 42      | 25       | 23     | 4       | 4       | 5      | 4     | 4       | 201
Table 3. Dataset
The number of CVEs and versions of each software.

4.1.2 Weight Assignment.

As shown in Equation (3), Robin combines four semantic feature scores with different weights \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\). To determine these weights, we adopt the linear regression algorithm from machine learning [47]. It learns the optimal weights to fit Equation (3) for the purpose of predicting vulnerabilities and patches. To begin, we use Robin to collect the semantic features \(\Delta _t\) of all O0-optimized target functions. We train and test the linear regression model ten times using \(\Delta _v\), \(\Delta _p\), \(\Delta _t\), and the target function label (i.e., 1 for a patched function and \(-\)1 for a vulnerable function). Each time, we randomly divide the dataset according to a ratio of 8:2, with 80% for training and 20% for testing. Following each round of training and testing, we obtain a set of weight values for \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) and the corresponding test accuracy. We choose the set of weight values with the maximum accuracy, which is \(\alpha =0.57\), \(\beta =0.11\), \(\gamma =0.18\), and \(\delta =0.14\), with a 92.42% test accuracy.
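The weight fitting can be reproduced with scikit-learn's linear regression on the four per-feature F_s scores; loading the dataset and repeating over ten random splits are elided, and the feature-matrix layout and the normalization of the coefficients to sum to 1 are our own assumptions.

```python
# Hedged sketch: fitting the weights alpha..delta with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_weights(fs_scores, labels):
    """fs_scores: N x 4 array of [F_s(MA), F_s(PC), F_s(AR), F_s(CA)] per target
    function; labels: +1 for patched, -1 for vulnerable (assumed encoding)."""
    X_tr, X_te, y_tr, y_te = train_test_split(fs_scores, labels, test_size=0.2)
    model = LinearRegression(fit_intercept=False).fit(X_tr, y_tr)
    weights = model.coef_ / model.coef_.sum()        # scale so the weights sum to 1
    accuracy = np.mean(np.sign(X_te @ weights) == np.sign(y_te))
    return weights, accuracy
```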

4.2 Accuracy Evaluation (RQ1)

We select two vulnerable target functions and two patched target functions for each CVE in our ground truth dataset to form 817 test cases. In the patch detection evaluation, we regard functions with similarity scores greater than 0 as patched.

4.2.1 Cross-optimization Levels Patch Detection.

In cross-optimization detection, we utilize Robin to conduct detection on binaries compiled with different compilation optimizations. Table 4 shows the cross-optimization-level patch detection accuracy of Robin. The rows show the optimization level used to compile the target programs, and the columns give the optimization level used to compile the signatures. Specifically, the top half of columns 2 to 4 of the table shows the accuracy for big projects (i.e., OpenSSL, Binutils, Tcpdump), which have more than 100 test cases. The bottom half gives the accuracy of the remaining small projects (i.e., Freetype, Ffmpeg, Mixed). For the last column, \(Cross\) means the average accuracy when matching the signature against targets compiled at different optimization levels for all projects, and \(No-Cross\) refers to the accuracy of matching CVEs at the same optimization level. In Table 4, the bold values represent the detection accuracy in non-cross optimization settings, serving as a benchmark for the results obtained in cross-optimization settings. From the table, we can see that Robin achieves accuracy ranging from 60% to 98% (80.0% on average) across all optimization-level combinations. In general, predicting patch presence when the signature and target are compiled at the same optimization level yields higher accuracy (88%~93%) than across different optimization levels (75%~81%).
Table 4.
Software | OpenSSL             | Binutils            | Tcpdump             | No-Cross
         | O0   O1   O2   O3   | O0   O1   O2   O3   | O0   O1   O2   O3   |
O0       | 0.94 0.80 0.75 0.72 | 0.90 0.84 0.83 0.81 | 0.94 0.62 0.62 0.60 | 0.93
O1       | 0.82 0.94 0.84 0.83 | 0.82 0.92 0.87 0.88 | 0.67 0.91 0.78 0.81 | 0.88
O2       | 0.76 0.84 0.92 0.90 | 0.84 0.87 0.92 0.88 | 0.65 0.70 0.98 0.95 | 0.90
O3       | 0.74 0.84 0.88 0.91 | 0.86 0.92 0.90 0.91 | 0.64 0.72 0.94 1.0  | 0.88

Software | Freetype            | Ffmpeg              | Mixed               | Cross
         | O0   O1   O2   O3   | O0   O1   O2   O3   | O0   O1   O2   O3   |
O0       | 0.88 0.76 0.65 0.69 | 0.97 0.96 0.94 0.67 | 0.91 0.77 0.70 0.67 | 0.75
O1       | 0.74 0.81 0.70 0.63 | 0.96 0.97 0.97 0.67 | 0.65 0.81 0.67 0.66 | 0.79
O2       | 0.58 0.67 0.79 0.74 | 0.97 0.96 0.97 0.67 | 0.64 0.69 0.78 0.71 | 0.81
O3       | 0.58 0.63 0.75 0.78 | 0.64 0.64 0.64 0.87 | 0.69 0.74 0.74 0.83 | 0.76
Table 4. Cross-optimization Patch Detection Accuracy of Robin
Result Discussion. We manually analyzed the false positive and false negative cases and summarize two reasons. First, during the patch localization phase, Robin mistakenly takes some changed blocks as the patch blocks and generates an ineffective MFI. Using such an MFI results in both false positive and false negative cases in patch detection. Second, the memory layout of structures may change slightly between different versions of the program. Since Robin accesses the member variables of structure objects by memory offset, it may evaluate the wrong variable due to these changes, which also produces false positive and false negative cases.

4.2.2 Related Works Comparison.

We have selected the most relevant state-of-the-art patch presence detection tools, BinXray, Fiber, and PMatch, and compared their accuracy against Robin. We use the same dataset as in Section 4.2 to measure their cross-optimization and cross-compiler abilities to detect the patched functions. Since BinXray is tested at the O0 optimization level in [56], we follow the same setting and generate its signatures from binaries compiled at O0. The signatures are then used to detect the patch presence in target binaries compiled at all optimization levels (O0-O3) using BinXray. For Fiber and PMatch, we prepare them for detection by following their official recommendations. Fiber's signature generation requires elaborately prepared patch data, including patch code (in the form of git commits), software source code of different versions, and symbol tables of the target software binaries. Among the 287 CVEs, we successfully gather patch data for 210 by downloading commit code from the URLs referenced on the NVD website. Consequently, Fiber does not support the 77 CVEs that lack patch commit code. We employ Fiber to extract patch signatures for the 210 CVEs, but only 65 patch signatures are successfully generated. The fundamental cause of signature generation failure is that the root instruction, which is the primary reference for signature generation, cannot be found by Fiber. Considering that Fiber is designed to detect patches in the Android kernel, we believe that Fiber's scalability to common software requires additional adaptation. This conclusion is supported by research [48] that conducts comparative studies with Fiber.
Table 5 shows the non-cross-optimization and cross-optimization comparison results between these tools. The first row gives the names of the tools, and the second row gives the optimization levels at which the target binaries are compiled. The third row denotes the number of test cases used in each sub-experiment; since target functions may be inlined into other functions when a binary is compiled at high optimization levels, the number of test cases decreases as the optimization level increases. The fourth and fifth rows denote the number and percentage of cases on which each tool successfully conducts patch detection and outputs a result. Robin and PMatch support all test cases, while BinXray and Fiber support fewer and fewer as the optimization level increases. The sixth row denotes the patch detection accuracy among the supported cases. The last row of Table 5 gives the overall accuracy in detecting patched functions compiled at various optimization levels; these values reflect the detection accuracy of the different tools in real-world scenarios.
Table 5.
| Tool Name | Robin | | | | BinXray | | | | Fiber | | | | PMatch | | | |
| | O0 | O1 | O2 | O3 | O0 | O1 | O2 | O3 | O0 | O1 | O2 | O3 | O0 | O1 | O2 | O3 |
| Test Cases | 817 | 698 | 669 | 621 | 817 | 698 | 669 | 621 | 817 | 698 | 669 | 621 | 817 | 698 | 669 | 621 |
| Supported # | 817 | 698 | 669 | 621 | 665 | 231 | 209 | 72 | 186 | 167 | 167 | 152 | 817 | 698 | 669 | 621 |
| Support Rate | 100.0% | 100.0% | 100.0% | 100.0% | 81.4% | 33.1% | 31.2% | 11.6% | 22.7% | 23.9% | 25.0% | 24.5% | 100.0% | 100.0% | 100.0% | 100.0% |
| Acc in Support | 0.925 | 0.766 | 0.762 | 0.715 | 0.950 | 0.860 | 0.892 | 0.849 | 0.548 | 0.527 | 0.527 | 0.506 | 0.909 | 0.686 | 0.652 | 0.504 |
| Overall Acc | 0.925 | 0.766 | 0.762 | 0.715 | 0.776 | 0.290 | 0.265 | 0.158 | 0.125 | 0.126 | 0.132 | 0.124 | 0.909 | 0.686 | 0.652 | 0.504 |
Table 5. Cross-optimization Detection Comparison
Acc is short for accuracy.
Non-cross Optimization Comparison. The columns "O0" give the non-cross optimization detection results, since the patch signatures are generated from O0-compiled binaries. In general, Robin and PMatch are more robust than BinXray and Fiber, since they support all functions at all optimization levels, whereas BinXray only manages to generate signatures for a few cases when the optimization level is high. Similarly, Fiber supports only 186 test cases under O0 optimization, with a 54.8% accuracy rate. PMatch and Robin demonstrate a high degree of precision, 90.9% and 92.5%, respectively.
Cross Optimization Comparison. The columns "O1"–"O3" show the cross-optimization detection results. As the compilation optimization level grows, BinXray and Fiber exhibit increasingly poor scalability; for example, under O3 optimization they support patch detection for only 72 and 152 of the 621 target functions, respectively. PMatch and Robin scale well across optimization levels. However, PMatch's accuracy decreases when detecting patches across optimization levels, in line with the trend of the other tools. The decrease is because PMatch's patch detection relies mostly on distinct code blocks retrieved from target functions by diffing them against vulnerable functions: when the target and vulnerable functions are compiled with different optimizations, the extracted code blocks differ notably from the reference blocks in the signatures, and PMatch is unable to detect patched code blocks from them. As optimization rises, Robin's detection accuracy remains comparatively high (71.5%–76.6%).
Case Study on CVE-2015-3196. As an illustration of Robin's advantage, we examine the vulnerability confirmation and patch detection of CVE-2015-3196 [13] under O1 optimization. CVE-2015-3196 exists in OpenSSL 1.0.1 versions prior to 1.0.1p (i.e., OpenSSL 1.0.1a–1.0.1o) and is patched in OpenSSL 1.0.1p–1.0.1u (1.0.1u being the latest version). The target function from O1-compiled OpenSSL 1.0.1u is selected as a patched function. For BinXray, patch detection begins with a comparison between the target function and the vulnerable function signature from O0-compiled OpenSSL 1.0.1o. Since the syntactic differences between the target function and the signature function are substantial due to the distinct compilation optimization, the diffing process outputs numerous differing code blocks. The following phase, code trace generation, cannot be performed because the number of paths traversing these code blocks explodes. Consequently, BinXray is unable to make a decision on the target function. PMatch [37] fails to detect the patch in the target function for the same reason, as it mainly relies on diffing results. Fiber [63] is incapable of identifying the patch in stripped binaries since it relies entirely on debug information, which is removed from stripped binaries. Because Robin employs MFI to execute the code from the function entry to the patch, it performs precise patch detection on the target function: as long as the semantic execution logic is consistent, Robin is tolerant of optimization-introduced syntactic code changes.

4.2.3 Cross-compiler Detection.

To evaluate cross-compiler detection accuracy, we select the two projects with the most test cases (OpenSSL and Tcpdump) and compile them with different compilers: GCC, ICC (version 2021.1), and Clang (v6.0). During compilation, we set the compiler optimization level to O0. We conduct patch presence detection on binaries compiled by different compilers (i.e., ICC-x86, Clang-x86) and a different architecture (i.e., GCC-arm) with CVE signatures from GCC-x86. Table 6 presents the supported test cases and the accuracy of Robin and the baseline tools for cross-compiler patch detection; Robin performs best and serves as the benchmark for comparison with the other tools. Robin achieves high accuracy on both Clang-compiled and ICC-compiled binaries, with 77.4% and 83.6% accuracy, respectively. Since Clang-compiled binaries use different stack layouts and registers, the detection accuracy is slightly lower than for ICC. Robin also supports all test cases since it can feed MFI to any function and execute it to obtain semantic features. The detection accuracy of BinXray and Fiber is low because basic blocks change significantly across compilers; their syntactic features are not stable enough to match the original signatures. PMatch scales well but also suffers low detection accuracy due to the syntactic changes in the code. Since the architecture information can be read from the file header of a binary, cross-architecture patch detection is not strictly necessary in the real world; nevertheless, our approach also applies to different architectures. We run Robin to conduct patch detection on target functions from ARM binaries: out of 347 test cases, Robin achieves an 87.7% accuracy.
Table 6.
| Compiler | Clang | | ICC | |
| | Supported # | Acc | Supported # | Acc |
| BinXray | 119 | 15.8% | 93 | 16.5% |
| Fiber | 84 | 10.8% | 64 | 12.5% |
| PMatch | 345 | 65.8% | 345 | 68.3% |
| Robin | 345 | 77.4% | 345 | 83.6% |
Table 6. Cross-compiler Detection Result Comparison

4.2.4 Scalability of Robin.

We study Robin's limitations and explore how Robin scales to functions of different sizes.
MFI Generation. Robin generates 287 MFIs out of 292 CVEs, failing on five. For the failed cases, we manually examine the candidate paths to the patch code and find that the patch code (in particular the mitigation point) is located after a large number of indirect jump instructions, making it difficult for the symbolic execution engine to determine the correct jump target address. For instance, the patch code for CVE-2017-13051 [14] sits in a "switch" statement with multiple "case" branches. In assembly code, the switch statement is implemented by a jump table, i.e., a table storing the addresses of the case statements. The patch code cannot be reached because symbolic execution cannot determine which address should be selected from the jump table.
Accuracy for Functions of Different Sizes. We perform a statistical analysis of the size (i.e., the number of basic blocks) distribution of the 871 tested functions and of the patch detection accuracy among them. The first column of Table 7 lists the function size ranges, and the second column gives the number of functions that fall within each range. Function sizes range from 5 to 1,000, and the majority are smaller than 205. The third column displays the patch detection accuracy for functions of each size range. The detection accuracy ranges from 0.833 to 0.931, indicating reasonably accurate detection with no trend of diminishing accuracy as function size increases. The MFI size distribution over the 187 signature functions and the accompanying detection accuracy are detailed in Table 8. The first column indicates the length (i.e., the number of basic blocks) of the path on which the MFI is created, which also approximates the distance between the patch block and the function entry. The second column gives the number of signature functions (i.e., patched functions) used in MFI construction and signature generation. The path length spans from 1 to 60, with (5, 10] accounting for the largest share (47 functions). The third column gives the detection accuracy obtained using MFIs of the corresponding path lengths. The accuracy ranges between 0.880 and 0.930, indicating consistent detection performance. In terms of both function size and MFI path length, Robin thus exhibits a consistently good level of detection accuracy; in other words, Robin's patch detection scales robustly.
Table 7.
| Func Size | Func # | Detection Accuracy |
| (5, 25] | 73 | 0.896 |
| (25, 45] | 182 | 0.907 |
| (45, 65] | 162 | 0.906 |
| (65, 85] | 110 | 0.931 |
| (85, 105] | 38 | 0.833 |
| (105, 165] | 105 | 0.884 |
| (165, 205] | 105 | 0.913 |
| (205, 250] | 41 | 0.925 |
| (250, 1000] | 55 | 0.917 |
Table 7. CVE Function Size Distribution and Detection Accuracy in Target Functions
Table 8.
| MFI Path Length | Sig Func # | Detection Accuracy |
| (1, 5] | 38 | 0.887 |
| (5, 10] | 47 | 0.909 |
| (10, 15] | 35 | 0.916 |
| (15, 20] | 24 | 0.880 |
| (20, 25] | 14 | 0.930 |
| (25, 30] | 10 | 0.906 |
| (30, 60] | 19 | 0.914 |
Table 8. MFI Path Length Distribution in Signature Functions and Detection Accuracy in Target Functions
Answering RQ1: Robin effectively predicts the patch presence with 80.0% accuracy on average across different optimization levels from O0 to O3, and 80.5% accuracy on average across different compilers. Compared with other baseline methods, Robin outperforms them across different compilation settings with larger vulnerability coverage and higher accuracy.

4.3 Performance Evaluation (RQ2)

Figure 4(a) reports the average breakdown time of offline CVE signature generation and the average online detection time. As shown in Figure 4(a), PBS takes 0.0003 seconds. FPH takes 14.8 seconds, which counts the time to find a feasible path for one changed block; on average, each patched function contains 6.1 changed blocks. IR takes 0.26 seconds. The Input-driven Execution (IE) takes 0.38 seconds, which includes the signature extraction time. IS takes 1.76 seconds. The whole offline phase of Robin takes 133 seconds on average, as shown in Figure 4(b), and the online detection time is 0.5 seconds per function on average.
Fig. 4.
Fig. 4. Performance of Robin.
Comparison With Baseline Tools. As shown in Figure 4(b), Robin's offline phase requires more time than the other baseline tools; in particular, Robin, BinXray, PMatch, and Fiber require an average of 133, 0.2, 30, 5.9, and 6.3 seconds, respectively. In the detection phase, Robin, BinXray, and PMatch require little time per function, i.e., 0.518 s, 1.020 s, and 0.125 s, respectively. Unlike Robin and BinXray, which require only two functions (the vulnerable and patched functions) to work properly, PMatch requires the manual selection of patch code blocks, and Fiber requires source code preparation and debugging information in the target binary, making both tools considerably less scalable.
Compared with the results reported in BinXray's paper, BinXray's performance declines here for two reasons. (1) In BinXray's original experiments, patches are detected in binaries produced with the same optimization level, whereas our evaluation measures BinXray's performance on binaries produced with various optimization levels; the computation time therefore increases dramatically as the number of changed blocks rises. (2) BinXray requires varying amounts of time on different software. For example, according to BinXray [56], binutils and openssl require 911.32 ms and 246.47 ms per function, respectively. Since openssl accounts for 36.6% of the functions in BinXray's dataset and binutils for only 0.06%, its reported average detection time is low. In our dataset, openssl accounts for 10.2% of the functions and binutils for 13.5%; consequently, the box plot in Figure 4(b) shows a longer detection time for BinXray. Since Robin executes the functions while BinXray only matches syntactic information, Robin's performance is competitive. Moreover, BinXray cannot support many cases due to the path explosion problem; if we count the time wasted attempting to generate signatures for the unsupported cases, BinXray's average performance would be worse than Robin's. Considering the significant accuracy improvement brought by Robin, the tradeoff between performance and accuracy from using semantic features is acceptable.
Answering RQ2: Robin manages to finish vulnerability detection for real-world programs in a reasonable amount of time. Between the two more scalable tools, Robin and BinXray, Robin performs better in online patch detection and scales better to real-world patch detection.

4.4 Vulnerability Detection Improvement (RQ3)

The primary application of Robin is to reduce false positives while retaining the recall rate in the vulnerability detection results of function matching tools. We apply Robin and BinXray for patch detection on the function matching results discussed in Section 2.1. The two tools re-score and re-rank the top 50 candidate functions in the function matching results, with the aim of ranking vulnerable functions higher (i.e., close to 1) and patched functions lower (i.e., close to 50). We use the same metrics as introduced in Section 2.1; the results are provided in Table 9. The first column of Table 9 shows the combination of BCSD and patch detection tools used; for example, "Gemini + BinXray" indicates that we first use the BCSD tool Gemini to determine the top 50 functions most similar to a given vulnerable function, which may contain patched functions, and then apply BinXray. With Robin integrated (i.e., [BCSD tool] + Robin), Recall@Top-1 is retained at 85.71%, while Recall@Top-5 improves from 91.66% to 95.24% and Recall@Top-10 improves from 91.66% to 97.62% on average; notably, the Recall@Top-5 and Recall@Top-10 of Gemini increase to 100%. Table 9 also shows the FPRs after applying Robin and BinXray to Gemini, SAFE, and Bingo in vulnerability detection. The results reveal that Robin reduces FPRs significantly more than BinXray, whose scalability is inadequate (as also indicated in RQ2). In particular, Robin reduces the FPR of the top 10 results from 89.13% to 4.34% for Gemini, from 84.78% to 6.52% for SAFE, and from 93.48% to 4.34% for Bingo. In contrast, BinXray only reduces FPR@Top-1 to around 55%, meaning that it still misses almost 50% of the patched functions. Its poor scalability is primarily caused by redundant code modifications in target patched functions (e.g., a target patched function adding code beyond the signature patched function) or by BinXray's inability to locate matched code traces in patched functions.
Table 9.
| Tool Combination | Recall@Top-1 | Recall@Top-5 | Recall@Top-10 | FPR@Top-1 | FPR@Top-5 | FPR@Top-10 |
| Gemini+BinXray | 53.57% | 67.86% | 78.57% | 57.69% | 76.92% | 84.61% |
| Gemini+Robin | 82.14% | 100% | 100% | 2.17% | 4.34% | 4.34% |
| SAFE+BinXray | 55.56% | 70.37% | 74.07% | 46.15% | 65.38% | 76.92% |
| SAFE+Robin | 82.14% | 89.29% | 96.43% | 4.34% | 4.34% | 6.52% |
| Bingo+BinXray | 64.29% | 75.0% | 82.14% | 54.16% | 79.17% | 83.33% |
| Bingo+Robin | 92.85% | 96.43% | 96.43% | 2.17% | 4.34% | 4.34% |
Table 9. Recall and FPR of Integration of BCSD and Patch Detection
From the table, we can see that the FPR after applying Robin is reduced significantly. We illustrate the improvement using the matching and detection results of the OpenSSL project, which has the most CVEs. Figure 5 shows the ranking results of function matching (a) and of patch detection after applying Robin (b). In Figure 5(a), we plot the ranking of target functions among the candidate functions: functions marked in red are vulnerable, and functions marked in green are patched. Gemini, SAFE, and Bingo rank both vulnerable and patched functions high; for vulnerability detection tasks, the patched functions are false positives. In Figure 5(b), we plot the re-ranking results after using Robin to conduct vulnerability confirmation and patch detection on the function matching results. The figure shows that Robin ranks patched functions lower. Besides, Figure 6 (where the x-axis represents scores and the y-axis represents distribution density) shows that Robin gives all unrelated functions (in blue) scores around 0, suggesting that Robin can distinguish and filter out irrelevant functions from the vulnerable and patched functions in the matching results. After re-scoring, we can also apply a threshold to refine the candidate functions with a low FPR and high recall, as shown in Table 10.
Table 10.
| Threshold | FPR | Recall |
| -0.2 | 0.09 | 0.82 |
| -0.3 | 0.08 | 0.71 |
| -0.4 | 0.05 | 0.68 |
| -0.5 | 0.01 | 0.64 |
Table 10. Recall and FPR under Different Thresholds
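As a hedged illustration of the re-scoring and thresholding described above (not Robin's actual code; the function names and scores below are made up), candidates from a BCSD top-50 list can be re-ranked by their Robin scores, where values near -1 indicate vulnerable-like behavior and values near 0 indicate unrelated functions, and then filtered with a threshold such as those in Table 10:

```python
def rerank_and_filter(candidates, threshold=-0.4):
    """candidates: list of (function_name, robin_score) pairs from a BCSD top-50 list."""
    # Rank the most vulnerable-looking candidates first (most negative score on top).
    ranked = sorted(candidates, key=lambda item: item[1])
    # Keep only candidates whose score is at or below the threshold.
    flagged = [(name, score) for name, score in ranked if score <= threshold]
    return ranked, flagged

# Hypothetical scores for illustration only.
candidates = [("ssl3_get_key_exchange", -0.97),
              ("X509_verify_cert", 0.02),
              ("dtls1_process_heartbeat", -0.45),
              ("tls1_check_sig_alg", 0.88)]
ranked, flagged = rerank_and_filter(candidates, threshold=-0.4)
print(flagged)  # [('ssl3_get_key_exchange', -0.97), ('dtls1_process_heartbeat', -0.45)]
```

A stricter (more negative) threshold trades recall for a lower FPR, matching the trend in Table 10.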
Fig. 5.
Fig. 5. Rank distribution of vulnerable functions (in red) and patched functions (in green) before and after vulnerability confirmation.
Fig. 6.
Fig. 6. Function score distribution of Robin.
Listing 2.
Listing 2. Source Code of EC_KEY_get_key_method_data.

4.4.1 New Vulnerability Detection.

Traditional approaches such as Gemini focus on matching the same or similar functions in two binaries; they therefore aim at detecting 1-day vulnerabilities that recur in different programs. But the results of function matching tools usually contain false positives, which may themselves hide new vulnerabilities since they share similar code patterns with the vulnerable functions. Robin can not only trigger the vulnerabilities in the known vulnerable functions but also detect new vulnerabilities among these false-positive results. As Figure 6 shows, some unrelated functions receive scores of -1. We manually analyze the candidates with a -1.0 score (i.e., those behaving like a vulnerable function signature) and find 15 candidates with vulnerable behaviors. Among them, 3 candidates have been reported before, and the remaining 12 exhibit vulnerable behavior when given the inputs in the signature. We list the specifics of the vulnerable cases in Table 11. The first column specifies the CVEs used to generate the PoC inputs. The second and third columns specify the versions and the candidate function names that have a similarity score of -1.0. The fourth column specifies whether the function has already been reported as a CVE. The last column gives the file names and line numbers of the vulnerable points after manual examination. For example, the function RSA_check_key has a -1.0 score when matched against the signature of CVE-2015-0289. Its source code is listed in Listing 2: it performs pointer dereferences without a null-pointer check at the statement if (!key->p || !key->q || !key->n || !key->e || !key->d), and we verified that one of its callers does not perform a null-pointer check either. Attackers can leverage these dereferences to manipulate memory contents and compromise the system.
Table 11.
| PoC of CVE | Version | Insecure Function | Reported in CVE | Vulnerability Point |
| CVE-2015-0289 [5] | 1.0.1a | PKCS7_dataInit | Y | pk7_doit.c; line 275 |
| | 1.0.1k | PKCS7_dataInit | Y | pk7_doit.c; line 275 |
| | 1.0.1m | UI_process | N | ui_lib.c; line 469 |
| | 1.0.1m | ssl_check_serverhello_tlsext | N | t1_lib.c; line 2049 |
| | 1.0.1m | OCSP_basic_verify | N | ocsp_vfy.c; line 172 |
| | 1.0.1m | pkey_GOST01cp_encrypt | N | pmeth_lib.c; line 448 |
| | 1.0.1m | RSA_check_key | N | rsa_chk.c; line 62 |
| CVE-2015-0209 [4] | 1.0.1m | compute_wNAF | N | ec_mult.c; line 194 |
| | 1.0.1m | x509_name_canon | N | x_name.c; line 346 |
| | 1.0.1m | ec_GFp_mont_group_set_curve | N | ecp_mont.c; line 195 |
| | 1.0.1m | b2i_rsa | N | pvkfmt.c; line 350 |
| | 1.0.1m | ssl_bytes_to_cipher_list | N | ssl_lib.c; line 1434 |
| | 1.0.1m | DH_check | N | dh_check.c; line 89 |
| CVE-2016-2181 [6] | 1.0.1m | cms_DigestAlgorithm_find_ctx | N | evp_lib.c; line 258 |
| CVE-2015-0288 | 1.0.1l | X509_to_X509_REQ | Y | x_pubkey.c; line 99 |
Table 11. Vulnerabilities Found in Different Versions of OpenSSL
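To make the vulnerable-behavior check concrete, the following is a minimal sketch (assuming an angr-based setup; the binary name and address are placeholders, and this is not Robin's actual implementation) of how a candidate such as RSA_check_key could be executed with a NULL pointer argument to observe a null-pointer dereference:

```python
import angr

proj = angr.Project("libcrypto.so", auto_load_libs=False)
FUNC_ENTRY = 0x80b3c20  # hypothetical address of the candidate function

# Call the function with its pointer argument forced to NULL (0). STRICT_PAGE_ACCESS
# makes accesses to unmapped memory fault instead of returning symbolic data.
state = proj.factory.call_state(FUNC_ENTRY, 0,
                                add_options={angr.options.STRICT_PAGE_ACCESS})
simgr = proj.factory.simulation_manager(state)
simgr.run(n=200)  # bounded exploration of the candidate function

# Faulting states end up in the 'errored' stash; a segfault shortly after entry with a
# NULL argument is strong evidence of a missing null-pointer check.
for record in simgr.errored:
    if isinstance(record.error, angr.errors.SimSegfaultException):
        print("potential null-pointer dereference at", hex(record.state.addr))
```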
Answering RQ3: The integration of Robin as a patch detection tool has resulted in a significant improvement in vulnerability detection results. While maintaining a high recall rate, Robin has significantly reduced the FPRs in the top 10 vulnerability matching results from 89.13% to 5.07% on average. It also detects 12 new vulnerabilities [11], which cannot be found by traditional matching-based approaches.

5 Discussion

5.1 Symbolic Execution Application

Can normal symbolic execution build MFI? Normal symbolic execution aims at precisely traversing program execution paths by considering all constraints. However, this approach can lead to scalability issues and produce numerous irrelevant and redundant execution paths. As a quick initial test, we applied normal symbolic execution based on angr to 20 randomly selected CVE functions, attempting to find a feasible path starting from the program entry and ending at the mitigation point; the program entries of libraries were set to commonly used API functions such as "SSL_read" for OpenSSL. No MFI was successfully built: the executions failed to reach the mitigation points because they ran out of memory and were terminated. In contrast, UCSE ignores the constraints outside the vulnerable function and generates possibly feasible paths that execute the vulnerable or patched code for semantic analysis. Although some of these paths may not be feasible in real program execution, UCSE is more suitable than normal symbolic execution when the focus is on local code semantic analysis. As shown in the experimental results, Robin successfully built MFIs for 287 out of 292 CVEs and demonstrated a high patch detection precision of 92.5% at O0 optimization. These results indicate that UCSE is a better fit than normal symbolic execution for summarizing patch and vulnerability code semantics, based on which Robin achieves its performance.
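The following is a minimal sketch of this under-constrained style of path finding, assuming an angr-based setup; the binary name, the addresses, and the 32-bit cdecl argument handling are illustrative assumptions, not Robin's actual implementation:

```python
import angr
import claripy

proj = angr.Project("libcrypto.so", auto_load_libs=False)
FUNC_ENTRY = 0x8049a10        # hypothetical entry of the vulnerable/patched function
MITIGATION_POINT = 0x8049b44  # hypothetical address of the patch (mitigation) block

# Start directly at the function entry instead of the program entry, leave uninitialized
# registers/memory symbolic, and skip callees (CALLLESS) so constraints outside the
# function are ignored -- the essence of the under-constrained setting.
state = proj.factory.blank_state(
    addr=FUNC_ENTRY,
    add_options={angr.options.SYMBOL_FILL_UNCONSTRAINED_MEMORY,
                 angr.options.SYMBOL_FILL_UNCONSTRAINED_REGISTERS,
                 angr.options.CALLLESS})

# Make the first stack argument an explicit symbol so it can be concretized later
# (assuming 32-bit x86 cdecl: the argument sits just above the return address).
arg0 = claripy.BVS("arg0", 32)
state.memory.store(state.regs.esp + 4, arg0, endness=proj.arch.memory_endness)

simgr = proj.factory.simulation_manager(state)
simgr.explore(find=MITIGATION_POINT, num_find=1)

if simgr.found:
    reached = simgr.found[0]
    # One concrete argument value that drives execution to the mitigation point,
    # i.e., a candidate piece of the malicious function input (MFI).
    print("candidate input:", reached.solver.eval(arg0))
```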
Loop Unrolling and MFI Build: Does More Unrolling Help? Unrolling loops once is a common optimization technique used in symbolic execution to reduce the number of paths and improve analysis efficiency, as mentioned in prior studies [29, 34]. During the third phase of feasible path confirmation, loops are unrolled multiple times in blind symbolic execution to explore as many paths as possible. Further unrolling in the third phase can identify feasible paths that traverse multiple loops, which were skipped in the first two phases. Unrolling loops resulted in the generation of 287 MFIs out of 292 CVEs. However, in five failed cases, unrolling loops was not effective in finding feasible paths due to indirect jumps with undecidable jump addresses. These jump addresses are recorded in jump tables, making the path constraints difficult to solve. We believe that more unrolling will not be helpful in our design.
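As a hedged sketch of how bounded loop unrolling can be expressed in such a setup (again assuming angr; the bound of 2 and the addresses are illustrative, not Robin's exact configuration), an exploration technique can park states that iterate a loop too many times instead of letting the path count explode:

```python
import angr

proj = angr.Project("target_binary", auto_load_libs=False)
cfg = proj.analyses.CFGFast(normalize=True)  # loop detection needs a normalized CFG

state = proj.factory.blank_state(addr=0x401000)  # hypothetical function entry
simgr = proj.factory.simulation_manager(state)

# Allow each loop to be traversed at most twice; states that keep spinning are moved
# to the 'spinning' stash rather than multiplying the active paths.
simgr.use_technique(angr.exploration_techniques.LoopSeer(cfg=cfg, bound=2))
simgr.explore(find=0x401234)  # hypothetical mitigation point
```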
The Consequence of Faking Function Call Returns. Faking function call returns simplifies path constraints, making them easier to solve and the MFI building process faster. However, this approach may lead to under-constrained paths that are not feasible in real program execution when the skipped call affects variables involved in the path constraints. Nevertheless, since the purpose of MFI is to drive execution toward the vulnerable or patched code, MFI built from under-constrained paths can still achieve this goal and capture the semantics of the vulnerable and patched code by executing them.

5.2 Limitations of Robin

Robin has the following limitations, which we plan to address in future work. First, although Robin has found potentially vulnerable functions in Section 4.4.1, it is difficult to guarantee that the vulnerabilities are triggerable. In the future, we would like to build a connection between the MFI and a program-level PoC and use the PoC to verify the triggerability of the vulnerability. Second, while Robin can trigger and detect NPD-vulnerable functions, it cannot handle buffer overflow vulnerabilities because the buffer boundary cannot be modeled with under-constrained symbolic execution [27]; for such cases, it measures semantic similarity to predict patch presence instead.

6 Related Works

In this section, we discuss the binary function matching-based vulnerability detection and patch presence detection works.
Function Matching Based Vulnerability Detection. Code clone-based vulnerability detection is one of the efficient static approaches to scan binary programs. BLEX [26] measures binary code similarity by calculating the memory access differences between two program traces. TRACY [22] divides a binary function into partial basic block traces and measures similarity based on them, which helps to match functions compiled with different basic block optimization levels. DiscovRE [28] uses numeric features to filter out dissimilar functions and compares control flow graphs to determine the matching pairs. Genius [30] combines control flow graph structural features and function numeric features to match and detect vulnerabilities in firmware. Gemini [55] addresses the cross-platform code similarity detection problem by embedding the control flow graph of a binary function into numeric vectors and calculating the distance between vectors to determine similarity. Bingo [20] proposes a selective inlining technique to match functions that are compiled at high optimization levels. Bingo-E [59] combines structural, semantic, and syntactic features to match functions across different compiler settings and further introduces a code execution method to boost matching performance and accuracy. Asm2Vec [23] leverages deep learning models to learn assembly code representations without prior knowledge and then matches functions by calculating the distance between function embeddings. αDiff [39] tries to automatically capture binary function features via machine learning models and uses them to measure similarity. Trex [43] applies a transfer-learning-based framework to learn execution semantics from functions' micro-traces, which are a form of under-constrained dynamic traces. Jtrans [52] embeds the control flow information of binary code into Transformer-based language models by utilizing a novel jump-aware representation pre-training task. Vgraph [18] extracts three kinds of code properties from the contextual code, the vulnerable code, and the patched code to construct vulnerability-related representations. Other works [15, 24, 31, 39, 62, 65] also leverage machine learning models to measure the similarity between function pairs. There are also source-code-level vulnerability detection works [33, 38, 45, 46, 60, 61] using function matching-based approaches. These works aim at detecting function clones in binary programs across different compilation settings. However, when they are directly applied to search for 1-day vulnerabilities, they have very high FPRs due to patches. Robin helps these works filter out the false-positive cases by accurately identifying the patched functions, making code clone detection a practical solution for searching for 1-day vulnerabilities.
Patch Presence Detection. Several works [40, 56] have been proposed to identify the patched functions in matching results. FIBER [63] first detects patches in Android Linux kernels with the help of source-level patch changes: it leverages the symbol table to locate the binary functions and uses the code changes to match for patched functions. BinXray [56] adopts a basic block mapping method to locate changed basic blocks and proposes an algorithm to extract patch signatures from the noise-prone changed blocks; it then uses the signature to determine whether a matched function is patched. PDiff [34] captures the semantics of a patch by learning from the source code and performs patch presence testing on binary-only downstream kernels. VMPBL [40] also aims at detecting vulnerabilities and improves matching accuracy by using patched functions as one part of the signature. PMatch [37] retrieves patch code from target functions by manually selecting elaborate patch code. These approaches rely heavily on syntactic information, so their accuracy drops when target functions are compiled with different settings that alter the syntax substantially. Some of them also require source code information, which may not be available. Our tool instead extracts semantic information, which is robust for determining patched functions across different compilers and optimization levels.
Patch Analysis. Patch analysis has become a popular way to understand software security. PatchScope [64] performs a large-scale review of patch-related code changes in software and proposes a memory object access sequence model to capture the semantics of patches. Vulmet [58] produces semantic-preserving hot patches by learning from official patches. BScout [21] predicts patch presence in Java executables by linking Java bytecode semantics to the source code. ReDeBug [32] finds unpatched code clones at OS-distribution scale by diffing the patched code and the vulnerable code. VUDDY [35] creates function fingerprints by hashing normalized function code and then conducts fast lookups between hash values. MVP [53] uses program traces to capture vulnerability and patch semantics to find recurring vulnerabilities in source code programs.

7 Conclusion

In this work, we propose Robin to detect and confirm vulnerable and patched functions in binary files with high accuracy. Robin summarizes the vulnerability signature by locating the patch through function diffing and solving for function inputs, the MFI, with symbolic execution. It feeds the MFI to target functions, measures the execution semantic differences between vulnerable and patched functions, and uses them to determine vulnerability and patch presence. The experimental results show that Robin detects vulnerabilities and predicts patches with high accuracy and affordable overhead across different compilers and optimization levels. It outperforms the state-of-the-art patch detection tools BinXray, PMatch, and Fiber by large margins under cross-compiler and cross-optimization settings. In addition, it improves the accuracy of the state-of-the-art matching-based vulnerability detection tools Gemini/SAFE/Bingo by reducing the FPR of the top 10 results from 89.13%/84.78%/93.48% to 4.34%/6.52%/4.34%, and it discovers 12 potentially new vulnerabilities in real-world programs.

References

[1]
2010. angr. Retrieved February 15, 2021 from https://angr.io/
[2]
2015. CVE - CVE-2015-0288. Retrieved February 21, 2021 from https://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2015-0288
[3]
2018. the-overlooked-problem-of-n-day-vulnerabilities. Retrieved February 08, 2021 from https://www.darkreading.com/vulnerabilities-threats/the-overlooked-problem-of-n-day-vulnerabilities
[4]
2021. CVE - CVE-2015-0209 Retrieved February 23, 2021 from https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0209
[5]
2021. CVE - CVE-2015-0289 Retrieved February 16, 2021 from https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0289
[6]
2021. CVE - CVE-2016-2181 Retrieved February 23, 2021 from https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-2181
[7]
2021. GitHub - Z3Prover/z3: The Z3 Theorem Prover. Retrieved February 15, 2021 from https://github.com/Z3Prover/z3
[8]
2021. IDA Pro – Hex Rays. Retrieved February 15, 2021 from https://www.hex-rays.com/products/ida/
[9]
2021. NVD - Home. Retrieved February 16, 2021 from https://nvd.nist.gov/
[10]
2021. Source Code of Robin. Retrieved December 10, 2021 from https://github.com/shouguoyang/Robin
[11]
2021. Vulnerability information. Retrieved December 10, 2021 from https://github.com/shouguoyang/Robin/blob/master/vulnerability_note.md
[12]
2021. x86 Function Attributes. Retrieved February 08, 2021 from https://gcc.gnu.org/onlinedocs/gcc/x86-Function-Attributes.html
[13]
2022. cve-2015-3196. Retrieved December 29, 2022 from https://cve.mitre.org/cgi-bin/cvename.cgi?name=cve-2015-3196
[14]
2022. fix commit for CVE-2017-13051. Retrieved December 29, 2022 from https://github.com/the-tcpdump-group/tcpdump/commit/289c672020280529fd382f3502efab7100d638ec
[15]
Sunwoo Ahn, Seonggwan Ahn, Hyungjoon Koo, and Yunheung Paek. 2022. Practical binary code similarity detection with bert-based transferable similarity learning. In Proceedings of the 38th Annual Computer Security Applications Conference. 361–374.
[16]
Thanassis Avgerinos, Sang Kil Cha, Alexandre Rebert, Edward J. Schwartz, Maverick Woo, and David Brumley. 2014. Automatic exploit generation. Communications of the ACM 57, 2 (2014), 74–84.
[17]
Johann Blieberger and Bernd Burgstaller. 1998. Symbolic reaching definitions analysis of Ada programs. In Proceedings of the International Conference on Reliable Software Technologies. Springer, 238–250.
[18]
Benjamin Bowman and H. Howie Huang. 2020. VGRAPH: A robust vulnerable code clone detection system using code property triplets. In Proceedings of the 2020 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 53–69.
[19]
Sang Kil Cha, Thanassis Avgerinos, Alexandre Rebert, and David Brumley. 2012. Unleashing mayhem on binary code. In Proceedings of the 2012 IEEE Symposium on Security and Privacy. IEEE, 380–394.
[20]
Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu, Yang Liu, Chia Yuan Cho, and Hee Beng Kuan Tan. 2016. Bingo: Cross-architecture cross-os binary search. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 678–689.
[21]
Jiarun Dai, Yuan Zhang, Zheyue Jiang, Yingtian Zhou, Junyan Chen, Xinyu Xing, Xiaohan Zhang, Xin Tan, Min Yang, and Zhemin Yang. 2020. BScout: Direct whole patch presence test for java executables. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). 1147–1164.
[22]
Yaniv David and Eran Yahav. 2014. Tracelet-based code search in executables. Acm Sigplan Notices 49, 6 (2014), 349–360.
[23]
Steven H. H. Ding, Benjamin C. M. Fung, and Philippe Charland. 2019. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 472–489.
[24]
Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin. 2020. Deepbindiff: Learning program-wide code representations for binary diffing. In Proceedings of the Network and Distributed System Security Symposium.
[25]
Zakir Durumeric, Frank Li, James Kasten, Johanna Amann, Jethro Beekman, Mathias Payer, Nicolas Weaver, David Adrian, Vern Paxson, Michael Bailey, and J. Alex Halderman. 2014. The matter of heartbleed. In Proceedings of the 2014 Conference on Internet Measurement Conference. 475–488.
[26]
Manuel Egele, Maverick Woo, Peter Chapman, and David Brumley. 2014. Blanket execution: Dynamic similarity testing for program binaries and components. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14). 303–317.
[27]
Dawson Engler and Daniel Dunbar. 2007. Under-constrained execution: making automatic code destruction easy and scalable. In Proceedings of the 2007 International Symposium on Software Testing and Analysis. 1–4.
[28]
Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE: Efficient cross-architecture identification of bugs in binary code. In Proceedings of the NDSS. 58–79.
[29]
Qian Feng, Minghua Wang, Mu Zhang, Rundong Zhou, Andrew Henderson, and Heng Yin. 2017. Extracting conditional formulas for cross-platform bug search. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. 346–359.
[30]
Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 480–491.
[31]
Jian Gao, Xin Yang, Ying Fu, Yu Jiang, Heyuan Shi, and Jiaguang Sun. 2018. Vulseeker-pro: Enhanced semantic learning based binary vulnerability seeker with emulation. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 803–808.
[32]
Jiyong Jang, Abeer Agrawal, and David Brumley. 2012. ReDeBug: finding unpatched code clones in entire os distributions. In Proceedings of the 2012 IEEE Symposium on Security and Privacy. IEEE, 48–62.
[33]
Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering (ICSE’07). IEEE, 96–105.
[34]
Zheyue Jiang, Yuan Zhang, Jun Xu, Qi Wen, Zhenghe Wang, Xiaohan Zhang, Xinyu Xing, Min Yang, and Zhemin Yang. 2020. PDiff: Semantic-based patch presence testing for downstream kernels. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 1149–1163.
[35]
Seulbae Kim, Seunghoon Woo, Heejo Lee, and Hakjoo Oh. 2017. Vuddy: A scalable approach for vulnerable code clone discovery. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 595–614.
[36]
S. Kiran Kumar and C. Pandu Rangan. 1987. A linear space algorithm for the LCS problem. Acta Informatica 24, 3 (1987), 353–362.
[37]
Zhe Lang, Shouguo Yang, Yiran Cheng, Xiaoling Zhang, Zhiqiang Shi, and Limin Sun. 2021. PMatch: Semantic-based patch detection for binary programs. In Proceedings of the 2021 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE, 1–10.
[38]
Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Hanchao Qi, and Jie Hu. 2016. Vulpecker: An automated vulnerability detection system based on code similarity analysis. In Proceedings of the 32nd Annual Conference on Computer Security Applications. 201–213.
[39]
Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou. 2018. αDiff: Cross-version binary code similarity detection with DNN. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 667–678.
[40]
Danjun Liu, Yao Li, Yong Tang, Baosheng Wang, and Wei Xie. 2018. VMPBL: Identifying vulnerable functions based on machine learning combining patched information and binary comparison technique by LCS. In Proceedings of the 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). IEEE, 800–807.
[41]
Kangjie Lu, Aditya Pakki, and Qiushi Wu. 2019. Detecting missing-check bugs via semantic- and context-aware criticalness and constraints inferences. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19). 1769–1786.
[42]
Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. 2019. Safe: Self-attentive function embeddings for binary similarity. In Proceedings of the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 309–329.
[43]
Kexin Pei, Zhou Xuan, Junfeng Yang, Suman Jana, and Baishakhi Ray. 2022. Learning approximate execution semantics from traces for binary function similarity. IEEE Transactions on Software Engineering 49, 4 (2022), 2776–2790.
[44]
David A. Ramos and Dawson Engler. 2015. Under-constrained symbolic execution: Correctness checking for real code. In Proceedings of the 24th USENIX Security Symposium (USENIX Security 15). 49–64.
[45]
Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. 2016. Sourcerercc: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering. 1157–1168.
[46]
Yusuke Sasaki, Tetsuo Yamamoto, Yasuhiro Hayase, and Katsuro Inoue. 2010. Finding file clones in FreeBSD ports collection. In Proceedings of the 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010). IEEE, 102–105.
[47]
Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[48]
Peiyuan Sun, Qiben Yan, Haoyi Zhou, and Jianxin Li. 2021. Osprey: A fast and accurate patch presence test framework for binaries. Computer Communications 173 (2021), 95–106.
[49]
Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: Brute Force Vulnerability Discovery. Pearson Education.
[50]
Sami Ullah and Heekuck Oh. 2021. BinDiff NN: Learning distributed representation of assembly for robust binary diffing against semantic differences. IEEE Transactions on Software Engineering 48, 9 (2021), 3442–3466.
[51]
Ilja Van Sprundel. 2005. Fuzzing: Breaking software in an automated fashion. December 8th (2005).
[52]
Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. 2022. Jtrans: Jump-aware transformer for binary code similarity detection. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. 1–13.
[53]
Yang Xiao, Bihuan Chen, Chendong Yu, Zhengzi Xu, Zimu Yuan, Feng Li, Binghong Liu, Yang Liu, Wei Huo, Wei Zou, and Wenchang Shi. 2020. MVP: Detecting vulnerabilities using patch-enhanced vulnerability signatures. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). 1165–1182.
[54]
Yang Xiao, Zhengzi Xu, Weiwei Zhang, Chendong Yu, Longquan Liu, Wei Zou, Zimu Yuan, Yang Liu, Aihua Piao, and Wei Huo. 2021. VIVA: Binary level vulnerability identification via partial signature. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 213–224.
[55]
Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363–376.
[56]
Yifei Xu, Zhengzi Xu, Bihuan Chen, Fu Song, Yang Liu, and Ting Liu. 2020. Patch based vulnerability matching for binary programs. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. 376–387.
[57]
Zhengzi Xu, Bihuan Chen, Mahinthan Chandramohan, Y. Liu, and Fu Song. 2017. SPAIN: Security patch analysis for binaries towards understanding the pain and pills. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE).462–472.
[58]
Zhengzi Xu, Yulong Zhang, Longri Zheng, Liangzhao Xia, Chenfu Bao, Zhi Wang, and Yang Liu. 2020. Automatic hot patch generation for android kernels. In Proceedings of the 29th USENIX Security Symposium (USENIX Security 20). 2397–2414.
[59]
Yinxing Xue, Zhengzi Xu, Mahinthan Chandramohan, and Yang Liu. 2018. Accurate and scalable cross-architecture cross-os binary code search with emulation. IEEE Transactions on Software Engineering 45, 11 (2018), 1125–1149.
[60]
Fabian Yamaguchi, Felix Lindner, and Konrad Rieck. 2011. Vulnerability extrapolation: Assisted discovery of vulnerabilities using machine learning. In Proceedings of the 5th USENIX Conference on Offensive Technologies. 13–13.
[61]
Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. 2012. Generalized vulnerability extrapolation using abstract syntax trees. In Proceedings of the 28th Annual Computer Security Applications Conference. 359–368.
[62]
Shouguo Yang, Long Cheng, Yicheng Zeng, Zhe Lang, Hongsong Zhu, and Zhiqiang Shi. 2021. Asteria: Deep learning-based AST-encoding for cross-platform binary code similarity detection. In Proceedings of the 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 224–236.
[63]
Hang Zhang and Zhiyun Qian. 2018. Precise and accurate patch presence test for binaries. In Proceedings of the 27th \(\lbrace\)USENIX\(\rbrace\) Security Symposium (\(\lbrace\)USENIX\(\rbrace\) Security 18). 887–902.
[64]
Lei Zhao, Yuncong Zhu, Jiang Ming, Yichen Zhang, Haotian Zhang, and Heng Yin. 2020. Patchscope: Memory object centric patch diffing. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 149–165.
[65]
Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. 2018. Neural machine translation inspired binary code similarity comparison beyond function pairs. In Proceedings of 26th Annual Network and Distributed System Security Symposium (NDSS’18).
