1 Introduction
Recurring vulnerabilities, also known as 1-day vulnerabilities [3], have spread widely in open source libraries due to code reuse and sharing [53], and have become one of the most significant threats in cyber-security. For example, the Heartbleed bug (CVE-2014-0160) discovered in OpenSSL, a 1-day vulnerability, affected 24% to 55% of popular HTTPS websites worldwide [25]. There are two major ways to detect vulnerabilities: dynamic and static approaches. Among dynamic approaches, fuzzing [49, 51] is the traditional and most commonly used technique. It executes the program with mutated inputs and monitors for abnormal behaviors, which often indicate potential vulnerabilities. However, fuzzing can only test code that program execution actually covers. As programs grow large, fuzzing exercises only a small portion of the code when searching for vulnerabilities. Due to this limited code coverage, fuzzing cannot confirm whether a given function is vulnerable.
Static methods are superior to dynamic ones for detecting 1-day vulnerabilities because they yield fewer false negatives. Given the widespread existence of 1-day vulnerabilities, minimizing false negatives (i.e., overlooked vulnerabilities) is a significant concern. These works leverage binary code similarity detection (i.e., function matching) techniques [20, 23, 28, 30, 35, 39, 55]: they extract various signatures from vulnerable functions, find similar functions, and report them as potentially vulnerable. This method can cover all functions in the entire program. However, existing approaches focus on improving the accuracy of function matching so as to precisely detect candidate functions that include the vulnerability. For example, DiscovRE [28] proposes multiple syntactic features and conducts matching between function control flow graphs (CFGs). Bingo [20] combines both syntactic and semantic information and applies a selective inlining strategy to achieve more accurate matching. Genius [30] and Gemini [55] propose a new function representation, the attributed CFG (ACFG); they encode ACFGs into vectors using machine learning and deep learning and thereby achieve faster and more accurate matching. BinDiff\(_{NN}\) [50] proposes a Siamese classification embedding network to highlight function changes and measure the semantic similarity between functions. Although these approaches match functions with small changes accurately, they tend to have a high false-positive rate (FPR) when used to find vulnerabilities. The false positives are caused by program patches. Specifically, a vulnerable function may have been patched so that the vulnerability is no longer present. Patches are usually small code changes relative to the size of the function [57], and function matching algorithms are designed to tolerate subtle changes, so a patched function is very likely to match the vulnerable function signature [56].
Thus, the vulnerability matching results are a mix of patched and vulnerable functions, which are difficult to differentiate and require careful, laborious confirmation by experts. Syntax-based patch presence detection algorithms have been proposed to filter out the patches and improve detection accuracy. For instance, Fiber [63] generates binary-level patch signatures from source-level patches and conducts signature matching against a given binary function. The matching process in Fiber relies on syntactic features such as CFGs and basic blocks, which are used to align instructions between target binaries and reference binaries. After alignment, Fiber generates symbolic constraints from patch-related code as patch signatures. PDiff [34] takes a different approach: it identifies patch-related anchor blocks, slices the function CFG based on them, and extracts symbolic formulas from the resulting paths to generate a patch digest. By quantifying the similarity of patch digests among the patched, vulnerable, and target functions, PDiff determines whether a patch is present in the target function. BinXray [56] goes further by performing detection using binary-level patches only. Specifically, it locates the patches in binary functions and extracts execution traces through the patches as signatures; it then matches the signature against the given target function to confirm patch presence. These approaches rely heavily on syntactic information and disregard the semantic differences introduced by patches. Syntactic information alone can be precise in capturing subtle changes. However, when source code is compiled into binary programs, the syntactic information can change substantially under different compilation settings [20, 30, 42]. Patch detection accuracy therefore drops when signatures compiled under one setting are matched against target binaries compiled under another. For instance, PDiff [34] fails to detect patches across binaries compiled with different optimizations because it relies on locating anchor blocks, which may be removed by optimization. Moreover, the results produced by function matching tools usually contain unrelated functions that are neither vulnerable nor patched [54], and existing approaches cannot distinguish vulnerable and patched functions from unrelated ones. Therefore, these approaches may not be practical for real-world programs.
Given the limitations of existing function matching tools, which tolerate the subtle code changes introduced by patches, and the shortcomings of current patch detection methods, we propose four key capabilities that a patch detection method should possess for effective verification of binary vulnerabilities:
C1. The ability to detect binary functions even in the absence of source-level debug information.
C2. Scalability to handle large binary programs efficiently.
C3. Precise identification of patched functions and vulnerable functions.
C4. Accurate detection of patches across different compilation optimizations by considering semantic differences rather than just syntactic differences.
C1 is necessary because patch detection is intended to be performed after function matching tools, which do not require access to source code, and it enhances function matching results by effectively eliminating patched functions. C2 ensures that the patch detection method is scalable and can handle real-world applications with a large number of functions. C3 allows for the precise identification of vulnerable and patched functions, which refines the results obtained from function matching. C4 is essential for handling differences caused by compilation settings, particularly compilation optimizations. Considering that real-world binary similarity detection techniques often yield functions compiled with different optimizations, and that optimization information is usually unavailable, it is imperative to have patch detection techniques that can accurately detect patches across different compiler optimizations [30, 55, 59].
To fulfill these four capabilities, we propose a semantic-based vulnerability confirmation tool called Robin. For C1, Robin identifies the patched code that fixes a vulnerability by diffing the vulnerable and patched functions at the binary level. It then employs symbolic execution on the binary functions to extract semantic features of the patch, enabling detection of patches or vulnerabilities in target binary functions. For C2, drawing inspiration from [16, 19], Robin utilizes a lightweight symbolic execution technique to generate a malicious function input (MFI) that efficiently drives execution from the function entry point to the patch code in patched functions or the vulnerability code in vulnerable functions. For C3 and C4, Robin incorporates a function execution monitor that checks whether the input triggers the same vulnerable behaviors (e.g., null pointer dereference (NPD)) or the same patched behaviors in the target function, thereby confirming the presence of a vulnerability or patch. Furthermore, Robin captures semantic behaviors from execution traces and uses behavior summaries (i.e., semantic features) to determine whether the target function is vulnerable or patched. Compared to Fiber's reliance on source code and syntactic feature matching, Robin generates patch signatures solely from binaries and utilizes MFI to extract possible vulnerability semantics without requiring syntactic information. This allows Robin to achieve better performance in patch presence detection across different optimization levels.
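The behavior monitor can be illustrated with a short sketch. The following Python code, built on the Angr framework, flags memory reads whose address may fall in the null page while a function is executed under a given input; the binary path, function address, and argument list are hypothetical placeholders, and the sketch is a simplified illustration of the idea rather than Robin's actual monitor.

# Minimal NPD-monitor sketch, assuming an angr-based executor.
import angr

NULL_PAGE = 0x1000  # treat accesses below this address as null dereferences

def run_with_npd_monitor(binary_path, func_addr, mfi_args):
    proj = angr.Project(binary_path, auto_load_libs=False)
    state = proj.factory.call_state(func_addr, *mfi_args)
    npd_hits = []

    def check_mem_read(st):
        addr = st.inspect.mem_read_address
        # record locations where a (near-)NULL dereference is satisfiable
        if addr is not None and st.solver.satisfiable(extra_constraints=[addr < NULL_PAGE]):
            npd_hits.append(st.addr)

    state.inspect.b('mem_read', when=angr.BP_BEFORE, action=check_mem_read)
    simgr = proj.factory.simulation_manager(state)
    simgr.run(n=200)  # bounded execution along the input-driven path
    return npd_hits

Under this sketch, a non-empty result for the target function indicates vulnerable behavior under the MFI, while an empty result under the same input is evidence of patched behavior.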
We have implemented a prototype of Robin and made it open source [10]. Our evaluation of Robin on 287 real-world vulnerabilities in 10 software projects from different application domains shows that our method achieves an average accuracy of 80.0% and 80.5% for patch detection and filtering, respectively, across different optimization levels and compilers. These results outperform the state-of-the-art patch detection tools BinXray, PMatch, and Fiber by large margins. Furthermore, we have used Robin to filter out patches in the results produced by the function matching tools Gemini [55], SAFE [42], and Bingo [20]; the results demonstrate significant FPR reductions of 95.13%, 92.31%, and 95.36%, respectively, while also improving recall. Additionally, Robin detected 12 new potentially vulnerable functions in our experiments.
In summary, our work makes the following contributions:
— We conduct a study (Section 2.1) on function-matching-based vulnerability detection tools to demonstrate their inability to distinguish patches from vulnerabilities.
— We propose MFI, a novel, carefully crafted function input that steers function execution to patched or vulnerable code. Building MFI and using it for patch detection does not require debug information from source code, making our tool more scalable in binary patch detection since patch source code is not always available.
— We implement a prototype, Robin, that employs MFI to detect patch presence and verify vulnerabilities across different optimizations, and we open-source it [10].
— We evaluate Robin on 287 real-world vulnerabilities and show that it identifies the target patched functions with 80.0% accuracy in 0.47 seconds per function.
— We conduct patch detection on candidate functions output by function matching. The results show that after detection, the FPRs of function matching tools are significantly reduced, by an average of 94.27% in the top 10 results. Besides, Robin detects 12 new potentially vulnerable functions.
4 Evaluation
We aim to answer the following research questions (RQs):
RQ1: How accurate is Robin for patch detection across different compilation optimization levels, compilers, and architectures, compared to state-of-the-art related works?
RQ2: What is the performance of Robin compared to state-of-the-art related works?
RQ3: How much can Robin improve the accuracy of state-of-the-art function matching-based vulnerability detection tools?
We implement Robin in Python with 8,592 lines of code, supporting Intel x86 32-bit, 64-bit, and ARM. We utilize IDA Pro [8] and IDAPython to disassemble the binary functions and construct the CFGs, and we use the symbolic execution engine Angr (9.0.5171) [1] and the theorem prover Z3Prover [7] to solve the PoC constraints. All programs run on an Ubuntu server with a 56-core Intel Xeon E5-2697 CPU @ 2.6 GHz and 256 GB of memory.
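As a concrete illustration of how Angr can drive execution from a function entry to a target (patch) block and concretize an input, consider the following minimal sketch. The single-argument prototype, the binary path, and the addresses are assumptions made for illustration; Robin's actual MFI construction involves additional path selection and constraint handling.

# MFI-style input generation sketch with angr (illustrative only).
import angr
import claripy

def generate_input_to_block(binary_path, func_entry, patch_block_addr):
    proj = angr.Project(binary_path, auto_load_libs=False)
    arg0 = claripy.BVS('arg0', 64)                 # one symbolic argument (assumed prototype)
    state = proj.factory.call_state(func_entry, arg0)
    simgr = proj.factory.simulation_manager(state)
    simgr.explore(find=patch_block_addr)           # search for a path reaching the patch block
    if not simgr.found:
        return None                                # no feasible path found
    return simgr.found[0].solver.eval(arg0)        # concrete argument reaching the patch code

The concrete value returned plays the role of an MFI for that signature: replaying it on a target function and observing the resulting behavior is the basis of the online detection phase.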
4.1 Experiment Setup
4.1.1 Dataset.
To test the accuracy and performance of Robin, we select 10 well-known real-world projects from various application areas (e.g., cryptography and image processing) and collect the corresponding vulnerability descriptions (e.g., software name, vulnerable and patched versions, and function names) from the NVD [9]. For each collected vulnerability, we compile the vulnerable and patched versions of the project source code with GCC 7.5.0 and extract the vulnerable and patched functions. For patches applied to multiple functions, we manually select the function(s) that contain(s) the vulnerability for analysis. Then, we select functions from versions before the patch as the vulnerable target functions and functions from versions after the patch as the patched target functions to form the ground-truth data for the experiment. Specifically, we choose the first and last versions from both the vulnerable version range and the patched version range. For example, CVE-2015-0288 [2] is a vulnerability in OpenSSL versions 1.0.1a~1.0.1l, and the versions after 1.0.1l (1.0.1m onward) are patched. We extract the vulnerable function from version 1.0.1l and the patched function from 1.0.1m to build the signature. We regard versions 1.0.1a~1.0.1k as the vulnerable target versions and versions 1.0.1n onward as the patched target versions; from the patched range, we take the first version 1.0.1n and the last version 1.0.1u as the patched targets, since 1.0.1u is the latest version of OpenSSL 1.0.1.
Table 3 shows the dataset used to evaluate Robin. In total, we compiled 209 different versions of programs from 10 different projects, covering 287 CVEs. The vulnerability types include NPD, buffer overflow, integer overflow, double free, and use after free. Except for the compiler optimization level, we use the default compilation configuration to build the binaries so that they are closer to real-world cases. We compile each program version at four optimization levels (O0 to O3). To test detection accuracy across compilers and architectures, we choose the two projects with the largest number of test cases (OpenSSL and Tcpdump). We use ICC (version 2021.1) and Clang (version 6.0) to compile them for the cross-compiler dataset at the O0 optimization level, and we use the arm-linux-gcc (v7.5.0) compiler to compile them for the ARM dataset, also at O0.
4.1.2 Weight Assignment.
As shown in Equation (3), Robin combines four semantic feature scores with different weights \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\). To determine these weights, we adopt the linear regression algorithm from machine learning [47], which learns the optimal weights to fit Equation (3) for predicting vulnerabilities and patches. We first use Robin to collect the semantic features \(\Delta _t\) of all O0-optimized target functions. We then train and test the linear regression model ten times using \(\Delta _v\), \(\Delta _p\), \(\Delta _t\), and the target function label (i.e., 1 for a patched function and \(-1\) for a vulnerable function). Each time, we randomly divide the dataset at a ratio of 8:2, with 80% for training and 20% for testing. Each run yields a set of weight values for \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) together with the corresponding test accuracy. We choose the set of weight values with the maximum accuracy, which is \(\alpha =0.57\), \(\beta =0.11\), \(\gamma =0.18\), and \(\delta =0.14\), with a 92.42% test accuracy.
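The weight-selection procedure can be sketched as follows, assuming Equation (3) takes the form of a weighted sum \(\alpha s_1 + \beta s_2 + \gamma s_3 + \delta s_4\) over the four feature scores; the feature matrix X (one row of four scores per target function) and label vector y are placeholders, and the sketch is illustrative rather than Robin's exact training code.

# Sketch of the weight-assignment step: fit a linear model over the four
# semantic feature scores to predict the label (+1 patched, -1 vulnerable),
# then keep the run with the best held-out accuracy.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_weights(X, y, runs=10, seed=0):
    best = None
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed + r)     # 8:2 split
        model = LinearRegression().fit(X_tr, y_tr)
        # classify as patched when the predicted score is positive
        acc = np.mean(np.sign(model.predict(X_te)) == y_te)
        if best is None or acc > best[0]:
            best = (acc, model.coef_)                        # coefficients ~ (alpha, beta, gamma, delta)
    return best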
4.2 Accuracy Evaluation (RQ1)
We select two vulnerable target functions and two patched target functions for each CVE in our ground-truth dataset, forming 817 test cases. In the patch detection evaluation, we regard functions with similarity scores greater than 0 as patched.
4.2.1 Cross-optimization Levels Patch Detection.
In cross-optimization detection, we use Robin to conduct detection on binaries compiled with different compilation optimizations. Table 4 shows the cross-optimization-level patch detection accuracy of Robin. The rows show the optimization level used to compile the target programs, and the columns give the optimization level used to compile the signatures. Specifically, the top half of columns 2 to 4 shows the accuracy for large projects (i.e., OpenSSL, Binutils, Tcpdump), which have more than 100 test cases each, and the bottom half gives the accuracy for the remaining smaller projects (i.e., Freetype, Ffmpeg, Mixed). In the last column, \(Cross\) denotes the average accuracy when matching signatures against targets compiled at different optimization levels over all projects, and \(No-Cross\) denotes the accuracy when the CVE is matched at the same optimization level. In Table 4, the bold values represent the detection accuracy in non-cross-optimization settings, serving as a benchmark for the cross-optimization results. From the table, we can see that Robin achieves high accuracy for all cross-optimization combinations, ranging from 60% to 98% with an average of 80.0%. In general, predicting patch presence when the signature and target share the same optimization level yields higher accuracy (88%~93%) than across different optimization levels (75%~81%).
Result Discussion. We manually analyze the false positive and false negative cases and summarize two causes. First, during the patch localization phase, Robin mistakenly takes some changed blocks as patch blocks and thus generates ineffective MFI; using such MFI leads to both false positives and false negatives in patch detection. Second, the memory layout of a structure may change slightly across program versions. Since Robin accesses the member variables of structure objects by memory offset, it may evaluate the wrong variable when the layout changes, which also produces false positives and false negatives.
4.2.2 Related Works Comparison.
We select the most relevant state-of-the-art patch presence detection tools, BinXray, Fiber, and PMatch, and compare their accuracy against Robin. We use the same dataset as in Section 4.2 to measure their cross-optimization and cross-compiler abilities to detect patched functions. Since BinXray is tested at the O0 optimization level in [56], we follow the same setting and generate its signatures from binaries compiled at O0; the signatures are then used to detect patch presence in target binaries compiled at all optimization levels (O0 to O3). For Fiber and PMatch, we prepare the tools for detection by following their official recommendations. Fiber's signature generation requires elaborately prepared patch data, including the patch code (in the form of git commits), the software source code of different versions, and the symbol tables of the target binaries. Among the 287 CVEs, we successfully gather patch data for 210 by downloading commits from the URLs listed on the NVD website; consequently, Fiber does not support the 77 CVEs that lack patch commits. We employ Fiber to extract patch signatures for the 210 CVEs, but only 65 signatures are generated successfully. The fundamental cause of failure is that the root instruction, the primary reference for signature generation, cannot be found by Fiber. Considering that Fiber is designed to detect patches in the Android kernel, we believe its application to general software requires additional adaptation. This conclusion is supported by research [48] that conducts comparative studies with Fiber.
Table 5 shows the non-cross-optimization and cross-optimization comparison results for these tools. The first row gives the tool names. The second row gives the optimization levels at which the target binaries are compiled. The third row gives the number of test cases used in each sub-experiment; since target functions may be inlined into other functions when the binary is compiled at high optimization levels, the number of test cases decreases as the optimization level increases. The fourth and fifth rows give the number and percentage of cases on which each tool successfully conducts patch detection and outputs results. Robin and PMatch support all test cases, while BinXray and Fiber support fewer and fewer as the optimization level increases. The sixth row gives the patch detection accuracy among the supported cases. The last row of Table 5 reports, in bold, the overall accuracy in detecting patched functions compiled at various optimization levels; these values reflect the detection accuracy of the different tools in real-world scenarios.
Non-cross Optimization Comparison. The columns labeled “O0” give the non-cross-optimization detection results, since the patch signatures are generated from O0-compiled binaries. In general, Robin and PMatch are more robust than BinXray and Fiber, as they support all functions at all optimization levels, whereas BinXray only manages to generate signatures for a few cases when the optimization level is high. Fiber detects 186 test cases at O0 with a 54.8% accuracy rate. PMatch and Robin achieve high accuracy of 90.9% and 92.5%, respectively.
Cross Optimization Comparison. The columns “O1” to “O3” show the cross-optimization detection results. As the compilation optimization level grows, BinXray and Fiber exhibit increasingly poor scalability; for example, under O3 optimization they support patch detection for only 72 and 152 out of 621 target functions, respectively. PMatch and Robin scale well across optimizations. However, PMatch's accuracy decreases when detecting patches across optimization levels, in line with the trend of the other tools. The decrease occurs because PMatch's detection relies mostly on the distinct code blocks retrieved from target functions by diffing them against vulnerable functions; when the target and vulnerable functions are compiled with different optimizations, the extracted code blocks differ markedly from the reference blocks in the signatures, and PMatch cannot detect patched code blocks from them. As the optimization level rises, Robin's detection accuracy remains comparatively high (71.5%–76.6%).
Case Study on CVE-2015-3196. To illustrate Robin's advantage, we examine the vulnerability confirmation and patch detection of CVE-2015-3196 [13] under O1 optimization. CVE-2015-3196 exists in OpenSSL 1.0.1 versions prior to 1.0.1p (i.e., 1.0.1a-1.0.1o) and is patched in OpenSSL 1.0.1p-1.0.1u (1.0.1u being the latest version). The target function from O1-compiled OpenSSL 1.0.1u is selected as a patched function. BinXray's patch detection begins with a comparison between the target function and the vulnerable function signature from O0-compiled OpenSSL 1.0.1o. Because the syntactic differences between the target function and the signature function are substantial due to the different compilation optimizations, the diffing process outputs numerous changed code blocks, and the following phase, code trace generation, cannot be performed because the number of paths traversing these blocks explodes. Consequently, BinXray cannot reach a decision on the target function. PMatch [37] fails to detect the patch in the target function for the same reason, as it mainly relies on diffing results. Fiber [63] cannot identify the patch in stripped binaries since it relies entirely on debug information, which stripping removes. Since Robin employs MFI to drive execution from the function entry to the patch code, it performs precise patch detection on the target function: as long as the semantic execution logic is consistent, Robin is tolerant of the syntactic code modifications introduced by optimization.
4.2.3 Cross-compiler Detection.
To evaluate cross-compiler detection accuracy, we select the two projects with the largest number of test cases (OpenSSL and Tcpdump) and compile them with different compilers: ICC (version 2021.1) and Clang (v6.0), in addition to GCC. During compilation, we set the optimization level to O0. We then conduct patch presence detection on binaries compiled by different compilers (i.e., ICC-x86, Clang-x86) and a different architecture (i.e., GCC-arm) using the CVE signatures generated from GCC-x86. Table 6 presents the supported test cases and the accuracy of Robin and the baseline tools for cross-compiler patch detection; the bold values show the detection results of the best-performing tool, Robin, and serve as a benchmark for comparison with the other tools. Robin achieves high accuracy on both Clang-compiled and ICC-compiled binaries, with 77.4% and 83.6% accuracy, respectively. Since Clang uses the stack and registers differently in the generated binaries, the detection accuracy is slightly lower than for ICC. Robin also supports all test cases because it can feed MFI to any function and execute it to obtain semantic features. The detection accuracy of BinXray and Fiber is low because basic blocks change significantly across compilers; the syntactic features are not stable enough, so BinXray and Fiber fail to match them against the original signatures. Owing to these syntactic changes, PMatch has high scalability but low detection accuracy. Since the architecture can be identified from a binary's file header, cross-architecture patch detection is not strictly necessary in practice; nevertheless, our approach also applies to different architectures. We run Robin to conduct patch detection on target functions from ARM binaries: over 347 test cases in total, Robin achieves 87.7% accuracy.
4.2.4 Scalability of Robin.
We study Robin's limitations and explore how it scales to functions of different sizes.
MFI Generation. Robin generates 287 MFIs out of 292 CVEs, with five CVEs failing to produce an MFI. For the failed cases, we manually examine the candidate paths to the patch code and find that the patch code (especially the mitigation point) is located after a large number of indirect jump instructions, making it difficult for symbolic execution to determine the correct jump target. For instance, the patch code for CVE-2017-13051 [14] is situated in a “switch” statement with multiple “case” branches. In assembly code, the switch statement is implemented as a jump table, i.e., a table storing the addresses of the case statements. The patch code cannot be reached because symbolic execution cannot determine which address should be selected from the jump table.
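This failure mode surfaces in the symbolic executor as successors whose instruction pointer is fully symbolic. The following Angr-based sketch, with placeholder binary path and addresses, illustrates how such blocked paths can be recognized; it is a simplified illustration rather than Robin's implementation.

# Detecting the jump-table failure mode: exploration toward the patch block
# yields no found path, only states with a symbolic instruction pointer.
import angr

def path_blocked_by_indirect_jump(binary_path, func_entry, patch_block_addr):
    proj = angr.Project(binary_path, auto_load_libs=False)
    state = proj.factory.call_state(func_entry)
    simgr = proj.factory.simulation_manager(state, save_unconstrained=True)
    simgr.explore(find=patch_block_addr, n=500)   # bounded search
    # No path reached the patch block, but some successors had an
    # unconstrained (symbolic) jump target, typical of an unresolved jump table.
    return not simgr.found and len(simgr.unconstrained) > 0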
Accuracy in Different Size Functions. We perform a statistical analysis of the size distribution (i.e., the number of basic blocks) of the 871 tested functions and the corresponding patch detection accuracy. The first column of Table 7 lists the function size ranges, and the second column gives the number of functions within each range. Function sizes range from 5 to 1,000, and the majority are below 205. The third column shows the patch detection accuracy for functions of each size range. The accuracy ranges from 0.833 to 0.931, indicating reasonably accurate detection, and shows no trend of diminishing accuracy as function size increases. Table 8 details the MFI path-length distribution over the 187 signature functions and the corresponding detection accuracy. The first column indicates the length of the path (i.e., the number of basic blocks on the path) on which the MFI is created; this length also approximates the distance, or depth, between the patch block and the function entry. The second column gives the number of signature functions (i.e., patched functions) used in MFI construction and signature generation. Path lengths range from 1 to 60, with the range (5, 10) accounting for the largest share, 47 functions. The third column gives the detection accuracy obtained using MFIs of the corresponding path lengths. The accuracy ranges between 0.888 and 0.93, indicating consistent detection performance. In terms of both function size and MFI path length, Robin maintains a high level of detection accuracy; in other words, its scalability for patch detection is robust.
4.3 Performance Evaluation (RQ2)
Figure 4(a) reports the average time breakdown of offline CVE signature generation and the average online detection time. As shown in Figure 4(a), PBS takes 0.0003 seconds. FPH takes 14.8 seconds, which is the time to find a feasible path for one changed block; on average, each patched function contains 6.1 changed blocks. IR takes 0.26 seconds. Input-driven Execution (IE) takes 0.38 seconds, which includes signature extraction time. IS takes 1.76 seconds. The whole offline phase of Robin takes an average of 133 seconds, as shown in Figure 4(b), and the online detection time per function is 0.5 seconds on average.
Comparison With Baseline Tools. As shown in Figure 4(b), Robin's offline phase requires more time than the baseline tools; in particular, Robin, BinXray, PMatch, and Fiber require an average of 133, 0.2, 30, 5.9, and 6.3 seconds, respectively. During the detection phase, Robin, BinXray, and PMatch require little time per function: 0.518 s, 1.020 s, and 0.125 s, respectively. Whereas Robin and BinXray require only two functions (the vulnerable and patched functions) to work properly, PMatch requires manual selection of patch code blocks, and Fiber requires source code preparation and debug information to be present in the target binary, making both tools considerably less scalable.
Compared with the results reported in BinXray's paper, BinXray's performance declines here for two reasons. (1) In BinXray's experimental setting, patches are detected between binaries produced with the same optimization, whereas our evaluation measures BinXray's performance on binaries produced with various optimizations; as a result, the calculation time increases dramatically as the number of changed blocks rises. (2) BinXray requires varying amounts of time on different applications. For example, according to BinXray [56], binutils and openssl require 911.32 ms and 246.47 ms per function, respectively. In BinXray's dataset, openssl accounts for 36.6% of the functions and binutils for only 0.06%, so the reported average detection time is low; in our dataset, openssl accounts for 10.2% of the functions and binutils for 13.5%. Consequently, the box plot in Figure 4(b) shows a longer detection time for BinXray. Considering that Robin executes the functions while BinXray only matches syntactic information, Robin's performance is competitive. Moreover, BinXray cannot handle many cases due to the path explosion problem; if the time wasted attempting to generate signatures for the unsupported cases were counted, BinXray's average performance would be worse than Robin's. Considering the significant accuracy improvement brought by Robin, the tradeoff between performance and accuracy from using semantic features is acceptable.
4.4 Vulnerability Detection Improvement (RQ3)
The primary application of Robin is to reduce false positives while retaining the recall rate in the vulnerability detection results produced by function matching tools. We apply Robin and BinXray for patch detection on the function matching results discussed in Section 2.1. These two tools re-score and re-rank the top 50 candidate functions in the function matching results, with the aim of ranking vulnerable functions higher (i.e., close to 1) and patched functions lower (i.e., close to 50). We use the same metrics introduced in Section 2.1; the results are provided in Table 9. The first column of Table 9 shows the combination of BCSD and patch detection tools used. For example, “Gemini + BinXray” indicates that we first used the BCSD tool Gemini to determine the top 50 functions most similar to a given vulnerable function, which may include patched functions. The integration of Robin (i.e., [BCSD tool] + Robin) retains Recall@Top-1 at 85.71%, while Recall@Top-5 improves from 91.66% to 95.24% and Recall@Top-10 improves from 91.66% to 97.62% on average. Notably, the Recall@Top-5 and Recall@Top-10 of Gemini increase to 100%. Table 9 also shows the FPRs after applying Robin and BinXray to Gemini, SAFE, and Bingo in vulnerability detection. The results reveal that Robin reduces FPRs far more than BinXray, whose scalability is inadequate (as also indicated in RQ2). In particular, Robin reduces the FPRs of the top 10 results from 89.13% to 4.34% for Gemini, from 84.78% to 6.52% for SAFE, and from 93.48% to 4.34% for Bingo. In contrast, BinXray only reduces FPR@Top-1 to around 55%, indicating that it still misses almost half of the patched functions. Its poor scalability is primarily caused by extra code modifications in target patched functions (e.g., code added after the version used as the signature patched function) or by BinXray's inability to locate matched code traces in the patched functions.
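For clarity, the ranking metrics reported in Table 9 can be illustrated with a small computation; the label scheme and the per-query indicator form of Recall@Top-k and FPR@Top-k used here are simplifying assumptions for illustration, while the exact definitions follow Section 2.1.

# Illustrative computation of Recall@Top-k / FPR@Top-k over re-ranked
# candidate lists, where each candidate is labeled 'vuln', 'patched',
# or 'unrelated' (assumed labels for this sketch).
def recall_at_k(ranked_labels, k):
    return int('vuln' in ranked_labels[:k])       # true vulnerable function retrieved in top-k

def fpr_at_k(ranked_labels, k):
    return int('patched' in ranked_labels[:k])    # a patched (false positive) function survives in top-k

def average_metrics(all_rankings, k):
    n = len(all_rankings)
    recall = sum(recall_at_k(r, k) for r in all_rankings) / n
    fpr = sum(fpr_at_k(r, k) for r in all_rankings) / n
    return recall, fpr

# Example: three queries re-ranked by a patch detector.
rankings = [
    ['vuln', 'patched', 'unrelated'],
    ['unrelated', 'vuln', 'patched'],
    ['patched', 'unrelated', 'vuln'],
]
print(average_metrics(rankings, k=1))  # -> (0.333..., 0.333...)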
From the table, we can see that the FPR after Robin's detection is reduced significantly. We illustrate the improvement using the matching and detection results of the OpenSSL project, which has the most CVEs. Figure 5 shows the ranking results of function matching (a) and of patch detection after applying Robin (b). In Figure 5(a), we plot the ranking of target functions among the candidate functions; functions marked in red are vulnerable and functions marked in green are patched. Gemini, SAFE, and Bingo rank both vulnerable and patched functions highly, and for vulnerability detection tasks the patched functions are false positives. In Figure 5(b), we plot the re-ranking results of Robin, obtained by using Robin to conduct vulnerability confirmation and patch detection on the function matching results. The figure shows that Robin ranks patched functions lower. Besides, Figure 6 (where the x-axis represents scores and the y-axis represents distribution density) shows that Robin gives all unrelated functions (in blue) scores around 0, suggesting that Robin can distinguish and filter out the irrelevant functions from the vulnerable and patched functions in the matching results. After re-scoring, we can also use thresholds to refine the candidate functions with a low FPR and high recall, as shown in Table 10.
4.4.1 New Vulnerability Detection.
Traditional approaches, such as Gemini, focus on matching identical or similar functions between two binaries; they therefore aim at detecting 1-day vulnerabilities that recur across different programs. However, the results of function matching tools usually contain false positives, which may nevertheless harbor new vulnerabilities since they share similar code patterns with the vulnerable functions. Robin, in contrast, can not only trigger the vulnerabilities in known vulnerable functions but also detect new vulnerabilities among the false-positive results. From Figure 6, we can see that there are unrelated functions with scores of \(-1\). We manually analyze the candidates with a \(-1.0\) score (i.e., highly similar to a vulnerable function signature) and detect 15 candidates with vulnerable behaviors; we list their specifics in Table 11. The first column specifies the CVEs used to generate the PoC inputs. The second and third columns specify the versions and candidate function names that have a similarity score of \(-1.0\). The fourth column specifies whether the function has already been reported as a CVE. The last column gives the file names and line numbers of the vulnerable points after manual examination. From the table, we can see that only 3 candidates have been discovered before; the remaining 12 cases exhibit vulnerable behavior when given the inputs in the signature. For example, the function RSA_check_key has a \(-1.0\) score when matched against the signature of CVE-2015-0289. The source code of RSA_check_key is listed in Listing 2. It performs pointer dereferences without a null-pointer check at the statement if (!key->p || !key->q || !key->n || !key->e || !key->d). We also verified that one of its callers does not perform a null-pointer check. Attackers can leverage the dereferences to manipulate memory contents and break the system.
6 Related Works
In this section, we discuss works on binary function matching-based vulnerability detection and patch presence detection.
Function Matching Based Vulnerability Detection. Code clone-based vulnerability detection is an efficient static approach to scanning binary programs. BLEX [26] measures binary code similarity by calculating the memory access differences between two program traces. TRACY [22] divides a binary function into partial basic block traces and measures similarity based on them, which helps match functions whose basic blocks differ across optimization levels. DiscovRE [28] uses numeric features to filter out dissimilar functions and compares control flow graphs to determine matching pairs. Genius [30] combines control flow graph structural features and function numeric features to match and detect vulnerabilities in firmware. Gemini [55] addresses cross-platform code similarity detection by embedding the control flow graph of a binary function into numeric vectors and calculating the distance between the vectors to determine similarity. Bingo [20] proposes a selective inlining technique to match functions compiled at high optimization levels. Bingo-E [59] combines structural, semantic, and syntactic features to match functions across different compiler settings, and further introduces a code execution method to boost matching performance and accuracy. Asm2Vec [23] leverages deep learning models to learn assembly code representations without prior knowledge, then matches functions by calculating the distance between function embeddings. \(\alpha\)Diff [39] automatically captures binary function features via machine learning models and uses them to measure similarity. Trex [43] applies a transfer-learning-based framework to automate learning execution semantics from functions' micro-traces, which are a form of under-constrained dynamic traces. Jtrans [52] embeds the control flow information of binary code into Transformer-based language models by utilizing a novel jump-aware representation pre-training task. Vgraph [18] extracts three kinds of code properties, from contextual code, vulnerable code, and patched code, to construct vulnerability-related representations. Other works [15, 24, 31, 39, 62, 65] also leverage machine learning models to measure the similarity between function pairs, and there are source-code-level vulnerability detection works [33, 38, 45, 46, 60, 61] that use function matching-based approaches. These works aim at detecting function clones in binary programs across different compilation settings. However, when directly applied to searching for 1-day vulnerabilities, they suffer from very high FPRs due to patches. Robin helps these works filter out false-positive cases by accurately identifying patched functions, making code clone detection a practical solution for 1-day vulnerability search.
Patch Presence Detection. Several works [40, 56] have been proposed to identify patched functions in matching results. Fiber [63] first attempts to detect patches in Android Linux kernels with the help of source-level patch changes; it leverages the symbol table to locate the binary functions and uses the code changes to match patched functions. BinXray [56] adopts a basic block mapping method to locate changed basic blocks and proposes an algorithm to extract noise-resistant patch signatures, which it then matches to determine whether a function is patched. PDiff [34] captures the semantics of a patch by learning from the source code and performs patch presence testing on binary-only downstream kernels. VMPBL [40] also aims at detecting vulnerabilities and improves matching accuracy by using patched functions as part of the signature. PMatch [37] retrieves patch code from target functions based on manually selected patch code blocks. These approaches rely heavily on syntactic information, so their accuracy drops when target functions are compiled with settings that substantially alter the syntax. Some of them also require source code, which is not always available. Our tool extracts semantic information, which remains robust when identifying patched functions across different compiler optimization levels.
Patch Analysis. Patch analysis has become a popular means of understanding software security. PatchScope [64] performs a large-scale review of patch-related code changes in software and proposes a memory object access sequence model to capture the semantics of patches. Vulmet [58] produces semantic-preserving hot patches by learning from official patches. BScout [21] predicts patch presence in Java executables by linking Java bytecode semantics to the source code. ReDebug [32] finds unpatched code clones at OS-distribution scale by diffing patched code against vulnerable code. VUDDY [35] creates function fingerprints by hashing normalized function code and then performs fast lookups over the hash values. MVP [53] uses program traces to capture vulnerability and patch semantics in order to find recurring vulnerabilities in source code.