
FuSeBMC v4: Improving Code Coverage with Smart Seeds via BMC, Fuzzing and Static Analysis

Published: 26 June 2024
    Abstract

    Bounded model checking (BMC) and fuzzing are among the most effective techniques for detecting errors and security vulnerabilities in software. However, existing methods still miss errors because they cannot cover large portions of the target code. We propose FuSeBMC v4, a test generator that synthesizes seeds with useful properties, which we refer to as smart seeds, to improve the performance of its hybrid fuzzer and thereby achieve high coverage of C programs. FuSeBMC works by first analyzing the given C program and incrementally injecting goal labels into it to guide the BMC and evolutionary fuzzing engines. After that, the engines are employed for an initial period to produce the so-called smart seeds. Finally, the engines are run again, with these smart seeds as starting seeds, in an attempt to achieve maximum code coverage and find bugs. During seed generation and normal running, the Tracer subsystem coordinates the engines. This subsystem conducts additional coverage analysis and updates a shared memory with information on the goals covered so far. Furthermore, the Tracer evaluates test-cases dynamically and converts suitable cases into seeds for subsequent fuzzing. Thus, the BMC engine can provide a seed that allows the fuzzing engine to bypass complex mathematical guards (e.g., input validation). As a result, we received three awards for participation in the fourth international competition on software testing (Test-Comp 2022), outperforming all state-of-the-art tools in every category, including the coverage category.

    1 Introduction

    Fuzzing is one of the essential techniques for discovering software bugs and is used by major corporations such as Microsoft [41] and Google [29]. Fuzzers construct inputs known as seeds and then run the program under test (PUT) on these seeds. The goal is to discover a bug by causing the PUT to crash. A secondary but essential goal is to cover as many program branches as possible since a bug occurring on a branch cannot be discovered if the branch is never explored. Broadly, fuzzers fall into three categories. Firstly, blackbox fuzzers do not analyze the target program when generating seeds. Secondly, whitebox fuzzers extensively analyze the target program to guide seed generation towards particular branches. Lastly, greybox fuzzers use limited program analysis and feedback from the PUT to guide input generation. There are also hybrid fuzzers that combine a concolic executor with a coverage-guided fuzzer, coordinated by a scheduling and synchronization mechanism.
    The main disadvantage of blackbox fuzzers is that, due to the random manner in which they generate inputs, they are often unable to explore program paths with complex guards. Whitebox fuzzers, on the other hand, are very good at using program information to circumvent guards but are often slow and resource-intensive to run. Greybox fuzzing techniques such as American Fuzzy Lop (AFL) [2] offer a sweet spot regarding effort per input. However, they still have some fundamental weaknesses; most importantly, the straightforward way they generate seeds can lead to the fuzzer becoming stuck in one part of the code and not exploring other branches. Hybrid fuzzing attempts to circumvent this issue with more significant program-specific analysis. One common technique is concolic fuzzing, which uses a theorem prover to solve path constraints and thereby helps the fuzzer explore deeper into the program [8, 58, 72].
    This article presents FuSeBMC, a state-of-the-art hybrid fuzzer incorporating various innovative features and techniques. This journal article is based on several published conference articles [4, 5, 6]. In practice, we concentrated on the enhancements made to FuSeBMC between 2021 (when our TAP article [4] was published) and 2022. In this journal article, we expand on these enhancements, such as using the Tracer subsystem, shared memory, and analyzing and ranking goals. In addition, we demonstrate the advancement achieved by carrying out a more thorough experimental evaluation. To summarize, we extend those articles by (i) discussing FuSeBMC in greater detail (Section 3), (ii) providing more examples, and (iii) providing a thorough and up-to-date experimental evaluation of the tool (Section 4).
    One of the primary features of FuSeBMC is the linking of a greybox fuzzer with a bounded model checker. A bounded model checker works by treating a program as a state transition system and then checking whether there exists a path in this system of length at most some bound k that violates the property to be verified [18, 32]. In this work, we use ESBMC, an efficient bounded model checker for C++ and C with support for checking many safety properties fully automatically [32]. ESBMC works by translating the property to check and the bounded transition system into quantifier-free first-order logic. SMT solvers are then run to find a model for \(C \wedge \lnot P\), where C is the translated transition system and \(\lnot P\) is the negated property. Finally, if a model is found, a counterexample is extracted, representing the set of assignments required to violate the property.
    Bounded model checkers such as ESBMC are now mature software, used industrially [38] and capable of finding bugs in production software. We leverage this power of model checkers as a method for seed generation. During greybox fuzzing, if a particular branch has not been explored, ESBMC can provide a model (set of assignments to input variables) that will reach the branch. This model is then used as a seed for further greybox fuzzing. We evaluate seeds based on two criteria—the depth of the seed’s deepest goal and the number of goals covered specifically by the seed. Smart seeds are those that score high on these metrics. The technique is implemented in FuSeBMC [6], and it is available for download from GitHub.1
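    To make this concrete, consider the following sketch (our own illustration, not taken from the paper). The outer condition is a complex mathematical guard that random mutation is unlikely to satisfy, whereas a bounded model checker can solve it and emit the satisfying assignment as a seed for the greybox fuzzer, which can then cheaply explore the branches behind the guard:

```c
#include <stdio.h>

/* Illustrative only: the guard below holds only for x in {3, 4}. A bounded
 * model checker can solve the constraint and hand the satisfying assignment
 * to the greybox fuzzer as a seed; the fuzzer then reaches the branches on y
 * behind the guard by cheap mutation. */
int process(int x, int y) {
    if (x * x - 7 * x + 12 == 0) {   /* complex guard: only x == 3 or x == 4 */
        if (y < 0)
            return 1;                /* branches behind the guard that a     */
        else if (y < 100)
            return 2;                /* fuzzer can reach by mutating y once  */
        else
            return 3;                /* it is given a seed with x in {3, 4}  */
    }
    return 0;
}

int main(void) {
    int x = 0, y = 0;
    scanf("%d %d", &x, &y);
    return process(x, y);
}
```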
    An important FuSeBMC subsystem discussed in this article is the Tracer, which coordinates the bounded model checker and the various fuzzing engines. The Tracer monitors the test-cases produced by the fuzzers. It selects those with the highest impact (as measured by a couple of metrics discussed in Section 3) to act as seeds for future fuzzing rounds. Furthermore, as discussed above, ESBMC produces test-cases to cover particular branches. However, a test-case it produces may also cover branches other than the one targeted. To ascertain precisely which branches a test-case covers and thereby prevent ESBMC from running multiple times unnecessarily, the Tracer takes a test-case produced by ESBMC and runs the PUT on it, recording all goals covered.
    Bounded model checking can be slow and resource-intensive. To mitigate this, FuSeBMC does not use an off-the-shelf fuzzer for its greybox fuzzing but instead uses a modified version of the popular American Fuzzy Lop (AFL) tool. One of the features of this modified fuzzer is its ability to carry out lightweight static analysis of a program to recognize input validation. It analyzes the code for conditions on the input variables and ensures that seeds are only selected if they pass these conditions. This reduces the dependence on the computationally expensive bounded model checker for finding quality seeds. Another interesting feature of the modified fuzzer is that it heuristically identifies potentially infinite loops in the PUT. It then bounds these loops in an attempt to speed up fuzzing. These bounds are incremented across the multiple fuzzing rounds.
    Together, these features turn FuSeBMC into a leading fuzzer. In the 2022 edition of the Test-Comp software testing competition, FuSeBMC achieved first place in both the main categories, Cover-Error and Cover-Branches. In the Cover-Branches category, it achieved first place in 9 of the 16 subcategories it participated in. In the Cover-Error category, it achieved first place, or joint first place, in 8 out of the 14 subcategories that it participated in.
    Contributions. This journal article explains the latest developments to the FuSeBMC fuzzer. The work presented here substantially extends our previous published conference articles [4, 5, 6]. FuSeBMC’s main new features can be summarized as follows:
    The use of lightweight static analysis to recognize some forms of input validation on variables, thereby enabling fuzzing to produce more effective seeds and speed up the fuzzing process.
    The prioritization of deeper goals when searching for test-cases, as this can result in higher code coverage while generating fewer test-cases.
    The setting of a loop unwinding depth during seed generation and fuzzing. As loop unwinding leads to exponential path explosion, we restrict the unwinding depth of each loop to a small number, depending on an approximate estimate of the number of program paths.
    We also extend our previous articles by:
    Explaining the working of the FuSeBMC tool in greater depth and clarity than previously.
    Providing a detailed analysis of our participation in the international competition on software testing (Test-Comp 2022), where our tool FuSeBMC achieved three significant awards. FuSeBMC earned first place in all the categories thanks to the improvements described in this article. We also thoroughly compare version 4 of the tool with the previous iteration, version 3, thereby demonstrating the effectiveness of our extensions.

    2 Preliminaries

    This section briefly introduces various fuzzing techniques, including general greybox fuzzing, white box fuzzing, and hybrid fuzzing. In addition, it explains the various code coverage metrics in a simplified manner. Lastly, it introduces the technique of bounded model checking and its application to automated test generation.

    2.1 Fuzzing

    Fuzzing is one of the most effective software testing techniques for finding security vulnerabilities in software systems. It automatically generates inputs, i.e., test-cases, passes them to a target program, and checks for abnormal behavior or code coverage. The generated input should be well-formed and accepted by the software system to be considered a valid test-case. The generated inputs can be, for example, very long or completely blank strings, minimum/maximum integer values (or only zero and negative values), unique values, or characters likely to trigger bugs.
    Figure 1 illustrates the traditional fuzzing approach [55]. It comprises two main phases: (1) input/test generation and (2) program execution and monitoring. Generally, fuzzing starts with the target program and initial inputs, which can be any file format (images, text, videos, etc.), network communication data, or executable binaries. Generating malformed inputs is the main challenge for fuzzers. Commonly, two kinds of generators are employed in state-of-the-art fuzzers: generation-based generators, which create inputs from scratch, and mutation-based generators, which modify existing inputs. The generated inputs are then fed to the target program. During the execution of the PUT, the fuzzer monitors the execution state to detect crashes or abnormal behaviors. When the fuzzer detects a violation, it stores the related test-cases for later use and analysis.
    Fig. 1.
    Fig. 1. Traditional fuzzing workflow. The fuzzing process consists of repetitive generation of inputs for the PUT, executing the PUT with these inputs and reporting any crashes as they occur.
    As the primary goal of fuzzing is to find more crashes with applicable inputs, the most intuitive performance metric is the rate of crashes over time. However, crashes rarely occur, so many existing fuzzers are designed to maximize code coverage. They apply additional lightweight instrumentation to the PUT source code to enable coverage monitoring. By maximizing code coverage, the fuzzer tests more paths in the program, which increases the likelihood of triggering crashes [49]. Many undiscovered crashes occur in deep code paths, in parts of the code that are not executed frequently. Therefore, one of the critical roles of fuzzing is finding inputs that increase code coverage. The above briefly outlines the principles of coverage-guided greybox fuzzing.
    White-box fuzzing aims to aid the fuzzing process by extracting additional information about the PUT structure. It combines fuzz testing with symbolic execution [42]. A white-box fuzzer symbolically executes the PUT with initial concrete inputs, tracking every conditional statement and producing constraints over those inputs. The collected constraints capture how the program uses its inputs. The fuzzer then systematically negates each constraint and solves it using a constraint solver, creating new inputs that exercise different program paths [39]. For example, consider a symbolic variable i with an initial test-case where i is set to 0. If the program contains a branch condition such as “if (i == 10)”, executing this test-case takes the else-path and the symbolic execution records the path constraint \(i \ne 10\). By negating this constraint (obtaining \(i = 10\)) and solving it, a constraint solver produces a concrete value that drives the execution down the previously unexplored branch.
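    The branch discussed in this example corresponds to C code of the following shape (a minimal sketch under our own naming):

```c
#include <stdio.h>

/* Sketch of the branch discussed above. Starting from the concrete input
 * i = 0, symbolic execution records the path constraint i != 10 for the
 * else-path; negating it (i == 10) and solving yields a new input that
 * exercises the then-branch. */
int main(void) {
    int i = 0;
    scanf("%d", &i);
    if (i == 10)
        printf("then-branch reached\n");   /* covered only when i == 10 */
    else
        printf("else-branch reached\n");   /* covered by the initial i = 0 */
    return 0;
}
```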

    2.2 Code Coverage

    One of the challenges in software testing is assigning a quality measure to test-cases such that higher-quality test-cases detect more bugs than low-quality ones [47]. These quality measures are called test adequacy criteria, and code coverage is the most commonly used test adequacy criterion. Generally, code coverage identifies the parts of the PUT that are executed when running the program and thus measures the degree to which a test-case covers the PUT. Choosing an appropriate coverage criterion is paramount to successful test generation [9, 45]. One of the research directions in this area involves moving from standard coverage criteria towards generic specification languages (e.g., the Hyperlabel Test Objectives Language [57]) with higher expressive power for defining test objectives. However, one of the main challenges in applying such techniques is their unavailability in most off-the-shelf automated test generation tools.
    Some of the most popular coverage criteria supported by most test generators include statement coverage, branch coverage, and function coverage. Statement coverage determines how many program statements have been executed at least once. Branch coverage tracks how many conditional program branches (if statements, switch statements, loop statements) have been visited. Function coverage measures how many functions (out of the total number of functions in the PUT) have been invoked at least once during the program execution.
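    As a small illustrative example (ours, not taken from any benchmark), consider the program below and a single test with input -5. The test invokes both functions (100% function coverage), takes only the then-side of the branch in abs_val (50% branch coverage), and never executes the final return x; statement (incomplete statement coverage); adding a second test with a non-negative input covers everything.

```c
#include <stdio.h>

int abs_val(int x) {
    if (x < 0)
        return -x;   /* taken by the test input -5 */
    return x;        /* executed only by a non-negative input */
}

int main(void) {
    int v = 0;
    scanf("%d", &v);            /* the test-case supplies one integer */
    printf("%d\n", abs_val(v));
    return 0;
}
```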

    2.3 Bounded Model Checking

    Model checking [31] is a general verification technique that aims at determining whether a mathematical abstraction (in the form of a finite state transition system) of the underlying system satisfies the given specification (a set of properties formalized using temporal logic). A model-checking algorithm explores all reachable states of the system and checks whether the given properties hold in every state. If one of the properties fails, a counterexample—a sequence of system states leading to the failure—is produced.
    Bounded model checking [18] solves a similar problem but for the bounded executions of the system to be verified. In other words, given a positive k, a bounded model checking algorithm tries finding a counterexample of maximum length k. In practice, if no such counterexample can be found, the value of k can be increased until one of the given properties is violated or the verification problem becomes intractable.
    In software verification [30, 32], bounded model checking is typically accompanied by a symbolic execution of the given program up to the user-defined positive bound k. The obtained bounded symbolic traces are then automatically translated into a first-order logic formula C. Together with a formula P representing a property that needs to be checked in the verified program, it forms the satisfiability problem \(C \wedge \lnot P\). A counterexample to P has been discovered if this formula is satisfiable. Otherwise, it can be concluded that P holds up to the given bound k in the given program. In general, such satisfiability problems are solved by SAT/SMT solvers [33, 35, 60]. Many verification tools employing bounded model checking [37, 50] are capable of verifying functional correctness (i.e., program assertions) and code reachability, as well as automatically checking some implicit properties such as the absence of buffer overflows, dangling pointers, deadlocks, and data races. The exact set of such properties varies depending on the chosen verification tool.
    In the context of automated test generation for software, bounded model checking is used for obtaining a sequence of concrete input values (as well as a concrete thread schedule for multi-threaded programs) that allows reaching a predefined location (program statement) within the program. The desired testing objectives determine the choice of such program locations.
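    For instance, reachability-based test generation often marks the target location explicitly. The sketch below follows the convention used in Test-Comp-style benchmarks (a dedicated error function and nondeterministic inputs); the stub definitions are ours, added only so that the snippet is self-contained:

```c
#include <stdio.h>
#include <stdlib.h>

/* Stub definitions so the sketch compiles on its own; in the benchmarks these
 * functions are provided by the verification/testing environment. */
void reach_error(void) { fprintf(stderr, "error location reached\n"); abort(); }
int __VERIFIER_nondet_int(void) { int v = 0; scanf("%d", &v); return v; }

int main(void) {
    int a = __VERIFIER_nondet_int();
    int b = __VERIFIER_nondet_int();
    /* The bounded model checker searches for values of a and b that make the
     * call below reachable; the satisfying assignment becomes a test-case. */
    if (a > 0 && b == 2 * a + 1)
        reach_error();
    return 0;
}
```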

    2.4 Hybrid Fuzzing Techniques

    In recent years, hybrid fuzzing techniques have been widely researched and discussed in the software security field [3, 5, 72]. Since fuzz testing alone fails to explore the complete program state space, it is often combined with a complementary verification technique such as bounded model checking or concolic execution [73]. Hybrid fuzzing typically involves combining multiple fuzzing techniques, each with its strengths and weaknesses, to achieve better results. Some common fuzzing techniques that may be integrated into hybrid approaches include Generational Fuzzing, Mutation-Based Fuzzing, Symbolic Execution, Concolic Testing, and Grammar-Based Fuzzing.
    The main motivation behind combining such complementary techniques is to leverage the strengths of concolic execution and bounded model checking in generating inputs satisfying complex branch conditions, which are challenging to derive for mutation-based fuzzing. At the same time, fuzzing can quickly explore deep paths with simple checks that can offset the large resource consumption of concolic execution and bounded model checking.

    3 FuSeBMC V4 Framework

    FuSeBMC combines dynamic and static verification techniques for improving code coverage and bug discovery. It utilizes the Clang compiler [1] front-end to perform various code transformations, Efficient SMT-based Bounded Model Checking (ESBMC) [36, 37] as a BMC and symbolic execution engine, and a modified version of the AFL tool [2, 19] as well as a custom selective fuzzer [4] as fuzzing engines.
    FuSeBMC takes a C program as input and produces a set of test-cases, maximizing code coverage while also checking for various bugs. Users can choose to check for several types of bugs that are supported by ESBMC (such as array bounds violations, divisions by zero, pointer safety violations, arithmetic overflows, memory leaks, and other user-defined properties). Figure 2 illustrates the FuSeBMC architecture and its workflow, and Algorithm 1 presents the main stages of the FuSeBMC execution.
    Fig. 2.
    Fig. 2. The framework of FuSeBMC v4. This figure illustrates the main components of FuSeBMC. Our tool starts by instrumenting and analyzing the source code, then performs coverage analysis in two stages: seed generation and test generation.

    3.1 Overview

    FuSeBMC begins by analyzing the C code, injecting goal labels into the given program (based on the code coverage criteria that we introduce in Section 3.2.1), and ranking them according to one of the strategies described in Section 3.2.2 (i.e., depending on the goal’s origin or depth in the PUT). From then on, FuSeBMC’s workflow can be divided into two main stages: seed generation (the preliminary stage) and test generation (the full coverage analysis stage). During seed generation, FuSeBMC applies the fuzzers and BMC to the instrumented code once for a short time to produce seeds that are used by the fuzzers at the test generation stage, as well as test-cases that may already cover some “shallow” goals. The intuition behind this divide is to quickly generate some meaningful seeds for the fuzzer that increase the chances of exploring the PUT past the entry point, which often contains restrictive input validators that are hard for the fuzzers to negotiate. During test generation, the above engines are applied with a longer timeout while accompanied by another analysis engine called Tracer. It helps the execution of the fuzzers and the bounded model checker by recording which goal labels in the PUT have been covered by the test-cases produced by these engines. This prevents the computationally expensive BMC engine from trying to reach an already covered goal. FuSeBMC continues with the test generation stage until all goals are covered or a timeout is reached.
    In Figure 3, we introduce a short C program which we use as a running example to demonstrate the main code transformations throughout this section. The presented program accepts coefficients of a quadratic polynomial and an integer candidate solution in the range [1,100] as input from the user. It terminates successfully if the provided candidate solves the equation. However, the program returns an error if the given equation does not have real solutions or the input candidate value is outside the [1,100] range.
    Fig. 3.
    Fig. 3. An example of a) a C program, b) the corresponding instrumented code, and c) the resulting goals tree, their depth in the code, and resulting rank values.
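    Since the listing of Figure 3(a) is not reproduced in the text, the following is a minimal sketch of a program with the described behaviour; the variable names and exact control structure are our assumptions, not the figure’s actual code.

```c
#include <stdio.h>

/* Sketch only (not the exact listing of Figure 3(a)): read the coefficients
 * of a quadratic polynomial and a candidate solution, succeed if the
 * candidate solves the equation, and fail if the equation has no real
 * solutions or the candidate lies outside the range [1, 100]. */
int main(void) {
    int a, b, c, x;
    if (scanf("%d %d %d %d", &a, &b, &c, &x) != 4)
        return 1;
    if (b * b - 4 * a * c < 0)      /* no real solutions */
        return 1;
    if (x < 1 || x > 100)           /* candidate outside [1, 100] */
        return 1;
    if (a * x * x + b * x + c == 0)
        return 0;                   /* candidate solves the equation */
    return 1;
}
```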

    3.2 Code Instrumentation and Static Analysis

    At this stage, FuSeBMC instruments the PUT and performs multiple static analyses. It takes the PUT (i.e., a C program) and a property file as inputs and produces three files: the instrumented program, Goal Queue, and Consumed Input Size.

    3.2.1 Code Instrumentation.

    FuSeBMC uses the Clang tooling infrastructure [1] at its front-end to parse the input C program and traverse the resulting Abstract Syntax Tree (AST), recursively injecting goal labels into the PUT. This process is guided by the FuSeBMC code coverage criteria. Namely, FuSeBMC inserts labels inside conditional statements, loops, and functions as follows:
    For conditional statements: the label is inserted at the beginning of the block whether the statement is an if, else, or an instrumented empty else.
    For loops: the label is placed at the beginning of the loop body and right after exiting the loop.
    For functions: labels are injected at the beginning and at the end of the function body.
    Furthermore, FuSeBMC adds declarations for several standard C library functions, such as “printf”, “strcpy”, “memset” and other C language functions, to ensure that we cover the majority of the functions that we may encounter in large programs while also maintaining the proper operation of our approach. The resulting instrumented code that has the labels injected is functionally equivalent to the original C program. Figure 3(b) demonstrates an example of the described code instrumentation for the program in Figure 3(a).
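    To illustrate the rules above, the snippet below shows how goal labels might be injected (a hedged sketch; the exact label syntax produced by FuSeBMC may differ). Each goal label is an ordinary C label followed by an empty statement, so the instrumented program remains functionally equivalent to the original.

```c
/* Hedged sketch of the label injection described above. */
void example(int x) {
GOAL_1:;                 /* beginning of the function body */
    if (x > 0) {
GOAL_2:;                 /* beginning of the if-block */
        x--;
    } else {
GOAL_3:;                 /* beginning of the (possibly instrumented) else-block */
        x++;
    }
    while (x > 0) {
GOAL_4:;                 /* beginning of the loop body */
        x--;
    }
GOAL_5:;                 /* right after exiting the loop */
GOAL_6:;                 /* end of the function body */
    return;
}
```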

    3.2.2 Static Analysis.

    Apart from the required code instrumentation, Clang produces compilation error and warning messages and utilizes its static analyzer to simplify the input program (e.g., calculating the sizes of expressions, evaluating static asserts, and performing constant propagation) [51].
    Furthermore, the Clang static code analyzer produces the Consumed Input Size, which represents the minimum number of bytes in the input stream required for fuzzing. This information plays an important role in enhancing the fuzzing process (see Section 3.4.1).
    Another function of static analysis is to identify the ranges of the input variables. This is performed by collecting branch conditions that match a pattern \((x \circ val)\) , where x is the name of the variable, val is a numeric value, and \(\circ \in \lbrace \gt , \ge , \lt , \le , =, \not=\rbrace\) . Both fuzzers use this information to generate inputs only from the identified ranges.
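    For example (our own illustration), from the branch conditions in the snippet below the analysis would collect the constraints (x > 10) and (x <= 100), so the fuzzers would restrict the generated values of x to the range [11, 100]; the exact policy for combining such constraints into ranges is an assumption here.

```c
/* Illustrative only: branch conditions matching the (x o val) pattern. */
int check(int x) {
    if (x > 10) {           /* collected constraint: x > 10   */
        if (x <= 100) {     /* collected constraint: x <= 100 */
            return 1;       /* inputs of interest lie in 11..100 */
        }
    }
    return 0;
}
```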
    Finally, FuSeBMC analyzes the instrumented code and ranks the injected goal labels. Each goal label is attributed with its origin information (i.e., if statement, while loop, end of function) and its depth in the instrumented program. Then FuSeBMC sorts all goals using one of the two strategies (line 3 of Algorithm 1): (1) based on their depth (i.e., depth-first search), or (2) based on their rank scores calculated as a product of a goal’s depth and its power score—a value between 1 and 5 describing the goal’s branching power. Each power score was decided upon via experimental analysis. The if statement goals are assigned a score of 5 (goals 4, 8, and 10 in Figure 3(b)), the function goals—4 (goals 1 and 2 in Figure 3(b), note that main function goals are scored differently), the loop goals—3 (goal 6 in Figure 3(b)) and the else goals—2 (goal 5 in Figure 3(b)). All remaining types of goals (i.e., end-of-main (goal 3 in Figure 3(b)), empty-else (goals 9, 11 in Figure 3(b)), and after-loop goals (goal 7 in Figure 3(b)) are assigned a value of 1.
    In general, the goal sorting improves overall FuSeBMC performance. Using the depth-first strategy, FuSeBMC attempts to cover the deeper goals first. This is beneficial since all preceding goals on the path to a deep goal can be ignored during subsequent fuzzing as the same test-case covers them. On the other hand, the ranking strategy allows for the prioritization of conditional branches as they may lead to multiple goals that increase potential code coverage.
    Figure 3(c) features the resulting goals tree for the instrumented code from Figure 3(b) (with GOAL_0 representing the entry point of the program, i.e., the main function). Note that FuSeBMC builds it based on the original Clang AST without analyzing the code for trivially unreachable goals. For example, labels GOAL_7 and GOAL_3 can never be reached during the program’s execution. However, this will not be reflected in the goals tree.
    Each goal is assigned its greatest depth value. Therefore, labels GOAL_1, GOAL_7, and GOAL_3 are assigned depth values of 5, 7, and 8, respectively. When the first ranking strategy is applied, two goals at the same depth are ordered in the ascending order of their label names. Using the second ranking strategy, two goals with the same rank value are processed in the “power score first” manner. For example, GOAL_8 will be placed in front of GOAL_1 and GOAL_2 since it has a higher power score (5 vs. 4). Hence, FuSeBMC will process the goal labels in the orders {3,7,10,11,1,2,8,9,6,4,5} and {10,8,1,2,4,6,3,7,11,5,9} using the first and second sorting strategies, respectively. Finally, the list of goals is stored in the shared memory as the Goal Queue. This queue can be modified by the BMC and Tracer engines during the subsequent stages to remove the goal labels that have been covered.
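    A minimal sketch of the second (rank-based) sorting strategy is given below; the struct fields, function names, and the use of qsort are our own illustration, not FuSeBMC’s actual implementation.

```c
#include <stdlib.h>

/* Goals are ordered by rank = depth * power_score (higher rank first),
 * with ties broken by the higher power score ("power score first"). */
typedef struct {
    int id;      /* goal label number, e.g., 8 for GOAL_8          */
    int depth;   /* depth of the label in the instrumented program */
    int power;   /* branching power score between 1 and 5          */
} goal_t;

static int rank_of(const goal_t *g) { return g->depth * g->power; }

static int cmp_goals(const void *a, const void *b) {
    const goal_t *ga = a, *gb = b;
    int ra = rank_of(ga), rb = rank_of(gb);
    if (ra != rb)
        return rb - ra;               /* higher rank first       */
    return gb->power - ga->power;     /* tie: higher power first */
}

/* Usage: qsort(goals, n, sizeof(goal_t), cmp_goals); */
```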

    3.2.3 Shared Memory.

    The set of data files that each component of FuSeBMC has access to (both for reading and writing) is called the Shared Memory. Apart from the Instrumented Code, Consumed Input Size, and Goal Queue discussed above, it contains the Seeds Store—a collection of seeds used by the fuzzer for test generation, the Test-Cases—all test-cases generated by FuSeBMC, and the Goals Covered Array—a list of all goal labels that have been covered by the produced test-cases.

    3.3 Seed Generation

    Having ranked the goals, FuSeBMC carries out seed generation (line 5 of Algorithm 1; the procedure is described in detail in Algorithm 2) as a preliminary step before full coverage analysis (i.e., test generation) begins. In this phase, FuSeBMC simplifies the target program by limiting loop bounds and utilizes the information about the input ranges. Then FuSeBMC applies the fuzzer and the BMC engine in succession for a short time (5 and 15 seconds, respectively).
    Since the Seeds Store is empty at this point, FuSeBMC performs primary seed generation (line 1 of Algorithm 2) to enable the fuzzing process. This procedure involves generating binary seeds (i.e., streams of bytes) based on the Consumed Input Size and the input constraints collected during static analysis. In detail, it generates three sequences of bytes, where (1) all bytes have a value of 0, (2) all bytes have a value of 1, and (3) all byte values are drawn randomly from the identified input ranges. Then the fuzzer is initialized with the primary seeds and is run for a short time to produce test-cases that are then converted into new seeds and added to the Seeds Store (see Figure 2 and lines 2–6 in Algorithm 2).
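    A minimal sketch of this primary seed generation is shown below, assuming a single identified byte range [lo, hi] with lo <= hi; the function name and signature are ours, not FuSeBMC’s.

```c
#include <stdlib.h>
#include <string.h>

/* Fill three byte buffers of length n (the Consumed Input Size): all zeros,
 * all ones, and random values drawn from the identified range [lo, hi]. */
void make_primary_seeds(unsigned char *s0, unsigned char *s1, unsigned char *s2,
                        size_t n, unsigned char lo, unsigned char hi) {
    memset(s0, 0, n);                                  /* seed 1: all bytes 0 */
    memset(s1, 1, n);                                  /* seed 2: all bytes 1 */
    for (size_t i = 0; i < n; i++)                     /* seed 3: random      */
        s2[i] = (unsigned char)(lo + rand() % (hi - lo + 1));
}
```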
    When the seed generation by fuzzing is finished, FuSeBMC executes the BMC engine for each goal label in the Goal Queue. To minimize the execution time, it is run with “lighter” settings: all implicit checks (i.e., memory safety, arithmetic overflows) and assertion checks are disabled, and the bound for loop unwinding is reduced. If a goal label is reached successfully, the BMC engine produces a witness—a set of program inputs that lead to that goal label. The sequence of these input values is added as a seed to the Seeds Store.
    All new seeds produced by the fuzzer and the BMC engine are deemed smart due to their powerful effect on code coverage. Conceptually, bounded model checkers use SMT solvers to produce test-cases that resolve complex branch conditions (i.e., guards). Such guards (for example, lines 5 and 12 in Figure 3(a)) pose a challenge to a fuzzer [72] as it relies on mutating the given seed randomly and is therefore unlikely to satisfy the branch condition. Seeds produced by BMC help solve this issue since they can be passed to a fuzzer, which can then advance deeper behind the complex guards into the target program (which is usually hard for a bounded model checker).

    3.4 Test Generation

    Following seed generation, FuSeBMC begins the main coverage analysis phase (lines 7–30 of Algorithm 1). FuSeBMC incorporates three engines to carry out this analysis: two fuzzers (the main fuzzer and the selective fuzzer) and a bounded model checker. Here, the main fuzzer and the BMC engine are run with longer timeouts than during the seed generation stage. Briefly, the fuzzers utilize the smart seeds produced at the previous stage and generate test-cases by randomly mutating the program’s input and running it to analyze code coverage. The bounded model checker determines the reachability of particular goal labels similarly to the seed generation stage. FuSeBMC’s Tracer component aids the above engines by performing a lightweight replay-based analysis of the produced test-cases and updating the Shared Memory. In the following subsections, we discuss the FuSeBMC components involved in coverage analysis in greater detail.

    3.4.1 Main Fuzzer.

    In FuSeBMC, we implement a modified version of the AFL tool [2]. The modified AFL generates test-cases based on the evolutionary algorithm implemented in AFL [19].
    The standard algorithm implemented in the AFL tool works as follows. Firstly, an initial fixed-size input stream is generated using the provided seed (a random seed is used if not explicitly specified). Secondly, the target program is repeatedly executed with the randomly mutated input. If the target program does not reach any new states after multiple input mutation rounds, a byte is added to or removed from the input stream, and the mutation process restarts. The above algorithm continues until an internal timeout is reached or the fuzzer finds inputs that fully cover the program. In general, AFL’s mutation algorithm relies heavily on the quality of the initial seeds to achieve high code coverage. Therefore, generating seeds with higher coverage potential is crucial.
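    The following is a heavily simplified sketch of this evolutionary loop; the helpers are stubs standing in for AFL’s real mutation, execution, and coverage bookkeeping, and the loop is bounded by a fixed budget instead of AFL’s internal timeout.

```c
#include <stdlib.h>

#define MAX_LEN 4096   /* capacity of the input buffer in this sketch */

/* Stub: run the PUT on `buf` and report whether new coverage was observed. */
static int run_target(const unsigned char *buf, size_t len) {
    (void)buf; (void)len;
    return rand() % 8 == 0;
}

/* Stub: flip one random bit of the input in place. */
static void mutate(unsigned char *buf, size_t len) {
    if (len)
        buf[rand() % len] ^= (unsigned char)(1u << (rand() % 8));
}

void fuzz_loop(unsigned char *input, size_t len,
               int rounds_per_size, long budget) {
    while (budget-- > 0 && len < MAX_LEN) {
        int progressed = 0;
        for (int i = 0; i < rounds_per_size; i++) {
            mutate(input, len);
            progressed |= run_target(input, len);
        }
        if (!progressed)          /* no new program states reached:         */
            input[len++] = 0;     /* resize the input and restart mutation  */
    }
}
```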
    FuSeBMC modifies the original AFL fuzzer as follows:
    (1)
    It performs additional instrumentation to the PUT to minimize its execution overhead by limiting the bounds of loops heuristically identified as potentially infinite. Note that these bounds can be iteratively changed between the AFL runs.
    (2)
    The mutation operators are modified to generate only inputs from the ranges identified during the static code analysis.
    (3)
    It controls the size of the generated test-cases via the Consumed Input Size. In detail, the minimum size of the test-cases produced by the fuzzer is set to the current value of the consumed input size. This allows counteracting the size selection bias of the AFL mutation algorithms, which tend to favor a reduction of the number of bytes in the generated test-cases (instead of adding extra bytes) between the mutation rounds. At the same time, the modified fuzzer can control the maximum size of the produced test-cases. For example, when the Consumed Input Size starts growing gradually during the fuzzing process (a behavior often observed in programs accepting input in an infinite loop), the maximum test-case size is set to prevent performance degradation.
    (4)
    It outputs the list of goals covered by the produced test-cases and records them in Goals Covered Array.

    3.4.2 Bounded Model Checker.

    FuSeBMC uses ESBMC to check for the reachability of a given goal label within the instrumented program (lines 16–25 of Algorithm 1). If it concludes that the current goal is reachable, it produces a counterexample that can be turned into a witness—a sequence of inputs that leads the program’s execution to that goal label—which is then used to generate a test-case. Every new test-case thus discovered is also added to the Seeds Store to be used by the fuzzers. Even if the BMC engine runs out of time or memory, its progress in reducing the input ranges is saved as an incomplete seed—a sequence of input values that leads the PUT execution part of the way towards the given goal label.

    3.4.3 Tracer.

    The Tracer subsystem determines the goals covered by the test-cases produced by the bounded model checker and the fuzzer. Whenever a test-case is produced, Tracer compiles the instrumented program together with the newly generated test-case and runs the resulting executable. Before the compilation, it performs additional instrumentation on the test-case to output information about the PUT input size, the types of input variables, and the visited goals. This information is dynamically updated in the Shared Memory (i.e., the Goals Covered Array and the Consumed Input Size).
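    A hedged sketch of this replay step is given below: the recorded input values are fed to the PUT through a replacement input function while visited goals are appended to a file standing in for the Goals Covered Array. File names, function names, and the file-based bookkeeping are our assumptions, not FuSeBMC’s actual interface.

```c
#include <stdio.h>

static FILE *testcase;   /* assumed file holding the recorded input values   */
static FILE *covered;    /* assumed file standing in for Goals Covered Array */

/* Feeds the recorded inputs to the PUT during replay. */
int replay_nondet_int(void) {
    int v = 0;
    if (testcase && fscanf(testcase, "%d", &v) != 1)
        v = 0;
    return v;
}

/* Called right after each goal label in the instrumented PUT. */
void mark_goal(int id) {
    if (covered)
        fprintf(covered, "GOAL_%d\n", id);
}

int main(void) {
    testcase = fopen("testcase.txt", "r");
    covered  = fopen("goals_covered.txt", "a");
    /* ... run the instrumented PUT here, e.g., a renamed put_main() ... */
    if (testcase) fclose(testcase);
    if (covered)  fclose(covered);
    return 0;
}
```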
    The Tracer also analyzes the test-cases produced by the other two engines to add the highest impact cases (i.e., the test-cases leading to new goals or reaching the maximum analysis depth) to the Seeds Store.
    Another responsibility of Tracer is to handle the partial output of the bounded model checker when it reaches the timeout and outputs an incomplete counterexample. Tracer completes such counterexamples with random values, performs the coverage analysis, and updates the Seeds Store as described above.

    3.4.4 Selective Fuzzer.

    The selective fuzzer’s [4] main function is to attempt to reach the remaining uncovered goals after the iterative process of applying the main fuzzer and the BMC engine has finished. Similarly to the main fuzzer, it utilizes information about the identified ranges of the input variables to produce inputs for the PUT. At the same time, it implements a complementary test generation approach: it produces random values from the given input ranges, in contrast to the mutation-based approach used in the main fuzzer. The selective fuzzer terminates upon covering all the remaining goals or reaching the timeout.

    4 Evaluation

    4.1 Description of Benchmarks and Setup

    To assess the performance of FuSeBMC v4, we evaluated its participation in Test-Comp 2022 [14], and also compared it to the results obtained by the previous version of the tool, FuSeBMC v3, in Test-Comp 2021 [13].
    Test-Comp is a software testing competition where the participating tools compete in automated test-case generation. All test-case generation tasks in Test-Comp are divided into two categories: Cover-Branches and Cover-Error. The former requires producing a set of test-cases that maximize code coverage (in particular, branch coverage) for the given C program. The latter deals with error coverage: generating a test-case that leads to the predefined error location (i.e., explicitly marked error function) within the given C program. In Cover-Branches, code coverage is measured by the TestCov [17] tool, which assigns a score between 0 and 1 for each task. For example, if a competing tool achieves 80% code coverage on a particular task, it is assigned a score of 0.8 for that task and so forth. Overall scores for the subcategories are calculated by summing the individual scores for each task in the subcategory and rounding the result. In Cover-Error, each tool earns a score of 1 if it can provide a test-case that reaches the error function and gets a 0 score otherwise. Each category is further divided into multiple subcategories (see Tables 1 and 3) based on the most prominent program features and/or the program’s origin. Most programs in Test-Comp are taken from SV-COMP [12]—the largest and most diverse open-source repository of software verification tasks. It contains hand-crafted and real-world C programs with loops, arrays, bit-vectors, floating-point numbers, dynamic memory allocation, and recursive functions, event-condition-action software, concurrent programs, and BusyBox2 software.
    Table 1.
    Subcategory         % average coverage              Improvement \(\Delta \%\)
                        FuSeBMC v4      FuSeBMC v3
    Arrays              82%             71%             11%
    BitVectors          80%             60%             20%
    ControlFlow         64%             22%             42%
    ECA                 37%             17%             20%
    Floats              54%             46%             8%
    Heap                73%             62%             11%
    Loops               81%             71%             10%
    ProductLines        29%             -               -
    Recursive           85%             68%             18%
    Sequentialized      87%             76%             11%
    XCSP                90%             82%             8%
    Combinations        61%             7%              53%
    BusyBox             34%             1%              32%
    DeviceDrivers       20%             12%             8%
    SQLite-MemSafety    4%              0%              4%
    Termination         92%             87%             5%
    Cover-Branches      61%             45%             16%
    Table 1. Comparison of the Average Coverage (Per Subcategory and the Category Overall) Achieved by FuSeBMC v4 and FuSeBMC v3 in the Cover-Branches Category in TestComp-2022 and TestComp-2021, Respectively
    Table 2.
    Task name                                                       % coverage                      Improvement \(\Delta \%\)
                                                                    FuSeBMC v4      FuSeBMC v3
    pals_lcr.3.1.ufo.BOUNDED-6.pals+Problem12_label01.yml           94.90%          13.30%          81.60%
    pals_lcr.3.1.ufo.UNBOUNDED.pals+Problem12_label02.yml           84.40%          5.19%           79.21%
    pals_lcr.4.1.ufo.BOUNDED-8.pals+Problem12_label04.yml           94.10%          4.44%           89.66%
    pals_lcr.4_overflow.ufo.UNBOUNDED.pals+Problem12_label05.yml    94.00%          11.50%          82.50%
    pals_lcr.5.1.ufo.UNBOUNDED.pals+Problem12_label05.yml           86.20%          0.78%           85.42%
    pals_lcr.5_overflow.ufo.UNBOUNDED.pals+Problem12_label09.yml    94.00%          4.82%           89.18%
    pals_lcr.6.1.ufo.BOUNDED-12.pals+Problem12_label09.yml          92.90%          5.18%           87.72%
    pals_lcr.7_overflow.ufo.UNBOUNDED.pals+Problem12_label09.yml    92.60%          5.31%           87.29%
    pals_lcr.8.ufo.UNBOUNDED.pals+Problem12_label08.yml             78.20%          8.17%           70.03%
    Average value                                                   90.14%          6.52%           83.62%
    Table 2. Comparison of Code Coverage Achieved by FuSeBMC v4 and FuSeBMC v3 in a Subset of Tasks from the Combinations Subcategory
    Table 3.
    Subcategory         % errors detected               Improvement \(\Delta \%\)
                        FuSeBMC v4      FuSeBMC v3
    Arrays              99%             93%             6%
    BitVectors          100%            100%            0%
    ControlFlow         100%            25%             75%
    ECA                 72%             44%             28%
    Floats              100%            97%             3%
    Heap                95%             80%             14%
    Loops               93%             83%             10%
    ProductLines        100%            -               -
    Recursive           95%             95%             0%
    Sequentialized      95%             94%             1%
    XCSP                88%             90%             -2%
    BusyBox             15%             0%              15%
    DeviceDrivers       0%              0%              0%
    Cover-Error         81%             67%             14%
    Table 3. Comparison of the Percentages of the Successfully Detected Errors (Per Subcategory and the Category Overall) by FuSeBMC v4 and FuSeBMC v3 in the Error Coverage Category in TestComp-2022 and TestComp-2021, Respectively
    Both Test-Comp 2021 and Test-Comp 2022 evaluations were conducted on servers featuring an 8-core (4 physical cores) Intel Xeon E3-1230 v5 CPU @ 3.4 GHz, 33 GB of RAM and running x86-64 Ubuntu 20.04 with Linux kernel 5.4. Each test suite generation run was limited to 8 CPU cores, 15 GB of RAM, and 15 minutes of CPU time, while each test suite validation run was limited to 2 CPU cores, 7 GB of RAM, and 5 minutes of CPU time. In 2021, FuSeBMC distributed its allocated time to its various engines as follows. The fuzzer received 150 s as a time limit when running on benchmarks from the Cover-Error category and 70 s on benchmarks from the Cover-Branches category. The bounded model checker received 700 s and 780 s on benchmarks from the two categories, respectively. Finally, the selective fuzzer received 50 s for benchmarks from both categories. In 2022, these figures were tweaked. The seed generation received 20 s for benchmarks from both categories. The fuzzer received 200 s and 250 s on benchmarks from Cover-Error and Cover-Branches, respectively, the bounded model checker 650 s and 600 s, and the allocation for the selective fuzzer was decreased from 50 s to 30 s from the previous year.
    Although the hardware setup remained unchanged across the two competition editions, the set of test generation tasks was significantly expanded. Namely, the task set in Test-Comp 2021 consisted of 3,173 tasks: 607 in the Cover-Error category, and 2,566 in the Cover-Branches category. By contrast, Test-Comp 2022 was expanded to contain 4,236 test tasks: 776 in the Cover-Error category and 3,460 in the Cover-Branches category (including a new subcategory ProductLines introduced into both categories). We have considered this when discussing the performance of two versions of FuSeBMC in Section 4.3.1. A detailed report of the results produced by the competing tools in both Test-Comp 20213 and Test-Comp 20224 is available online.
    FuSeBMC source code is written in C++ and Python; it is available for download from GitHub.5 The latest release of FuSeBMC is v4.1.14. FuSeBMC is publicly available under the terms of the MIT license. Instructions for building FuSeBMC from the source code are given in the file README.md.

    4.2 Objectives

    The main goal of our experimental evaluation is to assess the improvements of FuSeBMC v4 and its suitability for achieving high code coverage and error coverage in open-source C programs. As a result, we identify three key evaluation objectives:
    O1
    (Performance Improvement) Demonstrate that FuSeBMC v4 outperforms FuSeBMC v3 in both code coverage and error coverage.
    O2
    (Coverage Capacity) Demonstrate that FuSeBMC v4 achieves higher code coverage for C programs than other state-of-the-art software testing tools.
    O3
    (Error Detection) Demonstrate that FuSeBMC v4 finds more errors in C programs than other state-of-the-art software testing tools.

    4.3 Results

    4.3.1 FuSeBMC v4 vs. FuSeBMC v3.

    Tables 1 and 3 compare the performance of FuSeBMC v4 and FuSeBMC v3 in the Cover-Branches and Cover-Error categories of Test-Comp, respectively. FuSeBMC v3 achieved first place in Cover-Error, fourth place in Cover-Branches, and placed second overall in Test-Comp 2021, while FuSeBMC v4 reached first place in both categories and overall in Test-Comp 2022. However, considering that the test generation task set was significantly expanded in Test-Comp 2022, we analyze their relative performance in each subcategory. Namely, in Cover-Branches, we compare average branch coverage. In Cover-Error, we compare the percentages of successfully detected errors achieved by both tools in every subcategory and the entire category, as well as the improvements of FuSeBMC v4 in comparison to FuSeBMC v3.
    Table 1 shows that FuSeBMC v4 improved in every subcategory, with an overall average improvement of 16% in the Cover-Branches category in comparison to FuSeBMC v3. The greatest increase (i.e., 53%) was demonstrated in the Combinations subcategory. FuSeBMC v3 achieved eighth place in this subcategory in Test-Comp 2021, while FuSeBMC v4 reached first place in Combinations in Test-Comp 2022. We attribute this success to the modifications in the seed generation phase of FuSeBMC v4 (in particular, the introduction of smart seeds). Table 2 presents a subset of generation tasks from the Combinations subcategory where FuSeBMC v4 demonstrated the most striking improvement. It can be seen that FuSeBMC v3 provided very low code coverage of \(\sim 6.52\%\) for these tasks on average, while FuSeBMC v4 increased this number to \(\sim 90.14\%\) (i.e., an 83.62% average improvement).
    As for the Cover-Error category, FuSeBMC v4 progressed by \(14\%\) on average in comparison to FuSeBMC v3 (see Table 3). FuSeBMC v4 improved in the majority of subcategories while showing no change in three of them: both FuSeBMC versions achieved the highest possible result of 100% in BitVectors, FuSeBMC v4 did not improve beyond 95% of detected errors in Recursive, and, like FuSeBMC v3, it could not identify any errors in DeviceDrivers. Also, FuSeBMC v4 demonstrated a performance degradation of 2% in the XCSP subcategory. This degradation is attributable to certain programs that redefine C library functions and contain numerous conditional statements, resulting in a slowdown in seed generation and excessive utilization of the tool’s resources.
    Additionally, we compared the performance of FuSeBMC v4 utilizing smart seeds with a version of FuSeBMC v4 using only primary seeds (i.e., all zeros, all ones, and randomly chosen values) on the ECA (event-condition-action systems) subcategory in Cover-Error (where FuSeBMC v4 demonstrated a 28% improvement in comparison to FuSeBMC v3 in the competition settings; see Table 3). It contains 18 test-case generation tasks with C programs featuring input validation that involves relatively complex mathematical expressions. Such a program feature is notoriously difficult for fuzzers whose initial seed is based on a random choice. Table 4 presents the results obtained by the versions of FuSeBMC v4 with smart seeds and with primary seeds. It can be seen that smart seeds allow detecting five more bugs than the version of FuSeBMC using only primary seeds.
    Table 4.
    Task name                               FuSeBMC v4
                                            Smart Seeds     Primary Seeds
    eca-rers2012/Problem05_label00.yml      TRUE            TRUE
    eca-rers2012/Problem06_label00.yml      TRUE            TRUE
    eca-rers2012/Problem11_label00.yml      TRUE            TRUE
    eca-rers2012/Problem12_label00.yml      TRUE            TRUE
    eca-rers2012/Problem15_label00.yml      TRUE            TRUE
    eca-rers2012/Problem16_label00.yml      TRUE            UNKNOWN
    eca-rers2012/Problem18_label00.yml      TRUE            TRUE
    eca-rers2018/Problem10.yml              TRUE            TRUE
    eca-rers2018/Problem11.yml              TRUE            TRUE
    eca-rers2018/Problem12.yml              TRUE            UNKNOWN
    eca-rers2018/Problem13.yml              TRUE            UNKNOWN
    eca-rers2018/Problem14.yml              TRUE            UNKNOWN
    eca-rers2018/Problem15.yml              TRUE            UNKNOWN
    eca-rers2018/Problem16.yml              UNKNOWN         UNKNOWN
    eca-rers2018/Problem17.yml              UNKNOWN         UNKNOWN
    eca-rers2018/Problem18.yml              UNKNOWN         UNKNOWN
    eca-programs/Problem101_label00.yml     UNKNOWN         UNKNOWN
    eca-programs/Problem103_label32.yml     UNKNOWN         UNKNOWN
    Table 4. Comparison of FuSeBMC v4 Performance with Smart Seeds and with Primary Seeds, where TRUE Shows that the Bug Has Been Detected Successfully and UNKNOWN Otherwise
    Overall, the results presented in Tables 1 and 3 provide sufficient evidence that the evaluation objective O1 has been achieved.

    4.3.2 FuSeBMC v4 vs. State-of-the-art.

    FuSeBMC v4 achieved the overall first place at Test-Comp 2022, obtaining a score of 3,003 out of 4,236, with the closest competitor, VeriFuzz [28], scoring 2,971, and significantly outperforming several state-of-the-art tools such as LibKluzzer [53], KLEE [22], CPAchecker [16], and Symbiotic [24] (see Table 5).
    Table 5.
    Total # tasksTool
    CMA-ES FuzzCoVeriTest v2.0.1FuSeBMC v4.1.14HybridTiger v1.9.2KLEE v2.2Legion v1.0Legion/SymCCLibKluzzer v1.0PRTest v2.2Symbiotic v9.0Tracer-X v1.2.0VeriFuzz v1.2.10
    4,2363822,2933,0031,8302,1257872,6589452,3671,0692,971
    Table 5. Test-Comp 2022 Overall Results6
    The table illustrates the scores obtained by all state-of-the-art tools overall, where we identify the best tool in bold.
    Table 6 demonstrates the code coverage capabilities of FuSeBMC v4 in comparison to other state-of-the-art software testing tools. It can be seen that FuSeBMC achieved first place with an overall score of 2,104 out of 3,460. FuSeBMC participated in all 16 subcategories, in nine of which (i.e., Arrays, BitVectors, Floats, Heap, Loops, ProductLines, Recursive, Combinations, and Termination) it achieved first place and in six of which it reached second place. The results presented in Table 6 allow us to conclude that the evaluation objective O2 has been achieved.
    Table 6.
      Tool
    SubcategoryTotal # tasksCMA-ES FuzzCoVeriTest v2.0.1FuSeBMC v4.1.14HybridTiger v1.9.2KLEE v2.2Legion v1.0Legion/SymCCLibKluzzer v1.0PRTest v2.2Symbiotic v9.0Tracer-X v1.2.0VeriFuzz v1.2.10
    Arrays400159257328247104210263323160250243307
    BitVectors62264949163334454833494948
    ControlFlow675364314251741406433944
    ECA290611273310210812
    Floats2265311312292197262103465556119
    Heap14323100104849381781044998101101
    Loops727211574591467380357542575359538544587
    ProductLines263197777567470747748697777
    Recursive53254145392126274311454041
    Sequentialized1030799058351437511515791
    XCSP1190116107119102210211810211496110
    Combinations6716323840116719617922429279338295351
    BusyBox7501225621002415191829
    DeviceDrivers29013605962556475716425657
    SQLite-MemSafety1000000000000
    Termination23114321221319511816814520460179192202
    Cover-Branches3,4606241,8602,1041,4061,2421,0331,4871,9908961,8021,7462,075
    Table 6. Cover-Branches Category Results at Test-COMP 20228
    The best score for each subcategory is highlighted in bold.
    Similarly, Table 7 demonstrates the error detecting abilities of FuSeBMC v4. In particular, FuSeBMC achieved first place in nine subcategories (i.e., Arrays, BitVectors, ControlFlow, Floats, Heap, Loops, ProductLines, Recursive, and BusyBox) reaching the first overall place in this category with the result of 628 out of 776 ( \(\sim 81\) % success rate). Overall, the results show that FuSeBMC produces test-cases that detect more security vulnerabilities in C programs than state-of-the-art tools, which successfully demonstrates that the evaluation objective O3 has been achieved.
    Table 7.
    SubcategoryTotal # tasksTool
    CMA-ES FuzzCoVeriTest v2.0.1FuSeBMC v4.1.14HybridTiger v1.9.2KLEE v2.2Legion v1.0Legion/SymCCLibKluzzer v1.0PRTest v2.2Symbiotic v9.0Tracer-X v1.2.0VeriFuzz v1.2.10
    Arrays10007399698967973774099
    BitVectors1008106901058010
    ControlFlow32018321627027024030
    ECA180313113011014015
    Floats330253323603030032
    Heap560495342523531353053
    Loops15707514653954136102810142
    ProductLines16901601695316934169921590169
    Recursive200719516017117016
    Sequentialized10706110292860810790104
    XCSP5905052523705041055
    BusyBox1300201000002
    DeviceDrivers200000000000
    Cover-Error77600628355500575281454630623
    Table 7. Cover-Error Category Results at Test-Comp 20229
    The best score for each subcategory is highlighted in bold.

    5 Related Work

    In this section, we overview related work. Most related techniques fall into one of the following categories: fuzzing, symbolic execution, or a combination of both.

    5.1 Fuzzing

    Barton Miller [10] proposed fuzzing at the University of Wisconsin in the 1990s, and it has since become a popular technique for detecting software vulnerabilities [73]. One of the most common fuzzing tools is AFL [2, 19]. AFL is a coverage-based fuzzer that was built to find software vulnerabilities. AFL relies on an evolutionary approach to learn mutations by measuring code coverage. By employing genetic algorithms with guided fuzzing, AFL yields high code coverage. Another tool is LibFuzzer [69], which uses code coverage information generated by LLVM’s SanitizerCoverage instrumentation to produce test-cases. LibFuzzer is best suited for testing libraries that take small inputs, run within milliseconds per input, and do not crash on invalid inputs.7 Vuzzer [66] is a fuzzer with an application-aware strategy. The main advantage of this strategy is that it does not need any knowledge of the application or input format in advance.
    To maximize coverage and explore deeper paths, the tool leverages control- and data-flow features based on static and dynamic analysis to infer fundamental properties of the application. This enables a much faster generation of interesting inputs than an application-agnostic approach. Wang et al. [74] proposed Skyfire, an approach that utilizes data-driven seed generation. It relies on extracting grammar knowledge to generate well-distributed seed inputs for fuzzing programs. Skyfire uses a probabilistic context-sensitive grammar (PCSG) to capture syntax features and semantic rules. AFLFast [20] is an enhanced version of AFL applying various strategies to exercise low-frequency paths. The tool achieved a 7x speedup over AFL [20].
    GTFuzz [56] is a tool that prioritizes inputs based on extracting the syntax tokens that guard the target location; a backward static analysis technique extracts these tokens. GTFuzz also benefits from this extraction by improving its mutation algorithm. Smart grey-box fuzzing (SGF) [65] employs a high-level structural representation of the original seeds to generate high-impact seeds. Similarly, AFLSmart [65] is a structure-aware fuzzer that combines the Peach fuzzing engine with the AFL fuzzer. Instrim [46] is a Control Flow Graph (CFG)-aware fuzzer. It analyzes the CFG of the PUT in an attempt to instrument fewer code blocks, thereby speeding up the fuzzing. AutoFuzz [43] is a tool that utilizes fuzzing to verify network protocols: it begins by inferring the protocol specification and then finds vulnerabilities by fuzzing. Peach Fuzzer [76] is an approach that sends random input to a PUT in an attempt to find security vulnerabilities. It is frequently used to detect security vulnerabilities in input validation and application logic. Various other fuzzers and fuzzing techniques have been developed, each with unique features. For example, directed greybox fuzzing [19] uses simulated annealing in an attempt to guide the fuzzer to explore a particular section of the PUT. SYMFUZZ [23] controls the selection of paths, while Rebert et al.’s approach [67] uses guided seed selection.
    One of the weaknesses of pure fuzzing approaches is their inability to find test-cases that explore program code beyond complex guards, as they essentially work by randomly mutating seeds and, therefore, struggle to find inputs that satisfy the guards.

    5.2 Symbolic Execution and Bounded Model Checking

    Symbolic execution and BMC have shown competence in producing high-coverage test-cases and detecting errors in complex software. One of the more popular symbolic execution engines is KLEE [22]. KLEE is a tool that explores the search space path-by-path by utilizing the LLVM compiler infrastructure and dynamic symbolic execution. KLEE has been utilized in many specialized tools for its reliability as a symbolic execution engine. For example, the Symbiotic 8 [27] symbolic execution tool is built on top of KLEE and adds plugins, such as Predator [34] and DG [25], to fulfill different functions. The latest addition to the tool is Slowbeast [26], which incorporates the k-induction technique into symbolic execution. Furthermore, Tracer [48] is a verification tool that uses constraint logic programming (CLP) and interpolation methods. Another tool based on symbolic execution is DART [40]. It conducts software analysis and applies automatic random testing to find software bugs. BAP [21] is developed on top of Vine [71], which relies on symbolic execution; it offers useful analysis and verification techniques and relies on an intermediate language (IL) in its analysis. Also, SymbexNet [70] and SymNet [68] target the verification of network protocol implementations. Avgerinos et al. [7] presented an approach that enhances symbolic execution with verification-based algorithms to increase the effect of dynamic symbolic execution. The approach showed its ability to detect bugs and achieve higher code coverage than other dynamic symbolic execution approaches. CoVeriTest [15] is a cooperative, verifier-based test generation approach that applies different conditional model checkers in iterations with various value-analysis configurations. CoVeriTest also varies the level of cooperation and assigns the time budget of each verifier. However, symbolic execution suffers from the path explosion problem related to loops and arrays, impacting its practicality.

    5.3 Combination

    The combination of symbolic execution and BMC with fuzzing has recently been used to exploit the strengths of both techniques. For example, VeriFuzz [28] is a state-of-the-art tool that we have previously compared to FuSeBMC. Its authors describe it as a program-aware fuzz tester that combines feedback-driven evolutionary fuzz testing with static analysis; it also employs grey-box fuzzing to exploit lightweight instrumentation for observing the behaviors that occur during test runs. VeriFuzz earned first place in Test-Comp 2020 [11]. Driller [72] is a hybrid vulnerability excavation tool developed by Stephens et al. It finds deeply embedded bugs by leveraging guided fuzzing and concolic execution in a complementary manner: concolic execution analyzes the program, traces the inputs, and guides the fuzzer onto different paths using a constraint-solving engine. To combine the strengths of the two techniques and mitigate path explosion in the concolic analysis, Driller first splits the application into compartments based on checks of particular values of a specific input and then relies on fuzzing to explore the possible values of general input within a compartment. Although Driller has shown that it can detect more vulnerabilities, it can still run into path explosion and requires substantial computing power. MaxAFL [49] is best described as a gradient-based fuzzer built on top of AFL. It first computes the Maximum Expectation of Instruction Count (MEIC) using lightweight static analysis, derives an objective function from it, and then applies a gradient-based optimization algorithm that generates efficient inputs by minimizing that objective function. Hybrid Fuzz Testing [63] efficiently generates provably random test-cases that are guaranteed to execute unique paths: it uses symbolic execution to find frontier nodes that lead to such paths, collects as many frontier nodes as resource constraints allow, and then employs fuzzing with provably random inputs preconditioned to reach each frontier node.
    He et al. [44] proposed an approach that learns a fuzzer from symbolic execution. It phrases the learning task in the framework of imitation learning: symbolic execution generates high-quality, high-coverage inputs, and a neural-network-based fuzzer learns from them so that it can later fuzz new programs. Badger [61] provides a hybrid testing approach for complexity analysis: it generates new inputs using Symbolic PathFinder [64] and supplies the Kelinci fuzzer with worst-case analysis, aiming to increase both coverage and the resource-related cost of each path through fuzz testing. LibKluzzer [53] is a novel implementation that combines symbolic execution and fuzzing; its strength resides in fusing coverage-guided fuzzing with white-box fuzzing, and it is constructed from LibFuzzer and an extension of KLEE called KLUZZER [52]. Munch [62] is an open-source hybrid framework that employs fuzzing with seed inputs generated by symbolic execution and switches to targeted symbolic execution when fuzzing saturates; it aims to reduce the number of queries to the SMT solver by focusing on the paths that may reach uncovered functions, and was created to improve function coverage. FairFuzz [54] is a grey-box fuzzer that uses coverage-guided mutation: it employs a mutation mask for every pair of seed and rare branch to direct the fuzzing towards each rare branch. SAFL [75] is an efficient fuzzer for C/C++ programs that uses lightweight symbolic execution to generate initial seeds that steer fuzzing in a promising direction. Scalable Automated Guided Execution (SAGE), proposed by Godefroid et al. [41], is a hybrid fuzzer developed at Microsoft Research, where it is used extensively and has found many security-related bugs. It employs generational search to extend dynamic symbolic execution and increases code coverage by negating and solving path predicates; it also builds on the random testing style of DART, mutating good inputs using a grammar.
    Overall, the combination of fuzzers and symbolic execution to verify software has been successful. Our approach builds on this combination by utilizing smart seed generation, the Tracer subsystem, and the other unique features described above.
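    The practical glue in such combinations is usually small: a concrete input vector produced by the BMC or symbolic engine is serialized as a raw file and dropped into the fuzzer's seed corpus. The sketch below is our own illustration of that step (file name, directory layout, and byte values are placeholders, not FuSeBMC's actual implementation).

        #include <stdint.h>
        #include <stdio.h>

        /* Write a raw seed file that a greybox fuzzer such as AFL can pick up
         * from its input corpus directory. */
        static int write_seed(const char *path, const uint8_t *bytes, size_t len) {
            FILE *f = fopen(path, "wb");
            if (!f)
                return -1;
            size_t written = fwrite(bytes, 1, len, f);
            fclose(f);
            return written == len ? 0 : -1;
        }

        int main(void) {
            /* Placeholder for the engine's counterexample, e.g. the input that
             * satisfies a guard the fuzzer could not pass on its own. */
            const uint8_t counterexample[] = {0xDE, 0xC0, 0x37, 0x13};
            return write_seed("corpus/seed_from_bmc.bin",
                              counterexample, sizeof counterexample) == 0 ? 0 : 1;
        }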

    6 Conclusion

    In this article, we presented FuSeBMC v4, a test generator that relies on smart seed generation to improve the state-of-the-art in hybrid fuzzing and achieve high coverage for C programs. First, FuSeBMC analyzes the given C program and injects goal labels into it. Then, it ranks these goal labels according to the chosen strategy. After that, the engines are employed for a short time to produce smart seeds. Finally, FuSeBMC starts test generation by running the bounded model checking engine alongside the fuzzers. Throughout, the Tracer subsystem manages the tool: it records the goals covered so far and transfers information between the engines through a shared memory, so that the strengths of each engine are exploited. Thus, the BMC engine can provide seeds that stop the fuzzing engine from struggling with complex mathematical guards. Furthermore, the Tracer dynamically evaluates test-cases to convert high-impact cases into seeds for subsequent fuzzing. We evaluated this approach by participating in the fourth international competition on software testing, Test-Comp 2022, where FuSeBMC demonstrated its effectiveness by achieving first place in the Cover-Branches, Cover-Error, and Overall categories. This performance is due to various features of our tool, most importantly the following. First, the generation of smart seeds, which harness the power of the fuzzers and allow them to fuzz deeper. Second, the simplification of the target program by limiting the bounds of potentially infinite loops, which avoids the path explosion problem and produces seeds faster. Third, the use of static analysis to manage the mutation process by limiting the range of values input variables can take, which speeds up fuzzing. In the future, we plan to extend the tool to handle other classes of programs, such as multi-threaded programs. Furthermore, we are working with the SCorCH project to improve our performance in detecting memory safety bugs by incorporating SoftBoundCETS [59] into FuSeBMC.
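    For completeness, the snippet below sketches the goal-label idea summarized above; it is illustrative only, and the exact label format and injection strategy used by FuSeBMC may differ. Each injected label marks a branch, and a test-case covers a goal when execution passes through the corresponding label.

        #include <stdio.h>

        /* Branch instrumented with reachability goals (illustrative). */
        int classify(int x) {
            if (x > 10) {
                GOAL_1:;   /* goal: true branch reached */
                return 1;
            } else {
                GOAL_2:;   /* goal: false branch reached */
                return 0;
            }
        }

        int main(void) {
            /* A test suite covering both goals would contain inputs such as 42 and 3. */
            printf("%d %d\n", classify(42), classify(3));
            return 0;
        }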

    References

    [1]
    2015. Clang Documentation. Retrieved August, 2019 from http://clang.llvm.org/docs/index.html
    [2]
    2021. American Fuzzy Lop. Retrieved 10 Nov. 2021 from https://lcamtuf.coredump.cx/afl/
    [3]
    F. K. Aljaafari, R. Menezes, E. Manino, F. Shmarov, M. A. Mustafa, and L. C. Cordeiro. 2022. Combining BMC and fuzzing techniques for finding software vulnerabilities in concurrent programs. IEEE Access 10 (2022), 121365–121384.
    [4]
    Kaled M. Alshmrany, Mohannad Aldughaim, Ahmed Bhayat, and Lucas C. Cordeiro. 2021. FuSeBMC: An energy-efficient test generator for finding security vulnerabilities in C programs. In Proceedings of the International Conference on Tests and Proofs (TAP). Springer, 85–105.
    [5]
    Kaled M. Alshmrany, Mohannad Aldughaim, Ahmed Bhayat, and Lucas C. Cordeiro. 2022. FuSeBMC v4: Smart seed generation for hybrid fuzzing. In Proceedings of the 25th International Conference on Fundamental Approaches to Software Engineering (FASE). Springer, 336–340.
    [6]
    Kaled M. Alshmrany, Rafael S. Menezes, Mikhail R. Gadelha, and Lucas C. Cordeiro. 2020. FuSeBMC: A white-box fuzzer for finding security vulnerabilities in C programs. In Proceedings of the 24th International Conference on Fundamental Approaches to Software Engineering (FASE). Springer, 363–367.
    [7]
    Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley. 2014. Enhancing symbolic execution with veritesting. In Proceedings of the 36th International Conference on Software Engineering (ICSE). ACM, 1083–1094.
    [8]
    Roberto Baldoni, Emilio Coppa, Daniele Cono D’elia, Camil Demetrescu, and Irene Finocchi. 2018. A survey of symbolic execution techniques. ACM Computing Surveys 51, 3, Article 50 (2018), 39 pages.
    [9]
    Sébastien Bardin, Nikolai Kosmatov, and François Cheynier. 2014. Efficient leveraging of symbolic execution to advanced coverage criteria. In Proceedings of the 7th International Conference on Software Testing, Verification and Validation (ICST). IEEE, 173–182.
    [10]
    James H. Barton, Edward W. Czeck, Zary Z. Segall, and Daniel P. Siewiorek. 1990. Fault injection experiments using FIAT. IEEE Transactions on Computers 39, 4 (1990), 575–582.
    [11]
    Dirk Beyer. 2020. Second competition on software testing: Test-comp 2020. In Proceedings of the International Conference on Fundamental Approaches to Software Engineering (FASE). Heike Wehrheim and Jordi Cabot (Eds.), Springer, 505–519.
    [12]
    Dirk Beyer. 2021. Software verification: 10th comparative evaluation (SV-COMP 2021). In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 401–422.
    [13]
    Dirk Beyer. 2021. Status report on software testing: Test-comp 2021. In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 505–519.
    [14]
    Dirk Beyer. 2022. Advances in automatic software testing: Test-comp 2022. In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 341–357.
    [15]
    Dirk Beyer and Marie-Christine Jakobs. 2019. CoVeriTest: Cooperative verifier-based testing. In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 389–408.
    [16]
    Dirk Beyer and M. Erkan Keremoglu. 2011. CPAchecker: A tool for configurable software verification. In Proceedings of the International Conference on Computer Aided Verification (CAV). Springer, 184–190.
    [17]
    Dirk Beyer and Thomas Lemberger. 2019. TestCov: Robust test-suite execution and coverage measurement. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1074–1077.
    [18]
    Armin Biere. 2009. Bounded model checking. In Proceedings of the Handbook of Satisfiability. Armin Biere, Marijn Heule, Hans van Maaren, and Toby Walsh (Eds.), Frontiers in Artificial Intelligence and Applications, Vol. 185, IOS Press, 457–481.
    [19]
    Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2329–2344.
    [20]
    Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2017. Coverage-based greybox fuzzing as markov chain. IEEE Transactions on Software Engineering 45, 5 (2017), 489–506.
    [21]
    David Brumley, Ivan Jager, Thanassis Avgerinos, and Edward J. Schwartz. 2011. BAP: A binary analysis platform. In Proceedings of the International Conference on Computer Aided Verification (CAV). Springer, 463–469.
    [22]
    Cristian Cadar, Daniel Dunbar, and Dawson R. Engler. 2008. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the Operating Systems Design and Implementation (OSDI). USENIX Association, 209–224.
    [23]
    Sang Kil Cha, Maverick Woo, and David Brumley. 2015. Program-adaptive mutational fuzzing. In Proceedings of the IEEE Symposium on Security and Privacy. IEEE, 725–741.
    [24]
    Marek Chalupa, Jakub Novák, and Jan Strejček. 2021. Symbiotic 8: Parallel and targeted test generation. In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 368–372.
    [25]
    Marek Chalupa. 2020. DG: Analysis and slicing of LLVM bitcode. In Proceedings of the Automated Technology for Verification and Analysis (ATVA). Springer, 557–563.
    [26]
    Marek Chalupa. 2021. Slowbeast. Retrieved September 30, 2021 from https://gitlab.fi.muni.cz/xchalup4/slowbeast/
    [27]
    Marek Chalupa, Tomáš Jašek, Jakub Novák, Anna Řechtáčková, Veronika Šoková, and Jan Strejček. 2021. Symbiotic 8: Beyond symbolic execution. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 453–457.
    [28]
    Animesh Basak Chowdhury, Raveendra Kumar Medicherla, and R. Venkatesh. 2019. VeriFuzz: Program aware fuzzing. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 244–249.
    [29]
    Chris Evans, Matt Moore, and Tavis Ormandy. 2011. Fuzzing at Scale. Retrieved February 10, 2023 from https://security.googleblog.com/2011/08/fuzzing-at-scale.html
    [30]
    Edmund Clarke, Daniel Kroening, and Flavio Lerda. 2004. A tool for checking ANSI-C programs. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 168–176.
    [31]
    Edmund M. Clarke, Orna Grumberg, and Doron A. Peled. 1999. Model Checking. MIT Press, London, Cambridge.
    [32]
    Lucas C. Cordeiro, Bernd Fischer, and João Marques-Silva. 2012. SMT-Based bounded model checking for embedded ANSI-C software. IEEE Transactions on Software Engineering 38, 4 (2012), 957–974.
    [33]
    Leonardo de Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 337–340.
    [34]
    Kamil Dudka, Petr Peringer, and Tomáš Vojnar. 2011. Predator: A practical tool for checking manipulation of dynamic data structures using separation logic. In Proceedings of the Computer Aided Verification (CAV). Springer, 372–378.
    [35]
    Bruno Dutertre. 2014. Yices 2.2. In Proceedings of the Computer Aided Verification (CAV). Springer, 737–744.
    [36]
    Mikhail R. Gadelha, Rafael Menezes, Felipe R. Monteiro, Lucas C. Cordeiro, and Denis Nicole. 2020. ESBMC: Scalable and precise test generation based on the floating-point theory (competition contribution). In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 525–529.
    [37]
    Mikhail R. Gadelha, Felipe Monteiro, Lucas Cordeiro, and Denis Nicole. 2019. ESBMC v6.0: Verifying C programs using k-induction and invariant inference. In Proceedings of the Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 209–213.
    [38]
    Mikhail R. Gadelha, Felipe R. Monteiro, Jeremy Morse, Lucas C. Cordeiro, Bernd Fischer, and Denis A. Nicole. 2018. ESBMC 5.0: An industrial-strength C model checker. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 888–891.
    [39]
    Patrice Godefroid, Adam Kiezun, and Michael Y. Levin. 2008. Grammar-based whitebox fuzzing. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 206–215.
    [40]
    Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 213–223.
    [41]
    Patrice Godefroid, Michael Y. Levin, and David A. Molnar. 2012. SAGE: Whitebox fuzzing for security testing: SAGE has had a remarkable impact at Microsoft. Queue 10, 1 (2012), 20–27.
    [42]
    Patrice Godefroid, Michael Y. Levin, and David Molnar. 2008. Automated whitebox fuzz testing. In Proceedings of the Network and Distributed System Security Symposium (NDSS). 151–166. https://www.ndss-symposium.org/ndss2008/automated-whitebox-fuzz-testing/
    [43]
    Serge Gorbunov and Arnold Rosenbloom. 2010. Autofuzz: Automated network protocol fuzzing framework. International Journal of Computer Science and Network Security 10, 8 (2010), 239.
    [44]
    Jingxuan He, Mislav Balunović, Nodar Ambroladze, Petar Tsankov, and Martin Vechev. 2019. Learning to fuzz from symbolic execution with application to smart contracts. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 531–548.
    [45]
    Hadi Hemmati. 2015. How effective are code coverage criteria?. In Proceedings of the IEEE International Conference on Software Quality, Reliability, and Security (QRS). IEEE, 151–156.
    [46]
    Chin-Chia Hsu, Che-Yu Wu, Hsu-Chun Hsiao, and Shih-Kun Huang. 2018. Instrim: Lightweight instrumentation for coverage-guided fuzzing. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), Workshop on Binary Analysis Research.
    [47]
    Marko Ivanković, Goran Petrović, René Just, and Gordon Fraser. 2019. Code coverage at google. In Proceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (FSE). ACM, 955–963.
    [48]
    Joxan Jaffar, Vijayaraghavan Murali, Jorge A. Navas, and Andrew E. Santosa. 2012. TRACER: A symbolic execution tool for verification. In Proceedings of the International Conference on Computer Aided Verification (CAV). Springer, 758–766.
    [49]
    Youngjoon Kim and Jiwon Yoon. 2020. MaxAFL: Maximizing code coverage with a gradient-based optimization technique. Electronics 10, 1 (2020), 11.
    [50]
    Daniel Kroening and Michael Tautschnig. 2014. CBMC – C bounded model checker. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer, 389–391.
    [51]
    Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis and transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). IEEE, 75–88.
    [52]
    Hoang M. Le. 2019. KLUZZER: Whitebox fuzzing on top of LLVM. In Proceedings of the Automated Technology for Verification and Analysis (ATVA). Springer, 246–252.
    [53]
    Hoang M. Le. 2020. LLVM-based hybrid fuzzing with LibKluzzer (competition contribution). In Proceedings of the Fundamental Approaches to Software Engineering (FASE). Springer, 535–539.
    [54]
    Caroline Lemieux and Koushik Sen. 2018. Fairfuzz: A targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE). IEEE, 475–485.
    [55]
    Jun Li, Bodong Zhao, and Chao Zhang. 2018. Fuzzing: A survey. Cybersecurity 1, 1 (2018), 1–13.
    [56]
    Rundong Li, HongLiang Liang, Liming Liu, Xutong Ma, Rong Qu, Jun Yan, and Jian Zhang. 2020. GTFuzz: Guard token directed grey-box fuzzing. In Proceedings of the IEEE 25th Pacific Rim International Symposium on Dependable Computing (PRDC). IEEE, 160–170.
    [57]
    Michaël Marcozzi, Mickaël Delahaye, Sébastien Bardin, Nikolai Kosmatov, and Virgile Prevosto. 2017. Generic and effective specification of structural test objectives. In Proceedings of the IEEE International Conference on Software Testing, Verification and Validation (ICST). IEEE, 436–441.
    [58]
    Xianya Mi, Baosheng Wang, Yong Tang, Pengfei Wang, and Bo Yu. 2020. SHFuzz: Selective hybrid fuzzing with branch scheduling based on binary instrumentation. Applied Sciences 10, 16 (2020), 5449.
    [59]
    Santosh Nagarakatte, Jianzhou Zhao, Milo M. K. Martin, and Steve Zdancewic. 2009. SoftBound: Highly compatible and complete spatial memory safety for C. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, 245–258.
    [60]
    Aina Niemetz, Mathias Preiner, and Armin Biere. 2014. Boolector 2.0. Journal on Satisfiability, Boolean Modeling and Computation 9, 1 (2014), 53–58.
    [61]
    Yannic Noller, Rody Kersten, and Corina S. Păsăreanu. 2018. Badger: Complexity analysis with fuzzing and symbolic execution. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM, 322–332.
    [62]
    Saahil Ognawala, Thomas Hutzelmann, Eirini Psallida, and Alexander Pretschner. 2018. Improving function coverage with munch: A hybrid fuzzing and directed symbolic execution approach. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (SAC). ACM, 1475–1482.
    [63]
    Brian S. Pak. 2012. Hybrid Fuzz Testing: Discovering Software Bugs via Fuzzing and Symbolic Execution. Master’s thesis. School of Computer Science Carnegie Mellon University.
    [64]
    Corina S. Păsăreanu and Neha Rungta. 2010. Symbolic pathfinder: Symbolic execution of java bytecode. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE). ACM, 179–180.
    [65]
    Van-Thuan Pham, Marcel Böhme, Andrew E. Santosa, Alexandru Răzvan Căciulescu, and Abhik Roychoudhury. 2019. Smart greybox fuzzing. IEEE Transactions on Software Engineering 47, 9 (2019), 1980–1997.
    [66]
    Sanjay Rawat, Vivek Jain, Ashish Kumar, Lucian Cojocar, Cristiano Giuffrida, and Herbert Bos. 2017. VUzzer: Application-aware evolutionary fuzzing. In Proceedings of the Symposium on Network and Distributed System Security (NDSS). 1–14.
    [67]
    Alexandre Rebert, Sang Kil Cha, Thanassis Avgerinos, Jonathan Foote, David Warren, Gustavo Grieco, and David Brumley. 2014. Optimizing seed selection for fuzzing. In Proceedings of the 23rd USENIX Security Symposium (USENIX Security 14). USENIX Association, 861–875.
    [68]
    Raimondas Sasnauskas, Philipp Kaiser, Russ Lucas Jukić, and Klaus Wehrle. 2012. Integration testing of protocol implementations using symbolic distributed execution. In Proceedings of the International Conference on Network Protocols (ICNP). IEEE, 1–6.
    [69]
    K. Serebryany. 2016. Continuous fuzzing with libFuzzer and AddressSanitizer. In Proceedings of the IEEE Cybersecurity Development (SecDev). IEEE, 157–157.
    [70]
    JaeSeung Song, Cristian Cadar, and Peter Pietzuch. 2014. SYMBEXNET: Testing network protocol implementations with symbolic execution and rule-based specifications. IEEE Transactions on Software Engineering 40, 7 (2014), 695–709.
    [71]
    Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang, Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena. 2008. BitBlaze: A new approach to computer security via binary analysis. In Proceedings of the International Conference on Information Systems Security (ISS). Springer, 1–25.
    [72]
    Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2016. Driller: Augmenting fuzzing through selective symbolic execution. In Proceedings of the Network and Distributed System Security Symposium (NDSS). 1–16.
    [73]
    Michael Sutton, Adam Greene, and Pedram Amini. 2007. Fuzzing: Brute Force Vulnerability Discovery. Pearson Education.
    [74]
    Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. 2017. Skyfire: Data-driven seed generation for fuzzing. In Proceedings of the Symposium on Security and Privacy (SP). IEEE, 579–594.
    [75]
    Mingzhe Wang, Jie Liang, Yuanliang Chen, Yu Jiang, Xun Jiao, Han Liu, Xibin Zhao, and Jiaguang Sun. 2018. SAFL: Increasing and accelerating testing coverage with symbolic execution and guided fuzzing. In Proceedings of the 40th International Conference on Software Engineering: Companion (ICSE-Companion). IEEE, 61–64.
    [76]
    Dianbo Zhang, Jianfei Wang, and Hua Zhang. 2015. Peach improvement on profinet-DCP for industrial control system vulnerability detection. In Proceedings of the 2015 2nd International Conference on Electrical, Computer Engineering, and Electronics. Atlantis Press, 1622–1627.
