
DatAFLow: Toward a Data-Flow-Guided Fuzzer

Published: 21 July 2023
    Abstract

    Coverage-guided greybox fuzzers rely on control-flow coverage feedback to explore a target program and uncover bugs. Compared to control-flow coverage, data-flow coverage offers a more fine-grained approximation of program behavior. Data-flow coverage captures behaviors not visible as control flow and should intuitively discover more (or different) bugs. Despite this advantage, fuzzers guided by data-flow coverage have received relatively little attention, appearing mainly in combination with heavyweight program analyses (e.g., taint analysis, symbolic execution). Unfortunately, these more accurate analyses incur a high run-time penalty, impeding fuzzer throughput. Lightweight data-flow alternatives to control-flow fuzzing remain unexplored.
    We present datAFLow, a greybox fuzzer guided by lightweight data-flow profiling. We also establish a framework for reasoning about data-flow coverage, allowing the computational cost of exploration to be balanced with precision. Using this framework, we extensively evaluate datAFLow across different precisions, comparing it against state-of-the-art fuzzers guided by control flow, taint analysis, and data flow.
    Our results suggest that the ubiquity of control-flow-guided fuzzers is well-founded. The high run-time costs of data-flow-guided fuzzing (~10× higher than control-flow-guided fuzzing) significantly reduce fuzzer iteration rates, adversely affecting bug discovery and coverage expansion. Despite this, datAFLow uncovered bugs that state-of-the-art control-flow-guided fuzzers (notably, AFL++) failed to find. This was because data-flow coverage revealed states in the target not visible under control-flow coverage. Thus, we encourage the community to continue exploring lightweight data-flow profiling; specifically, to lower run-time costs and to combine this profiling with control-flow coverage to maximize bug-finding potential.

    1 Introduction

    Fuzzers are an indispensable item in the software-testing toolbox. The idea of fuzzing—to test a target program by subjecting it to a large number of inputs—can be traced back to an assignment in a graduate Advanced Operating Systems class [49]. These fuzzers were relatively primitive (compared to a modern fuzzer): they simply fed a randomly-generated input to the target, failing the test if the target crashed or hung. They did not model program or input structure, and only observed the input/output behavior of the target. In contrast, modern fuzzers use sophisticated program analysis to model program and input structure, and continuously gather dynamic information about the target.
    Exploiting this dynamic information drives fuzzer efficiency. For example, coverage-guided greybox fuzzers—perhaps the most widely-used class of fuzzer—track code paths executed by the target.1 This allows the fuzzer to focus its mutations on inputs that reach new code. Intuitively, a fuzzer can only find bugs in code it executes, so maximizing the amount of code covered should implicitly maximize the number of bugs found. Code coverage also approximates program behavior: expanding code coverage implies exploring new or different program behaviors.
    Coverage-guided greybox fuzzers are now pervasive. Their success [56] is attributable to one greybox fuzzer in particular: American Fuzzy Lop (AFL) [73]. AFL uses lightweight instrumentation to track edges covered in the target’s control-flow graph (CFG). A large body of research has built on AFL [3, 7, 8, 16, 21, 24, 33, 43, 70], and while improvements have been made, most fuzzers still default to edge coverage as an approximation of program behavior. Is this the best we can do?
    In some targets, control flow offers only a coarse-grained approximation of program behavior. This includes targets whose control structure is decoupled from their semantics (e.g., LR parsers generated by yacc) [71]. Such targets require data-flow coverage [11, 22, 27, 34, 55, 62, 71] to accurately capture program behavior. Whereas control flow focuses on the order of operations in a program (i.e., branch and loop structures), data flow instead focuses on how variables (i.e., data) are defined and used [55]; indeed, there may be no control dependence between definition and use sites (see Section 3 for details).
    In fuzzing, data flow typically takes the form of dynamic taint analysis (DTA), in which the target’s input data is tainted at its definition site and tracked as it is accessed and used at run time. Unfortunately, accurate DTA is difficult to achieve and expensive to compute (e.g., prior work has found DTA is expensive [23, 60] and its accuracy highly variable across implementations [15, 60]). Moreover, several real-world programs fail to compile under DTA, increasing deployability concerns. Thus, most widely-deployed greybox fuzzers (e.g., AFL [73], libFuzzer [42], and honggfuzz [65]) eschew DTA in favor of higher execution rates.
    While lightweight alternatives to DTA exist (e.g., Redqueen [5], GreyOne [23]), the full potential of control- vs. data-flow fuzzer coverage metrics remains to be thoroughly explored. To support this exploration, we present datAFLow, a greybox fuzzer that tracks a program’s data flow (rather than control flow) without requiring DTA. Notably, our work performs data-flow analysis inline with the execution, directly guiding the fuzzer. This is in contrast to prior work (e.g., GreyOne), which performed post hoc trace analysis in an attempt to infer or approximate data flow. Unlike DTA, which strives for accuracy, we take inspiration from popular greybox fuzzers (e.g., AFL) and embrace some imprecision to reduce overhead and thus maximize fuzzing throughput.
    We perform a large-scale evaluation (>3 CPU-yr) of datAFLow’s effectiveness, comparing it against three state-of-the-art fuzzers. Our evaluation on the Magma benchmark [26] shows that, while generally outperformed by control-flow-guided fuzzers, datAFLow uncovers bugs that these fuzzers fail to find. This is because data-flow coverage revealed states in the target not visible in the CFG. Curiously, this is despite the control-flow-guided fuzzers achieving more control- and data-flow coverage (on targets previously identified as being amenable to data-flow-guided fuzzing [47]). We determined the run-time costs of data flow tracking to be the root cause of this result; intuitively, the cost of data-flow-guided fuzzing is not recoverable in targets where data flow mostly follows control flow. We encourage the community to continue exploring data-flow-guided fuzzing to maximize bug discovery.

    Summary of Contributions

    We contribute the following, making our work available at https://github.com/HexHive/datAFLow:
    (1)
    A framework for reasoning about and constructing data-flow coverage metrics for greybox fuzzing (Section 4).
    (2)
    A new data-flow-guided fuzzer, datAFLow, to explore data flow in a target program with low overhead (Section 5).
    (3)
    An extensive evaluation and comparison of representative fuzzers guided by control flow, taint analysis, and data flow (Section 6).

    2 Background & Related Work

    2.1 Fuzzing

    Fuzzing is a dynamic analysis for finding bugs in a target program by subjecting it to random inputs. Coverage-guided greybox fuzzers—the most popular class of fuzzer—do not just blindly feed these random inputs into the target. Instead, they use a feedback loop based on a coverage metric. This feedback loop guides the fuzzer toward generating inputs that explore new behaviors of the target (as determined by the coverage metric).
    Figure 1 illustrates the architecture of a typical coverage-guided greybox fuzzer. The user provides (a) an instrumented program called the fuzzing target, and (b) an optional set of starting inputs called seeds (an empty seed is used if not provided [29]).
    Fig. 1. High-level overview of a typical greybox fuzzer.
    The fuzzer places the seeds into a queue and then: (i) selects a seed from the queue; (ii) mutates the seed (via bit-flipping, value substitution, etc.); (iii) executes the target with the mutated seed, storing coverage (or an approximation thereof) in a coverage map; and (iv) detects crashes and newly-discovered coverage in the target (saving the former for offline analysis and discarding the seed, or, in the latter case, returning the seed to the queue for further exploration by mutation). This process repeats until the residual risk of a missed bug falls beneath a suitable threshold [6].
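    As a rough illustration, the loop in steps (i) through (iv) can be sketched as a short, self-contained C program. The sketch below is ours (it is not taken from AFL or any other fuzzer); the toy target, fixed-size inputs, and coverage map are deliberately simplistic stand-ins for real instrumentation and seed scheduling.

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define MAP_SIZE 64
        static uint8_t cov[MAP_SIZE];                /* per-execution coverage map */

        /* Toy "instrumented target": rewards matching the prefix "FUZZ" byte by
         * byte, and "crashes" once the whole prefix matches. */
        static int toy_target(const uint8_t *buf, size_t len) {
            static const char magic[] = "FUZZ";
            size_t i;
            for (i = 0; i < 4 && i < len && buf[i] == (uint8_t)magic[i]; i++)
                cov[i]++;                            /* one map entry per matched byte */
            return i == 4;                           /* crash indicator */
        }

        int main(void) {
            uint8_t queue[64][8] = { "seed" };       /* (i) queue, bootstrapped with one seed */
            size_t n_seeds = 1;
            uint8_t global_cov[MAP_SIZE] = { 0 };

            for (int iter = 0; iter < 1000000; iter++) {
                uint8_t cur[8];
                memcpy(cur, queue[rand() % n_seeds], sizeof cur);          /* (i) select   */
                cur[rand() % sizeof cur] ^= (uint8_t)(1u << (rand() % 8)); /* (ii) mutate  */
                memset(cov, 0, sizeof cov);
                int crashed = toy_target(cur, sizeof cur);                 /* (iii) execute */

                int new_cov = 0;                                           /* (iv) feedback */
                for (size_t i = 0; i < MAP_SIZE; i++)
                    if (cov[i] && !global_cov[i]) { global_cov[i] = 1; new_cov = 1; }

                if (crashed) {
                    printf("crash found after %d execs\n", iter);          /* save for offline analysis */
                    break;
                }
                if (new_cov && n_seeds < 64)
                    memcpy(queue[n_seeds++], cur, sizeof cur);             /* keep interesting seed */
            }
            return 0;
        }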

    2.2 Data-flow Analysis

    Data-flow analysis typically refers to a collection of techniques for reasoning about the run-time flow of values in a program. These techniques can be static—such as those used by compilers for liveness analysis, constant propagation, and reaching definition analysis—or dynamic. Dynamic data-flow analysis is an approach adopted in software testing for reasoning about the sequence of actions performed on data (i.e., program variables) at run time [13, 31, 32]. These actions are typically analyzed in terms of the interactions between a variable’s definition—or def site—and how that variable is used at one or more use sites [55, 62]. Data flows between these definition and usage sites are known as def-use chains.
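    For illustration, the following C snippet (our own example, not drawn from the studies cited above) contains one def site and two use sites, giving two def-use chains even though both uses lie on the same straight-line control-flow path:

        #include <stdio.h>

        int main(void) {
            int x = 42;             /* def site of x                    */
            int doubled = x * 2;    /* use site 1: first def-use chain  */
            if (doubled > 10)
                printf("%d\n", x);  /* use site 2: second def-use chain */
            return 0;
        }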
    Empirical studies have shown the effectiveness of data-flow coverage metrics over control-flow metrics when developing software tests [22, 27, 34, 55, 62] and comparing program executions [64]. However, to the best of our knowledge, these data-flow techniques have not yet been explored by the fuzzing community.

    2.3 Related Work

    Fuzzing is an active area of research. Consequently, we focus on recent work related to coverage metrics for fuzzing. We summarize the fuzzers discussed below in Table 1, comparing them to our datAFLow fuzzer (described further in Sections 4 and 5).
    Fuzzer     Feedback       Manual analysis   “Exact” DTA   “Appx.” DTA
    Angora     CS edge                          ✓
    GreyOne    Edge                                           ✓
    Confetti   Edge                             ✓             ✓
    Ijon       Edge + value   ✓
    InvsCov    Edge + value
    DDFuzz     Edge + DDG
    GraphFuzz  Edge           ✓
    datAFLow   def-use

    Table 1. Survey of Related Fuzzers
    CS = context sensitive. Ijon requires manual analysis to identify variables to track. Confetti uses exact and approximate DTA to provide both global and local hints, respectively.
    The most popular fuzzers are those guided by code coverage [44]. Typically, this code coverage is based on a target’s control-flow graph (CFG) and is measured at either basic block or edge granularities. While edge coverage is typically considered more sensitive than basic-block coverage, as we shall see in Section 3, it is not without its issues. Indeed, TortoiseFuzz showed that basic-block coverage is effective when paired with other coverage metrics that increase sensitivity (e.g., function call and loop coverage) [70].
    To improve mutation precision, some fuzzers use dynamic taint analysis (DTA) to track input bytes. The fuzzer uses this information to infer which bytes to mutate. Unfortunately, DTA suffers from accuracy and performance issues [15, 36, 60], limiting deployment. To overcome performance issues, Angora [12] amortizes DTA cost by limiting its application to once per input (over many mutations). Other fuzzers avoid DTA in favor of approximate taint tracking; e.g., Redqueen [5] uses input-to-state correspondence, based on the idea that “parts of the input directly correspond to the memory or registers at run time”. Similarly, GreyOne [23] infers taint by monitoring the value of variables as input bytes are mutated, while Confetti [39] uses concolic execution to overcome missing data-flow relationships and implicit flows (see Section 3).
    Alternatives to code coverage metrics are also being explored. MemFuzz [16] and AFL-Sensitive [69] augment edge coverage with memory access information. In theory, this approach allows the fuzzer to distinguish between executions that cannot be distinguished by control flow alone. In practice, this approach leads to saturation of the fuzzer’s coverage map.
    To give more say to the human analyst (e.g., to prevent coverage map saturation), Ijon [4] introduced an annotation mechanism for tracking key state variables in the coverage map (e.g., Mario’s \(x\) and \(y\) coordinates in the game Super Mario Bros.). This approach overcame fuzzer roadblocks that automated approaches could not.
    InvsCov [20] augments code coverage with the value of and relationships between key program variables. These variables are based on likely invariants (i.e., invariants that hold for a set of dynamic traces but may not hold for all inputs); the violation of a likely invariant indicates “interesting” program behavior (and is recorded in the coverage map).
    DDFuzz [47] also augments code coverage with data flows between program variables. Here, data flows are derived from the target’s data dependency graph (DDG). DDGs describe the data flows between instructions in a program and are traditionally used by optimizing compilers [19]. Like InvsCov, DDFuzz only considers a subset of program variables (to prevent state explosion and coverage-map saturation): variable def sites are restricted to load and alloca instructions in the LLVM intermediate representation (IR), while variable uses are restricted to store and call instructions. Further filtering is applied to discard data flows subsumed by edge coverage.
    GraphFuzz [25] fuzzes library APIs by modeling sequences of executed functions as a data flow graph. Using a data flow graph and control-flow-based coverage feedback, GraphFuzz generates fuzzing harnesses that explore a greater range of API combinations.
    Despite the body of work on fuzzer coverage metrics, pure data flow coverage remains an under-explored metric. This is likely due to the perceived run-time cost of measuring data flow [20, 69]. Nevertheless, we hypothesize lightweight data-flow tracking is possible. To this end, we introduce datAFLow, a data-flow-guided greybox fuzzer with a tunable sensitivity range.

    3 Motivating Data-flow Coverage

    A fuzzer’s coverage metric should accurately capture/approximate program behavior with minimal run-time overheads. Here we discuss why control-flow-based metrics are insufficient to accurately capture program behavior, using Figure 2 as a running example.
    Fig. 2. Motivating example. The Sieve of Eratosthenes for finding all prime numbers up to max value.
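    The figure itself is not reproduced here. The following C sketch is our reconstruction, written only to be consistent with the line numbers referenced in the surrounding text (Line 3: prime’s def; Lines 3, 4, 6, and 8: reads of max; Lines 7 and 9: uses of prime); the exact listing in Figure 2 may differ.

        #include <stdlib.h>
        #include <string.h>

        char *sieve(unsigned max) {                   /* line 1                                     */
            unsigned i, j;                            /* line 2                                     */
            char *prime = malloc(max);                /* line 3: def site of prime (size from max)  */
            memset(prime, 1, max);                    /* line 4: reads max                          */
                                                      /* line 5                                     */
            for (i = 2; i < max; i++)                 /* line 6: outer loop reads max               */
                if (prime[i])                         /* line 7: use of prime (read at index i)     */
                    for (j = i + i; j < max; j += i)  /* line 8: inner loop reads max               */
                        prime[j] = 0;                 /* line 9: use of prime (writes 0 at index j) */
            return prime;                             /* line 10                                    */
        }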
    While basic block and edge coverage (the most pervasive coverage metrics in greybox fuzzers) are performant, they often provide a poor approximation of program behavior. This is because code coverage ultimately represents a static view of the target, whereas data-flow coverage more closely captures the target’s run-time computations, i.e., how input is consumed by the target.
    Fuzzers using basic-block coverage cannot differentiate between different orderings of the same blocks. This can be improved by using edge coverage, which allows the fuzzer to differentiate between a loop’s forward and backward edges (such as the loops at Lines 6 and 8 in Figure 2).
    Unfortunately, edge coverage still loses important information about program behavior. Greybox fuzzers rely on coverage information to decide which input mutations lead to new program behaviors; however, uncovering new behaviors can be highly inefficient because a fuzzer guided by code coverage alone cannot identify which mutated input bytes led to the new behavior.
    Some fuzzers address this issue (i.e., determining which input bytes to mutate) by applying dynamic taint analysis (DTA). DTA improves mutation accuracy by tracking the subset of program values used as arguments to comparison operations. However, the effectiveness of DTA depends on its taint policy, which specifies the taint relationship between an instruction’s input and output.
    In Figure 2, max is user-controlled (i.e., the user selects the maximum prime number) and is therefore the taint source. While max is read directly on Lines 3, 4, 6, and 8, it is prime accesses that most accurately capture the program behavior. From a bug-finding perspective, prime accesses are also the most likely source of memory-safety vulnerabilities.
    Given max determines the size of prime (via malloc, Line 3), taint may propagate to prime. However, this is an implicit flow that the taint policy may not capture. For example, compiler-based DTA—e.g., LLVM’s DataFlowSanitizer (DFSan) [66]—cannot track taint outside uninstrumented code (e.g., through functions provided by external libraries, such as malloc). Ensuring taint is accurately tracked in uninstrumented code requires significant manual effort. Moreover, prior work has shown this accuracy to be highly variable and dependent on the DTA implementation (e.g., due to incorrect taint policies and unsupported instructions) [15].
    DTA is also expensive. She et al. [60] found none of their targets completed within a \(24 \,h\) period when run with the Triton DTA tool. We also found that Angora’s compiler-based DTA (built on DFSan) exhibited a run-time overhead of \(32.79\times\) over uninstrumented code from the SPEC CPU2006 benchmark suite (see Section 6.2). This is notable because prior work has found DFSan to be one of the more performant DTA frameworks (due to compile time—rather than run time—instrumentation) [60].
    Given the disadvantages of DTA (low accuracy and high cost), we propose an alternative approach: tracking data flows between prime’s def (Line 3) and use sites (Lines 7 and 9). The following section describes our data-flow tracking approach.

    4 Design

    A greybox fuzzer should maintain accurate coverage information without negatively impacting performance. These requirements exist irrespective of the coverage metric used. With this in mind, we describe: (i) a theoretical foundation for constructing data-flow-based coverage metrics; (ii) how datAFLow incorporates these observations; and (iii) the implementation of a datAFLow prototype.

    4.1 Coverage Sensitivity

    Based on Section 2.2, we define data-flow coverage as follows:
    Data-flow coverage is the tracking of def-use chains executed at run time.
    This definition allows us to explore data-flow-based coverage metrics with different sensitivities [57, 69]. We follow the program analysis literature and define sensitivity as a coverage metric’s ability to discriminate between a set of program behaviors [37]. In fuzzing, a coverage metric’s sensitivity is its ability to preserve a chain of mutated test cases until they trigger a bug [69]. Different sensitivities allow us to balance efficacy and performance: more sensitive metrics incur higher performance penalties (e.g., edge coverage sensitivity can be increased by incorporating function call context [12]; however, this requires additional instrumentation, increasing run-time overhead [57]).
    Like traditional data-flow analysis (Section 2.2), our data-flow coverage metric requires identifying variable def and use sites. Following Horgan and London [31], we define a data-flow variable def site as a name referring to storage allocated statically (e.g., storage class static, global) or automatically (i.e., local to a procedure). We deviate from this definition by: (i) including calls to dynamic memory allocation routines (e.g., malloc); and (ii) excluding reallocations/reassignments that would traditionally kill a definition. Instead, def s are only killed when they (a) go out of scope (e.g., a local variable in a returning procedure), or (b) are explicitly deallocated (e.g., via free). Consequently, a use site includes both reads/writes from/to a def site. We deviate from the classic definition to ensure scalability: the difficulties of scaling data-flow analyses on real-world programs are well known [11, 27, 62]. We believe reducing precision by not killing definitions (when assigning a new value to a variable) is a suitable trade-off to maintain scalability.
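    For illustration, the following snippet makes this deviation concrete: under the classic definition, the second write kills the previous definition of buf[0]; under our relaxed definition, both writes are simply uses of the single def site created when buf is allocated.

        void example(void) {
            int buf[4];   /* def site: buf is defined when allocated on the stack    */
            buf[0] = 1;   /* use (write)                                             */
            buf[0] = 2;   /* use (write); classically this kills the previous        */
                          /* definition, here it is another use of the same def site */
        }                 /* def killed: buf goes out of scope                       */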
    Once we identify def and use sites, datAFLow instruments these sites (using compiler-based instrumentation, discussed in Section 5) so def-use chains can be tracked at run time. However, exactly which def-use sites are instrumented (and hence which are tracked) depends on the required sensitivity. Inspired by Wang et al. [69], this leads us to define a pair of sensitivity lattices—one for def sites and another for use sites, in Figure 3—that can be composed to achieve the desired overall sensitivity (we discuss related threats to validity in Section 5.4).
    Fig. 3. Def and use site sensitivity lattices. The sensitivity of coverage metrics increases toward the bottom.

    4.1.1 Def Site Sensitivity.

    Complete data-flow coverage requires identifying and instrumenting all variable def sites. Unfortunately, the overhead to achieve this level of sensitivity is prohibitively expensive [10]. Therefore, a method for identifying (and hence instrumenting) a subset of important program variables is required. Ideally, this would be an (almost entirely) automated process, reducing the burden on the user.
    One approach is to partition def sites by type and restrict instrumentation to def sites of a given type (or type set). Figure 3(a) shows the sensitivity lattice for this type-based partitioning.
    Partitioning def sites by type has several advantages. For example, instrumenting array variables focuses the fuzzer on memory-safety vulnerabilities. Similarly, tracking the data flow of structs may allow for the discovery of type confusion vulnerabilities [35, 61]. Type-based partitioning requires some upfront knowledge of the target to ensure meaningful variables are tracked at run time. For example, the fuzzer may miss important program behaviors (and hence bugs) if only “uninteresting” variables are tracked (e.g., max in Figure 2).

    4.1.2 Use Site Sensitivity.

    Figure 3(b) shows the use site sensitivity lattice. Variables are either read from or written to (i.e., “accessed”). Variable accesses are strictly more sensitive than just writes or reads on their own. The simplest and least sensitive metrics only track when a variable is accessed (shown at the top of the lattice).
    Conversely, the most sensitive data-flow coverage metrics are ones that track not only when a particular variable is accessed, but the value of that variable when accessed. For example, considering Line 9 in Figure 2, this is the difference between writing to prime and assigning it the value 0. The latter is akin to traditional data-flow testing, which focuses on the values that variables take at run time [55, 62], and is similar to GreyOne, which monitors (a subset of) program variables and their values to infer taint [23]. Depending on the def site sensitivity, this approach will quickly saturate the fuzzer’s coverage map (due to the path collision problem [24]); a middle ground between this overly sensitive approach and simple accesses is required.
    We achieve this middle ground by incorporating more fine-grained spatial information into a variable’s use. This is particularly useful when def sites include arrays and/or structs (e.g., Line 9 in Figure 2), as def-use chains are now differentiated by the offset at which an array/struct is accessed (analogous to a field-sensitive static analysis).

    4.1.3 Composing Sensitivity Lattices.

    Different def-use sensitivities can be composed to track data flow at different granularities. We reuse the code in Figure 2 to illustrate this. Given the def sensitivity lattice in Figure 3(a), either: (i) all three variables (prime, i, and j); (ii) the indices i and j; or (iii) only the prime array are instrumented (and hence tracked). Here we restrict def site instrumentation to array variables. Consequently, only prime is tracked. This leads to varying def-use chains depending on the use site sensitivity.
    Simple access. The corresponding region in Figure 3(b). Tracks when prime is accessed (Lines 7 and 9 in Figure 2). This results in two def-use chains: Line 3 \(\leadsto\) Line 7 and Line 3 \(\leadsto\) Line 9. This is equivalent to basic block coverage (per Section 2.1): to reach the use at Line 9 requires the execution of all basic blocks in the CFG. Like block coverage, this provides a poor approximation of program behavior (as information about the loop and how it affects data is lost).
    Access with offset. The corresponding region in Figure 3(b). Tracks when prime is accessed along with the offsets where prime is accessed (indices i and j). This provides a more complete view of how prime is used with negligible overhead. This is similar to MemFuzz’s approach, which incorporates memory accesses into code coverage [16]. This results in \(2 \times (\mathtt {max} - 2)\) def-use chains: one for every read/write at each index where prime is read from/written to.
    Access with value. The corresponding region in Figure 3(b). Tracks when prime is accessed along with the values (being read/written) during these accesses. This is the most sensitive use site coverage metric and achieves the goal of traditional data-flow coverage: associate values with variables, and how these associations can affect the execution of the target [55]. This is similar to GreyOne’s “taint inference”, which looks at the value of variables used in path constraints [23].
    Again, this level of sensitivity results in \(2 \times (\mathtt {max} - 2)\) def-use chains. Here, prime’s value range is fully deterministic. However, these values will typically depend on user input, resulting in rapid saturation of the fuzzer’s coverage map.
    By composing def and use sensitivity lattices, we realize a variety of data-flow-based coverage metrics. We do this in our fuzzer, datAFLow, described in the following sections.

    5 Implementation

    Figure 4 depicts datAFLow’s high-level architecture, including: (i) compiler instrumentation (built on LLVM v12) for capturing def-use sites at the desired sensitivity (Sections 5.1 and 5.2); and (ii) a run-time library for feeding data-flow information to the fuzzing engine (Section 5.3).
    Fig. 4. High-level overview of datAFLow.
    Our architecture is agnostic to the underlying fuzzer; the instrumented target produced by the compiler (and linked with the fuzzalloc run-time library) can be executed by any AFL-based fuzzer (i.e., any fuzzer using an AFL-style coverage map). However, instead of recording and tracking control-flow coverage, the fuzzer’s coverage map tracks data-flow coverage.

    5.1 Def-Use Site Identification

    We must first identify def and use sites so that data flows between these sites can be tracked. Per Section 4.1, def site selection impacts coverage sensitivity: more instrumented def sites leads to more complete data-flow coverage. We implement several def site instrumentation schemes based on the type-based partitioning described in Section 4.1.1.
    We make the following assumptions during def-use site identification. First, we assume debug metadata is available in the LLVM IR. We use this metadata to identify and limit variable def sites to source-level variables. Second, we assume tracked variables are accessed via memory references (i.e., load/store instructions), rather than registers. This is automatic for most composite types (e.g., arrays). For primitive types (e.g., integers), this requires demoting registers to memory references (via LLVM’s reg2mem pass).
    The first assumption reduces the number of potential data flows and is adopted from prior work [20, 47]. The second assumption limits use sites to memory access instructions, simplifying instrumentation. We apply existing LLVM transforms to limit use sites to two instructions: loads and stores.2 Exactly which instructions we instrument depends on the use sensitivity required (configured at compile time). We describe our instrumentation in Section 5.2.

    5.2 Def-Use Tracking

    We reduce the run-time tracking of def-use chains to a metadata management problem. Here, def site identifiers are the metadata requiring efficient retrieval at use sites. Inspired by AFL’s approach for tracking edge coverage—where basic blocks (in the LLVM IR) are statically assigned a random 16-bit integer—we statically “tag” def sites (again, in the LLVM IR) with a random 16-bit integer (Section 5.2.1). This tag is then propagated to use sites, where it is retrieved and used to construct a def-use chain (Section 5.3).

    5.2.1 Def Site Instrumentation.

    We adopt Padding Area MetaData (PAMD) [41] for tracking def-use chains. PAMD extends baggy bounds checking, a technique proposed by Ding et al. [17] for protecting C and C++ code against buffer overruns. PAMD attaches inline metadata to memory objects (hence our assumption that tracked variables are accessed via memory references; Section 5.1) and provides constant-time lookup of this metadata. This lookup occurs via the “baggy bounds table”, which stores the binary logarithm of an object’s size and alignment (denoted \(e\) ). Once \(e\) is retrieved from the baggy bounds table, the base and size of an object pointed to by \(p\) is computed using:
    \begin{align} {\it base} & = p \mathbin \& \mathord {\sim }(2^e - 1) \end{align} (1)
    \begin{align} {\it size} & = 2^e \end{align} (2)
    Equations (1) and (2) require an object’s size and alignment to be a power-of-two. To meet this requirement, PAMD pads static objects (i.e., stack and global variables) before attaching the def site tag. Figure 5 illustrates this process. For example, given a 4-byte object, then \({\it size} = 8\), \(e = 3\) (the binary logarithm of size), and two bytes of padding are inserted before the tag.
    Fig. 5. PAMD’s approach for inline metadata. “Object” is aligned to a power-of-two boundary and “padding” is inserted to ensure size is a power-of-two.
    Objects whose padding or overall size becomes too large for static allocation are “heapified” (i.e., moved to the heap). We adopt CCured’s [51] approach to heapify objects. For heap-allocated objects (including heapified objects), calls to malloc, calloc, and realloc are replaced with tagged versions (e.g., __bb_malloc) accepting the 16-bit tag as an additional argument. Figures 6 and 7 demonstrate our def-use instrumentation.
    Fig. 6. The Sieve of Eratosthenes. The array prime is dynamically allocated. DatAFLow replaces this allocation with a call to __bb_malloc and registers this allocation in the baggy bounds table. The use site is instrumented with a call to __hash_def_use.
    Fig. 7. Example instrumentation of a stack variable def. An 8-byte buffer buf is allocated on the stack and filled by a call to read. The second byte in buf is later read.
    Figure 6 shows the (un)instrumented LLVM IR for the Sieve of Eratosthenes (Figure 2). We focus our def site instrumentation on the dynamically-allocated prime array (Line 2 in Figure 6(a)). DatAFLow tags this def site with the identifier 1337 (Line 2 in Figure 6(b)). This tagging occurs by replacing malloc with __bb_malloc, which also registers the allocation in the baggy bounds table.
    In comparison, Figure 7 demonstrates the instrumentation of a stack def site. The original code (Figure 7(a)) statically allocates an 8-byte buffer buf, filling it via a call to read. The buffer’s second element is later accessed. During compilation, datAFLow resizes buf to meet PAMD’s object size requirement. Here, six bytes of padding are inserted before the two-byte tag (Line 2 in Figure 7(b)). This def site is tagged with the identifier 1102 (Line 4) and registered in the baggy bounds table (Line 8).
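    To make the allocation-side mechanics concrete, the following is a simplified C sketch of what a tagged heap allocator along these lines could look like. It only approximates the behavior described above (power-of-two rounding, tag placement in the padding, baggy bounds registration); the slot size, table layout, and allocation strategy are our assumptions, not the actual fuzzalloc implementation.

        #include <stdint.h>
        #include <stdlib.h>
        #include <string.h>

        #define SLOT_SIZE  16                      /* assumed baggy bounds granularity */
        #define TABLE_SIZE (1UL << 24)             /* assumed table size               */
        static uint8_t bb_table[TABLE_SIZE];       /* stores e = log2(allocation size) */

        void *__bb_malloc(uint16_t tag, size_t size) {
            /* Smallest power-of-two block that fits the object plus the 2-byte tag. */
            size_t need = size + sizeof(uint16_t);
            size_t alloc = SLOT_SIZE;
            uint8_t e = 4;                         /* log2(SLOT_SIZE)                  */
            while (alloc < need) { alloc <<= 1; e++; }

            void *p = aligned_alloc(alloc, alloc); /* power-of-two size and alignment  */
            if (!p)
                return NULL;
            /* Place the 16-bit def site tag at the end of the padded object.         */
            memcpy((char *)p + alloc - sizeof(uint16_t), &tag, sizeof(uint16_t));

            /* Register e for every table slot spanned by the allocation.             */
            for (size_t off = 0; off < alloc; off += SLOT_SIZE)
                bb_table[(((uintptr_t)p + off) / SLOT_SIZE) % TABLE_SIZE] = e;
            return p;
        }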

    5.2.2 Use Site Instrumentation.

    Per Section 5.1, use sites are limited to load and store instructions in the LLVM IR (e.g., Line 7 in Figure 6(a) and Line 10 in Figure 7(a)). We instrument these instructions with a call to __hash_def_use, which retrieves the object’s size from the baggy bounds table and uses this size to retrieve the def tag. The size is also used to determine the offset at which an object is accessed (enabling the access with offset sensitivity described in Section 4.1.3). Like def sites, use sites are tagged at compile time with a randomly-generated identifier (e.g., 4242 at Line 7 in Figure 6(b) and 1234 at Line 17 in Figure 7(b)). Finally, we leverage several techniques from AddressSanitizer (ASan) [59] to limit the number of use instrumentation sites, thereby reducing overhead without sacrificing precision. We describe the internals of __hash_def_use, and how it integrates with the fuzzer, in the following section.

    5.3 Fuzzer Integration

    The __hash_def_use function constructs a def-use chain by hashing together the def and use sites. This hash is used as a lookup into the fuzzer’s coverage map to guide the fuzzer toward discovering new data flows. This is analogous to AFL tracing edges to discover new control flow paths. Consequently, we leverage techniques used by traditional greybox fuzzers (e.g., compact bitmaps) to efficiently record data-flow coverage [44].
    In particular, we use coarse data-flow coverage metrics—def-use chain hit counts stored in a compact bitmap—to achieve efficient fuzzing. While these techniques result in path collisions [24], we are willing to tolerate such imprecision to limit overhead costs. Coarse coverage metrics also lower implementation costs, enabling the reuse of existing fuzzing engines (here, AFL++ [21]).
    We adopt AFL’s hashing process for looking up data flows in the fuzzer’s coverage map. By default, AFL represents a control-flow edge using the following hash algorithm:
    \begin{equation} \begin{aligned}i & \leftarrow l \oplus l_{\mathrm{prev}} \\ l_{\mathrm{prev}} & \leftarrow l \gg 1 \end{aligned} \end{equation} (3)
    where \(l\) is a randomly-generated basic-block identifier (assigned at compile time) and \(l_{\mathrm{prev}}\) is the identifier of the previously-executed block. AFL uses the result \(i\) as an index into the coverage map. Right-shifting \(l\) allows AFL to differentiate between the two orderings of a pair of blocks. Our hash algorithm varies depending on the desired data-flow sensitivity (Section 4.1.3):
    Simple access. Xor of the def and use site tags:
    \begin{align} i & \leftarrow {\it def} \oplus {\it use} \end{align} (4)
    Access with offset. The def site tag, use site tag, and the offset being accessed. The offset (e.g., array index, struct offset) is computed by subtracting the base address—found using Equation (1)—from pointer \(p\) . We compute the hash as:
    \begin{align} i & \leftarrow {\it def} \oplus ({\it use} + \textrm {offset}) \end{align} (5)
    Access with value. The def site tag, use site tag, the offset, and the value accessed. The def/use tags and offset are left-shifted to allow room for the value hash and reduce collisions. The accessed value is divided into single-byte chunks \(\lbrace v_0, v_1, \ldots \rbrace\) that are hashed into the def-use chain:
    \begin{align} i & \leftarrow ({\it def} \oplus ({\it use} + \textrm {offset}) \ll 2) \oplus (v_0 \oplus v_1 \oplus \ldots) \end{align} (6)
    This is implemented as a loop, resulting in a double load of the accessed object (in addition to the load in the original code). We implement the __hash_def_use function so that uninstrumented data flows (i.e., those without an entry in the baggy bounds table) are bucketed in their own coverage map entry.
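    Putting Equations (1) and (4) through (6) together, a simplified sketch of such a use site hook could look as follows. It reuses the hypothetical bb_table, SLOT_SIZE, and TABLE_SIZE from the allocator sketch in Section 5.2.1 and folds the sensitivities into one function with the value loop enabled; the actual __hash_def_use in fuzzalloc differs in its details.

        #include <stdint.h>
        #include <string.h>

        #define MAP_SIZE 65536
        static uint8_t cov_map[MAP_SIZE];            /* AFL-style coverage bitmap      */

        void __hash_def_use(uint16_t use_tag, const void *p, size_t access_size) {
            uint8_t e = bb_table[((uintptr_t)p / SLOT_SIZE) % TABLE_SIZE];
            if (e == 0) {                            /* uninstrumented data flow:      */
                cov_map[0]++;                        /* bucket it in its own map entry */
                return;
            }
            uintptr_t base = (uintptr_t)p & ~((1UL << e) - 1);   /* Equation (1)       */
            size_t size = (size_t)1 << e;                        /* Equation (2)       */

            uint16_t def;                            /* def tag stored in the padding  */
            memcpy(&def, (const char *)base + size - sizeof def, sizeof def);
            uint16_t offset = (uint16_t)((uintptr_t)p - base);

            /* Access-with-value sensitivity (Equation (6)); dropping the value loop
             * yields Equation (5), and dropping the offset as well yields (4).        */
            uint16_t idx = (uint16_t)((def ^ (uint16_t)(use_tag + offset)) << 2);
            for (size_t i = 0; i < access_size; i++)
                idx ^= ((const uint8_t *)p)[i];      /* re-loads the accessed object   */
            cov_map[idx]++;                          /* def-use chain hit count        */
        }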

    5.4 Threats to Validity

    5.4.1 Def Site Selection.

    Our def site selection approach (Section 4.1.1) is incomplete: important data flows may be missed if the appropriate def sites are not instrumented. Per our def site sensitivity lattice, our prototype focuses on composite types (i.e., arrays and structs) and eschews instrumenting primitive types (e.g., integers). While this approach may miss important data flows, we accept this trade-off, given (a) memory safety remains a key concern [50], and (b) the prohibitive run-time overheads when tracking all def sites.

    5.4.2 Custom Memory Allocators.

    Identifying def sites is complicated because many applications do not directly call the standard allocation routines (e.g., malloc), but indirectly through a custom memory allocator. For example, standard memory allocation routines may be wrapped in other functions. These functions may then be indirectly called via global variables/aliases, stored and passed around in structs, or used as function arguments.
    To address the challenge imposed by custom memory allocators and memory allocation patterns, datAFLow allows the user to specify wrapper functions to tag (in addition to the standard allocation routines). While datAFLow requires the user to find these wrappers manually, existing techniques [14] could assist in this process. We wrap these memory allocation routines within trampoline functions when their address is taken (e.g., stored in a global variable). Rather than a compile-time def site tag (which may not be statically computable), these trampolines revert to using the lower 16-bits of the PC as the def site tag. This approach avoids the need for expensive and imprecise static analysis (e.g., to track the access of memory allocators through global variables).
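    As a rough illustration (the exact trampoline datAFLow emits may differ), such a wrapper can derive the tag from its own return address using a compiler builtin:

        #include <stdint.h>
        #include <stdlib.h>

        void *__bb_malloc(uint16_t tag, size_t size);   /* tagged allocator (Section 5.2.1) */

        /* Installed wherever malloc's address is taken (e.g., stored in a struct of
         * function pointers); the def site tag is the low 16 bits of the caller's PC. */
        void *malloc_trampoline(size_t size) {
            uint16_t tag = (uint16_t)(uintptr_t)__builtin_return_address(0);
            return __bb_malloc(tag, size);
        }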

    5.4.3 C++ Dynamic Memory Allocation.

    To simplify our instrumentation, we rewrite C++ new calls as malloc calls. However, this prevents us from handling any std::bad_alloc exceptions, meaning any failed allocations will cause a program crash (irrespective of any exception handlers in place). Such false positives are removed by replaying crashing inputs through the original target.

    5.4.4 Coverage Imprecision.

    Storing coarse coverage information in a compact bitmap is inherently inaccurate and incomplete [24]. While this may limit datAFLow’s ability to discover and explore data flows, this limitation is not unique to datAFLow and affects many greybox fuzzers [3, 4, 12, 16, 20, 23, 33, 43, 47, 69, 70, 73].

    6 Evaluation

    We perform an extensive evaluation (>3 CPU-yr of fuzzing) to test the following hypothesis:
    Data-flow-guided fuzzing offers superior performance (over control-flow-guided fuzzers) on targets where control flow is decoupled from semantics.
    Specifically, we answer the following research questions:
    RQ1
    Is data-flow-guided fuzzing viable with minimal run-time overheads? (Section 6.2)
    RQ2
    Does data-flow-guided fuzzing find more or different bugs? (Section 6.3)
    RQ3
    Does data-flow-guided fuzzing expand more coverage? (Section 6.4)
    RQ4
    Can we predict a priori the targets most amenable to data-flow-guided fuzzing? (Section 6.5)

    6.1 Methodology

    6.1.1 Fuzzer Selection.

    Our evaluation compares the performance of fuzzers using: (i) pure control-flow coverage; (ii) pure data-flow coverage; and (iii) exact and approximate DTA, combining control-flow coverage with data-flow tracking.
    We select AFL++ as the pure control-flow-guided fuzzer because it is the current state-of-the-art coverage-guided greybox fuzzer. We configure AFL++ with: (i) link-time optimization (LTO) instrumentation, eliminating hash collisions; and (ii) with and without “CmpLog” instrumentation. CmpLog—inspired by Redqueen’s input-to-state correspondence [5]—approximates DTA by capturing comparison operands. Similarly, we select Angora as an alternative control-flow-guided fuzzer (using context-sensitive edge coverage) that also incorporates exact DTA. Finally, we select DDFuzz as an alternative data-flow-guided fuzzer.
    We configure datAFLow with: (i) two def site sensitivities: arrays only (“A”) and arrays \(+\) structs (“A+S”); and (ii) three use site sensitivities: simple access (“A”), accessed offset (“O”), and accessed value (“V”). We use the notation “\(X\)/\(Y\)” to refer to the composition of \(X\) def and \(Y\) use site sensitivities; e.g., “A/A” refers to array def and access use sites; “A+S/O” refers to arrays \(+\) structs def and accessed offset use sites. The evaluated fuzzers are summarized in Table 2.
    Name      Map size (KB)  Description
    ALTO      —              AFL++ with LTO instrumentation
    ACL       —              AFL++ with LTO and CmpLog instrumentation
    An        1024           Angora
    DD        64             DDFuzz
    DFA/A     1024           datAFLow with array defs and accessed uses
    DFA/O     1024           datAFLow with array defs and accessed offset uses
    DFA/V     1024           datAFLow with array defs and accessed value uses
    DFA+S/A   1024           datAFLow with array + struct defs and accessed uses
    DFA+S/O   1024           datAFLow with array + struct defs and accessed offset uses
    DFA+S/V   1024           datAFLow with array + struct defs and accessed value uses

    Table 2. Evaluated Fuzzer Configurations
    Angora and DDFuzz use their default map sizes. AFL++’s LTO instrumentation does not require a fixed-size map.

    6.1.2 Target Selection.

    We evaluate the ten fuzzer configurations in Table 2 on the following targets. We fuzz 20 target programs in total.
    SPEC CPU2006. The SPEC CPU benchmark suite [28] is an industry-standardized, CPU-intensive benchmark suite for stress-testing a system’s processor, memory subsystem, and compiler. We use SPEC CPU2006 to answer RQ1.
    Magma. Unlike other fuzzing benchmarks (e.g., UniFuzz [40]), Magma [26] contains ground-truth bug knowledge. We exclude the php target because it failed to build with AFL++’s CmpLog instrumentation (failing with a segmentation fault). We use 15 Magma targets to answer RQ2.
    DDFuzz dataset. Mantovani et al. [47] select five targets—bison, pcre2, mir, qbe, and faust—they believe to contain a large number of data dependencies, and hence are amenable to data-flow-guided fuzzing. We use newer versions of these targets (because some did not compile on Ubuntu 20.04), shown in Table 3. We use these targets to answer RQ3.
    Target  Driver     Command line     Commit hash
    bison   bison      @@ -o /dev/null  5555f4d
    pcre2   pcre2test  @@ /dev/null     db53e40
    mir     c2m        @@               852b1f2
    qbe     qbe        @@               c8cd282
    faust   faust      @@               13def69

    Table 3. DDFuzz Target Dataset

    6.1.3 Experimental Setup.

    We conduct all experiments on an Ubuntu 20.04 AWS EC2 instance with a 48-core Intel® Xeon® Platinum 8275CL \(3.0 \,GHz\) CPU and \(92 \,GiB\) of RAM. Each fuzz run was conducted for \(24 \,h\) and repeated five times (ensuring statistically sound results). All targets were bootstrapped with their provided seeds.3 Finally, we (a) manually located and specified memory allocation functions for datAFLow to tag, and (b) used Angora’s default behavior to discard taint when calling an external library.

    6.2 Run-time Overheads (RQ1)

    Conventional wisdom assumes data-flow-based coverage metrics are too heavyweight, adversely affecting a fuzzer’s performance by reducing its execution rate. We investigate the extent to which this assumption is true by isolating the effects of instrumentation overhead outside of a fuzzing environment. Per Section 6.1.2, we measure performance overheads on SPEC CPU2006.
    Table 4 shows the overhead of all ten evaluated fuzzers on all 19 C and C++ targets in the SPEC CPU2006 v1.0 benchmark suite. We compare these measurements against a baseline without instrumentation (clang v12), calculating the geometric mean (“geomean”) and \(95\%\) bootstrap confidence intervals (CI) over three repeated iterations. The following results are omitted because the targets failed to build or run: 445.gobmk under AFL++ (LTO) triggered a run-time assertion; 429.mcf under datAFLow (all configurations) crashed with a run-time segmentation fault; and 447.dealII, 471.omnetpp, 473.astar, and 483.xalancbmk under Angora failed to link with DFSan’s run-time library.
    Target          ALTO   ACL    An      DD     DFA/A   DFA/O   DFA/V   DFA+S/A  DFA+S/O  DFA+S/V
    400.perlbench   1.27   3.86   141.85  21.81  12.04   12.75   16.34   12.51    13.21    16.79
    401.bzip2       1.26   2.17   25.83   2.54   7.75    8.53    11.12   7.69     8.49     11.09
    403.gcc         1.30   3.40   21.19   3.45   19.53   21.22   26.07   19.58    21.18    26.20
    429.mcf         1.12   2.46   12.08   1.52   —       —       —       —        —        —
    445.gobmk       —      2.48   23.41   5.26   6.99    7.51    9.48    6.92     7.44     9.73
    456.hmmer       1.12   3.08   60.41   1.56   13.47   15.07   21.61   13.60    14.95    21.62
    458.sjeng       1.21   4.36   29.69   4.44   7.57    8.13    10.05   7.54     8.01     10.32
    462.libquantum  1.20   2.40   27.09   1.61   3.81    4.04    6.97    3.70     4.14     6.97
    464.h264ref     1.19   1.82   41.01   1.88   100.63  109.40  134.10  100.96   109.68   134.58
    471.omnetpp     1.06   2.02   —       2.05   6.58    6.34    6.82    6.15     6.34     7.56
    473.astar       1.13   2.19   —       1.59   5.53    5.83    6.80    5.60     5.96     7.28
    483.xalancbmk   1.29   5.04   —       3.48   11.44   12.25   15.67   11.66    12.43    15.93
    Geomean         1.19   2.80   32.79   2.91   10.69   11.41   14.64   10.65    11.47    15.01
    (95% CI)        ±0.00  ±0.01  ±0.34   ±0.03  ±0.13   ±0.18   ±0.22   ±0.16    ±0.21    ±0.21
    Table 4. SPEC CPU2006 Overhead
    Computed as the geomean (over three repeated iterations) relative to an uninstrumented benchmark (compiled with clang v12). The \(95\%\) bootstrap CI is reported for the geomean across all targets (for a given fuzzer). The bootstrap CI is zero for individual targets and hence is omitted.
    Per Section 3, Angora has a geomean overhead of \(32.79\times\) . This is particularly notable because previous work has found DFSan—the framework upon which Angora’s taint tracking mode is built—to be one of the more performant DTA frameworks [60]. However, while this overhead is significantly higher than AFL++ (LTO) and AFL++ (CmpLog)—which have geomean overheads of \(1.19\times\) and \(2.80\times\) , respectively—it is important to recall Angora amortizes this cost over the lifetime of a fuzzing campaign by only tracking taint once on a given input over many mutations.
    Of the six datAFLow configurations, A/A has the lowest overhead ( \(10.69\times\) ), while A \(+\) S/V has the highest ( \(15.01\times\) ). This is unsurprising, given the rolling hash approach used for the “access with value” use sensitivity (Section 5.3). Performance improvements are possible by specializing the hash function based on the type of value accessed (e.g., hashing a uint64_t or float value directly, rather than dividing it into single-byte chunks). Increasing the def site sensitivity to include structs added minimal overhead. However, this is target specific: the median number of tracked arrays (across the 12 SPEC CPU2006 targets) is 51, compared to 33 structs. This result may not generalize across targets where structs outnumber arrays.
    Our results reflect those presented by Liu and Criswell [41] (e.g., 464.h264ref has the highest run-time overhead in both the original work and ours). However, there is a significant increase in our run-time overheads compared to the original PAMD implementation [41]. To validate our PAMD (re)implementation, we evaluated a version of datAFLow that only performed metadata lookup in the baggy bounds table (i.e., it did not construct def-use chains nor update the fuzzer’s coverage map). This version of datAFLow has a geomean overhead of \(3.97\times\). Def-use chain construction is a simple xor operation (Section 5.3), so we attribute this dramatic increase in run-time overhead to the interaction of the baggy bounds table and coverage map; in particular, to cache effects associated with reading from/writing to these two tables.

    6.3 Bug Finding (RQ2)

    Following prior work [2, 26, 29, 68], we use survival analysis to summarize our bug-finding results. Table 5 shows the restricted mean survival time (RMST), measuring the mean time for a bug to “survive” (i.e., remain undiscovered) five repeated \(24 \,h\) fuzz runs. Lower RMSTs imply a fuzzer finds a bug “faster”, while a smaller CI implies the fuzzer finds the given bug (at a given time) more consistently. We use the log-rank test [46]—computed under the null hypothesis that two fuzzers share the same survival function—to statistically compare bug survival times. Thus, we consider two fuzzers to have statistically equivalent bug survival times if the log-rank test’s \(p\textrm {-value} \gt 0.05\) .
    Table 5. Magma Bugs, Presented as the RMST (in Hours) with \(95\%\) Bootstrap CI
    We present our bug-finding results in Table 5. Based on raw bug counts, AFL++ was the best-performing fuzzer, triggering 60 bugs. The two data-flow-driven fuzzers followed: DDFuzz (44 bugs) and datAFLow (41 bugs). Angora was the worst-performing fuzzer, triggering only 24 bugs.
    DatAFLow with “simple access” use sensitivity (DFA/A and DFA+S/A) was the best performing version of datAFLow (39 bugs). This was followed by DFA+S/O (31 bugs). DatAFLow with “accessed value” use sensitivity was the worst performer. This suggests incorporating variable values at use sites is not worth the increased run-time cost; simply tracking the existence of def-use chains is “good enough” (for discovering bugs).
    AFL++ remains the best-performing fuzzer when accounting for RMSTs (i.e., it triggers bugs fastest), outperforming the data-flow-guided fuzzers for the majority of bugs triggered (\(60\%\)). However, this result is reversed (i.e., the data-flow-guided fuzzers outperform AFL++) for \(14\%\) of the triggered bugs. Notably, datAFLow was the only fuzzer to trigger LUA003 (not previously triggered by any fuzzer in any prior Magma evaluation), while datAFLow and DDFuzz triggered XML001 (xmllint) and LUA004 orders-of-magnitude faster than AFL++. DDFuzz was the only fuzzer to trigger PDF008. However, this bug was only triggered once (over five trials) and towards the end of the trial (after \(20 \,h\)). This suggests that the bug is difficult to find and DDFuzz may have just “gotten lucky”. Finally, AFL++ either failed to trigger or was orders-of-magnitude slower at triggering SSL009 (x509) and PDF003 (pdfimages). Do these bugs share properties that make them amenable to discovery via data-flow-guided fuzzing? To answer this question, we examine the two lua bugs in greater depth.
    LUA003. This bug is caused by a missing check of the “mode” argument to popen. The check is shown in Figure 8. While the check is quickly reached by DDFuzz (after \(\mathord {\sim }\) \(4 \,h\) ) and all six datAFLow variations (on average, after \(\mathord {\sim }\) \(60 \,s\) ), the exact trigger conditions were only met once by DFA+S/A. Upon examining the compiled binary, we found the second check ( \({\tt m[1] == '\backslash 0'}\) ) was optimized to a branchless operation (i.e., it did not contain conditional control flow). This effectively makes the program state where \({\tt m[1] != '\backslash 0'}\) invisible to a control-flow-guided fuzzer (in particular, there is no explicit edge for AFL++ to instrument). This state is explicitly visible to datAFLow, which reaches it after \(\mathord {\sim }\) \(19 \,h\) of fuzzing.
    Fig. 8. LUA003 missing popen check.
    LUA004. This is a logic bug, caused by a missing update to the interpreter’s “old” program counter (occurring under particular conditions when tracing the execution of a Lua function). Again, there is no explicit “state” in the target’s CFG for the fuzzer to reach. Instead, the bug is triggered when the oldpc field in the lua_State struct is not updated. This only happens under particular conditions, again depending on specific data values.

    6.4 Coverage Expansion (RQ3)

    Control-flow coverage is typically quantified by reasoning over the target’s CFG (e.g., basic blocks, edges, lines of code). For example, FuzzBench replays the fuzzer’s queue through an independent and precise (i.e., collision-free) coverage metric; specifically, Clang’s source-based coverage [48, 67]. However, the equivalent process for quantifying data-flow coverage does not exist.
    We quantify coverage expansion using both control-flow and data-flow metrics, using (a) static analyses to approximate an upper bound, and (b) dynamic analyses to quantify coverage expansion against this upper bound. The usual limitations of static analysis (e.g., undecidability) mean this upper bound may be larger than the set of executable coverage elements (e.g., a code region may not be reachable from the target’s driver, or a pointer’s points-to set may be over-approximated). We accept this imprecision for both metrics. We use the Mann-Whitney \(U\) -test [45] to statistically compare dynamic coverage across fuzzers: two fuzzers cover the same number of coverage elements if the Mann-Whitney \(U\) -test’s \(p\textrm {-value} \gt 0.05\) .
    Control-flow coverage. We use Clang’s existing source-based coverage metric [67]. Specifically, we use region coverage (as used by FuzzBench), Clang’s version of statement coverage. Like classic statement coverage, region coverage is more granular than function and line coverage [32]. Region information is embedded into the target during compilation and can be statically extracted using existing LLVM tooling (to obtain the upper bound).
    Data-flow coverage. We develop an SVF-based [63] static analysis to compute the set of def-use chains in a target (for the set of tracked variables, as determined by the chosen def site sensitivity). This analysis leverages a flow- and context-insensitive interprocedural pointer analysis based on the Andersen algorithm [1].4 For the dynamic analysis, we modify the PAMD metadata stored at each def site (Section 5.2.1) to store a tuple of \(\langle {\it variable name}\), \({\it location}\rangle\), where location is another tuple \(\langle {\it source filename}\), function name, line, \({\it column}\rangle\). Both tuples are constructed by extracting source-level information from the target’s debug information. A use site (Section 5.2.2) is similarly labeled with a location tuple. Unlike the 16-bit tags used by datAFLow, this approach does not result in hash collisions and is precise (albeit with a higher run-time cost). Importantly, neither the static nor the dynamic analysis takes into account def-use chain values. We also exclude dynamic memory allocations from these analyses (to simplify run-time def-use tracking when faced with custom memory allocators, per Section 5.4.2).
    Table 6 and Figures 9 and 10 summarize our coverage expansion results. Two targets, bison and faust, failed to build with AFL++’s CmpLog (again, due to a segmentation fault) and are excluded from our results.
    Fig. 9. Control-flow coverage expansion over time. The \(x\)-axis is time in seconds (log scale), and the \(y\)-axis is the percentage of code regions expanded (against the static upper bound in Table 6(a)). The mean coverage (over five repeated trials) and \(95\%\) bootstrap CI is shown.
    Fig. 10. Data-flow coverage expansion over time. The \(x\)-axis is time in seconds (log scale), and the \(y\)-axis is the percentage of def-use chains expanded (against the static upper bound in Table 6(b)). The mean coverage (over five repeated trials) and \(95\%\) bootstrap CI is shown.
    Table 6. Coverage Expansion (Control and Data Flow) on DDFuzz Targets, Reported as the Mean over Five Repeated Trials with \(95\%\) Bootstrap CI
    AFL++ is again the best-performing fuzzer, achieving the highest control-flow (i.e., code region) coverage. CmpLog improves AFL++’s already-strong coverage expansion capabilities. These results are unsurprising, given control-flow coverage (specifically, edge coverage) guides AFL++. Similarly, Angora again performs poorly, outperformed by both DDFuzz and datAFLow in maximizing both control- and data-flow coverage. Curiously, however, AFL++ also achieves the highest data-flow (i.e., def-use chain) coverage. This is despite datAFLow’s data-flow guidance. We attribute this (surprising) result to the differences in fuzzer execution rates (i.e., the number of inputs executed by the fuzzer per unit of time).

    6.4.1 Accounting for Execution Rates.

    AFL++ (LTO) achieves a mean execution rate of \(1172 \,execs/s\) (median \(347 \,execs/s\)). In contrast, DDFuzz, Angora, and datAFLow achieve mean execution rates of \(974\), \(616\), and \(270 \,execs/s\), respectively (median \(442\), \(249\), and \(144 \,execs/s\)). This dramatic decrease in execution rates reflects our overhead results in Section 6.2.
    To account for differences in execution rates, rather than comparing coverage at the end of each fuzz run (i.e., after \(24\) h of fuzzing), we compare coverage at a given execution (“exec”). Specifically, we compare coverage at the last exec of the slowest fuzzer (i.e., the one with the lowest execution rate). Intuitively, this places a “ceiling” on the coverage achieved by faster fuzzers (i.e., those able to execute more inputs within a single \(24\) h fuzz run). For example, DFA+S/V is the slowest fuzzer on bison (\(85\) execs/s). Thus, we compare the coverage achieved at the last execution of DFA+S/V (exec \(= 7{,}358{,}231\)), implicitly ignoring any additional coverage expanded after this exec. Unfortunately, Angora does not provide the necessary information to map coverage to a particular exec, so we exclude it from our analysis (despite it being the slowest fuzzer on two targets: bison and faust).
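    Concretely, this normalization amounts to truncating each fuzzer’s coverage-over-execs curve at the ceiling exec. The sketch below assumes each trial is summarized as a list of \(\langle\)exec, cumulative coverage\(\rangle\) samples sorted by exec (an assumed format; CoveragePoint and coverageAtExec are names introduced only for this illustration).

        #include <cstdint>
        #include <vector>

        // Illustrative sketch of the exec-based coverage ceiling.
        // One sampled point on a fuzzer's coverage-over-execs curve.
        struct CoveragePoint {
          uint64_t Exec;    // cumulative number of target executions
          double Coverage;  // cumulative coverage (e.g., def-use chains expanded)
        };

        // Coverage achieved by the time `CeilingExec` executions were reached:
        // the last sample whose exec count does not exceed the ceiling.
        // Assumes `Curve` is sorted by Exec; returns 0 if no sample qualifies.
        static double coverageAtExec(const std::vector<CoveragePoint> &Curve,
                                     uint64_t CeilingExec) {
          double Best = 0.0;
          for (const CoveragePoint &P : Curve) {
            if (P.Exec > CeilingExec)
              break;
            Best = P.Coverage;
          }
          return Best;
        }

    The ceiling is taken from the slowest fuzzer’s final exec count (e.g., exec \(= 7{,}358{,}231\) for DFA+S/V on bison).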
    We present coverage “normalized” against execution rates in Table 7. DatAFLow is now more competitive (against AFL++) in expanding data-flow coverage. It achieves the highest def-use chain coverage on bison and faust, and is only \(\sim 4\%\) behind the number of def-use chains expanded by AFL++ on qbe. Again, increasing datAFLow’s use sensitivity to include variable values fails to improve fuzzing outcomes. These results reinforce our belief that fuzzer execution rates have a significant impact on fuzzing outcomes.
    Table 7. Coverage Expansion—Control and Data Flow (“CF” and “DF”, Respectively)—on DDFuzz Targets

    6.5 Characterizing Data-Flow (RQ4)

    The fuzzing community has largely settled on control-flow-based coverage metrics—in particular, edge coverage—to drive a fuzzer’s exploration. While prior successes have largely validated this approach [18, 56, 58, 65, 73], we wish to understand what (if any) program characteristics lend themselves to data-flow-based coverage.
    Mantovani et al. [47] propose the DD ratio—defined as the ratio between the number of basic blocks instrumented with data-dependency information over the total number of basic blocks in the target—to determine whether data-flow-based coverage (derived from the target’s DDG) adds value (e.g., over edge coverage). A higher DD ratio suggests the target is more amenable to data-flow-guided fuzzing; a target with a DD ratio above \(10\%\) is considered strongly data dependent.
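    Restated in notation (following the definition above): with \(B\) the set of basic blocks in the target and \(B_{dd} \subseteq B\) the blocks instrumented with data-dependency information, \[ \textit{DD ratio} = \frac{|B_{dd}|}{|B|}, \] and a target is considered strongly data dependent when this ratio exceeds \(10\%\).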
    Table 8 summarizes the DD ratio of our 20 target programs.5 Thirteen of these targets (\(65\%\)) have DD ratios \(\ge 10\%\), indicating their suitability for data-flow-guided fuzzing. However, we found little correlation between a target’s DD ratio and fuzzing outcomes (both bug finding and coverage expansion). For example, png_read_fuzzer had the highest DD ratio among the Magma targets (\(13.40\%\)), closely followed by xmllint (\(13.03\%\)). However, AFL++ outperformed the data-flow-guided fuzzers (DDFuzz and datAFLow) on both targets (across bug counts and survival times). Similarly, pcre2test and c2m had the highest DD ratios among the DDFuzz targets (\(22.60\%\) and \(21.82\%\), respectively). Again, AFL++ outperformed the two data-flow-guided fuzzers (across both control- and data-flow coverage expansion).
    Table 8. Characterizing Data Flow using the Data Dependency Ratio (“DD Ratio”) Introduced by Mantovani et al. [47]
    Based on these results, we conclude that the DD ratio is not suitable for determining a target’s suitability for data-flow-guided fuzzing. We propose an alternative approach in Section 6.7.

    6.6 Discussion

    Comparison to the registered report. DatAFLow’s implementation has evolved significantly since the initial registered report [30]. In particular, def-use chain tracking changed from using low-fat pointers to PAMD (Section 5.2). Consequently, heapification of all tracked def sites is no longer required (only def sites that cannot be statically resized to fit the PAMD metadata require heapification). Surprisingly, this resulted in higher run-time overheads. Despite this, our bug-finding results improved from triggering 10 bugs to triggering 41. We attribute this improved result to PAMD’s robustness and its ability to work on a wider variety of targets (e.g., openssl failed to build with datAFLow in our preliminary evaluation).
    Coverage sensitivity. In Section 4.1, we introduced a framework for reasoning about and constructing data-flow coverage metrics for greybox fuzzing. This framework allows the user to balance precision with performance. Our results suggest that fuzzing outcomes (i.e., bug finding and coverage expansion) fail to improve as precision increases. Notably, this finding also applies to Angora; Angora’s exact DTA provided little benefit over the approximate DTA used by AFL++’s CmpLog mode. Our results reflect prior findings that demonstrate the importance of maximizing fuzzer execution rates [5, 23, 29, 54, 72].
    Bugs vs. coverage. Böhme et al. [9] found the fuzzer best at maximizing coverage expansion may not be the best at finding bugs. Our results reflect this finding; despite AFL++ outperforming datAFLow on coverage expansion (Section 6.4), datAFLow triggered bugs AFL++ failed to find (Section 6.3). Ultimately, fuzzers are deployed to find bugs and vulnerabilities; our findings reinforce the need for bug-based fuzzer evaluation [26, 38, 74] (not only a comparison of coverage profiles).
    Computing coverage upper bounds with static analysis. In Section 6.4 we used static analysis to approximate a coverage upper bound (for both control- and data-flow coverage). In theory, this upper bound is useful for estimating the residual risk of ending a fuzz run before coverage is maximized (analogous to the residual risk of missing a bug [6]). In practice, static analysis of “real-world” programs is fraught: dynamically loaded code, JIT-compiled code, and inline assembly all impact precision. Even specific command-line arguments influence the reachability of particular code regions. Thus, it is difficult to determine how realistic the upper bounds in Section 6.4 are. We leave improving coverage-based residual-risk estimation to future work.
    Testing our hypothesis. We hypothesized that data-flow-guided fuzzing offers superior performance on targets where control flow is decoupled from semantics. Our results lead us to reject this hypothesis. In most cases, control-flow-guided fuzzers outperformed data-flow-guided fuzzers (across both bug-finding and coverage-expansion metrics, and on targets identified as being amenable to data-flow-guided fuzzing). However, we are not prepared to give up on data-flow-guided fuzzing; despite lower run-time costs than DTA, datAFLow’s run-time costs remain high, negatively impacting coverage expansion. Despite this impediment, datAFLow discovers bugs control-flow-guided fuzzers do not. We believe reducing the run-time costs of data-flow-guided fuzzers will improve fuzzing outcomes.

    6.7 Future Work

    The significant run-time overheads remain the primary impediment to the adoption of data-flow-guided fuzzing (see Section 6.2). Liu and Criswell [41] propose using interprocedural optimizations to eliminate unnecessary object (de)allocation in the baggy bounds table, improving performance. Similarly, more sophisticated pointer analyses (e.g., those provided by SVF) could be used to eliminate unnecessary def/use site instrumentation (e.g., removing redundant instrumentation when def-use chains can be statically identified).
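    As a purely local illustration of the latter idea (and not datAFLow’s implementation), the sketch below elides use-site instrumentation when the accessed object is an alloca whose every user is a direct load or store, so its def-use chains can be enumerated at compile time; a real implementation would instead defer to a whole-program pointer analysis such as SVF’s.

        #include "llvm/IR/Instructions.h"
        #include "llvm/IR/Value.h"
        #include "llvm/Support/Casting.h"

        using namespace llvm;

        // Sketch: is this load's def-use chain statically known? If so,
        // run-time tracking adds no information and could be elided.
        static bool defUseStaticallyKnown(const LoadInst &LI) {
          const auto *AI =
              dyn_cast<AllocaInst>(LI.getPointerOperand()->stripPointerCasts());
          if (!AI)
            return false; // not a simple stack object; be conservative
          for (const User *U : AI->users())
            if (!isa<LoadInst>(U) && !isa<StoreInst>(U))
              return false; // pointer may escape; chains not enumerable here
          return true;
        }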
    Per Section 5.3, datAFLow is prone to hash collisions. It is well known that hash collisions cause fuzzers to miss program behaviors [24]. While AFL++’s LTO mode solves the hash collision problem for edge coverage, we did not investigate a similar technique for def-use chain coverage. A hash-collision-free datAFLow may lead to improved coverage expansion.
    Finally, datAFLow exclusively uses def-use chain coverage to drive exploration. In contrast, other data-flow-guided fuzzers (e.g., InvsCov [20], DDFuzz [47]) combine data flow with control flow. Given our bug-finding results—i.e., those where datAFLow significantly outperformed AFL++ (e.g., LUA003, LUA004, SSL009, and PDF003)—combining datAFLow with hash-collision-free edge coverage may provide a “best of both worlds” solution (echoing the conclusions reached by Salls et al. [57]). This combination of coverage metrics could be realized by combining control- and data-flow coverage in a single coverage map, maintaining separate coverage maps, or dynamically switching between different instrumented targets.
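    The first of these options could look roughly like the following (a sketch under assumed map sizes and symbol names, not AFL++’s or datAFLow’s actual runtime): a single shared coverage map whose lower half records edge hits and whose upper half records def-use chain hits, so the fuzzer’s existing map-comparison logic remains unchanged.

        #include <cstdint>

        // Sketch: one shared coverage map partitioned between edge and def-use hits.
        constexpr uint32_t kMapSize = 1 << 17;          // total map size (assumed)
        constexpr uint32_t kEdgeRegion = kMapSize / 2;  // edges live in the lower half

        extern uint8_t *coverage_map;  // shared with the fuzzer (assumed symbol)

        // Edge coverage: AFL-style hash of previous/current block IDs.
        static inline void record_edge(uint32_t PrevId, uint32_t CurId) {
          coverage_map[(PrevId ^ CurId) % kEdgeRegion]++;
        }

        // Data-flow coverage: def/use tags hashed into the upper half,
        // so the two metrics never alias one another.
        static inline void record_def_use(uint16_t DefTag, uint16_t UseTag) {
          uint32_t Idx = (static_cast<uint32_t>(DefTag) << 16) | UseTag;
          coverage_map[kEdgeRegion + (Idx % (kMapSize - kEdgeRegion))]++;
        }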
    Our results in Section 6.5 led us to conclude that the DD ratio is not suitable for determining a target’s suitability for data-flow-guided fuzzing. Prior work on characterizing programs for automated test suite generation is also unsuitable; e.g., the approaches proposed by Neelofar et al. [52] and Oliveira et al. [53] are specific to object-oriented software and focus on control-flow features. Instead, we propose subsumption.
    We say that coverage metric \(\mathcal {M}_1\) strictly subsumes metric \(\mathcal {M}_2\) if covering all coverage elements in \(\mathcal {M}_1\) also covers all elements in \(\mathcal {M}_2\). For example, edge coverage strictly subsumes basic block coverage. Relaxing this definition of strict subsumption allows us to quantify the number of coverage elements in \(\mathcal {M}_2\) not subsumed by \(\mathcal {M}_1\) (formalized below). Intuitively, the more elements in \(\mathcal {M}_2\) that are not subsumed by \(\mathcal {M}_1\), the more likely fuzzing with \(\mathcal {M}_2\) is to reach behaviors not detectable by \(\mathcal {M}_1\). Static data-flow analysis frameworks such as those proposed by Chaim et al. [11] could be used to perform this subsumption analysis. We leave the investigation of such techniques for future work.
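    One way to make this precise (our reading of the definition above): \(\mathcal {M}_1\) strictly subsumes \(\mathcal {M}_2\) if every execution (or test suite) covering all elements of \(\mathcal {M}_1\) necessarily covers all elements of \(\mathcal {M}_2\). The relaxed quantity is then the set of elements of \(\mathcal {M}_2\) not subsumed by \(\mathcal {M}_1\), \[ \mathit{NotSub}(\mathcal {M}_1, \mathcal {M}_2) = \{\, e \in \mathcal {M}_2 \mid \text{some execution covers all of } \mathcal {M}_1 \text{ but not } e \,\}, \] and \(|\mathit{NotSub}(\mathcal {M}_1, \mathcal {M}_2)|\) estimates how many behaviors a fuzzer guided by \(\mathcal {M}_2\) can distinguish that \(\mathcal {M}_1\) cannot.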

    7 Conclusions

    Observing that existing fuzzers introduce taint tracking only as an adjunct to control flow, we investigated data flow as an alternative coverage metric, making data-flow coverage a first-class citizen. Driven by empirical results and the conventional wisdom gathered over years of software-testing research, we hypothesized that data-flow-guided fuzzing would offer superior outcomes (over control-flow-guided fuzzing) on targets where control flow is decoupled from semantics.
    Our results show that “classic” control-flow-guided fuzzing produces better outcomes (bug finding and coverage expansion) in most cases. The high run-time costs associated with data-flow tracking impaired the fuzzer’s ability to explore a target’s behavior efficiently. Despite these costs, our data-flow-guided fuzzer discovered bugs that control-flow-guided fuzzers did not. These results suggest that data-flow-guided fuzzers discover different, not more, bugs: specifically, bugs residing in program states not explicitly visible in the target’s CFG. A better understanding of bug characteristics, rather than program characteristics, may shed light on this result. We release our data-flow sensitivity framework and datAFLow prototype at https://github.com/HexHive/datAFLow. Our hope is to stimulate further research into data-flow-guided fuzzing.

    Acknowledgments

    The authors are grateful to Arlen Cox, Michael Norrish, Andrew Ruef, and the anonymous reviewers for their detailed feedback and insightful suggestions for improving this work.

    Footnotes

    1. The original fuzzer of Miller et al. [49] is now known as a blackbox fuzzer (because it has no knowledge of the target’s internals).
    2. We lower atomic memory intrinsics and expand llvm.mem* intrinsics so we can focus on load/store instructions (both of which are trivial to identify and hence instrument).
    3. We contacted Mantovani et al. [47] to obtain their initial seed sets.
    4. We experimented with SVF’s flow-sensitive interprocedural analysis but found the run-time overheads prohibitively large.
    5. These values differ from the original DDFuzz evaluation [47] because we used newer versions of the targets (per Section 6.1.2).

    References

    [1] Lars Ole Andersen. 1994. Program Analysis and Specialization for the C Programming Language. Ph.D. Dissertation. University of Copenhagen.
    [2] Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In International Conference on Software Engineering (ICSE). ACM/IEEE, 1–10.
    [3] Cornelius Aschermann, Tommaso Frassetto, Thorsten Holz, Patrick Jauernig, Ahmad-Reza Sadeghi, and Daniel Teuchert. 2019. NAUTILUS: Fishing for deep bugs with grammars. In Network and Distributed Systems Security Symposium (NDSS). The Internet Society, 15.
    [4] Cornelius Aschermann, Sergej Schumilo, Ali Abbasi, and Thorsten Holz. 2020. Ijon: Exploring deep state spaces via fuzzing. In IEEE Symposium on Security and Privacy (SP). 1597–1612.
    [5] Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, and Thorsten Holz. 2019. REDQUEEN: Fuzzing with input-to-state correspondence. In Network and Distributed Systems Security Symposium (NDSS). The Internet Society, 15.
    [6] Marcel Böhme, Danushka Liyanage, and Valentin Wüstholz. 2021. Estimating residual risk in greybox fuzzing. In European Software Engineering Conference and Symposium on Foundations of Software Engineering (ESEC/FSE). ACM, 230–241.
    [7] Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. 2017. Directed greybox fuzzing. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 2329–2344.
    [8] Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2016. Coverage-based greybox fuzzing as Markov chain. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 1032–1043.
    [9] Marcel Böhme, László Szekeres, and Jonathan Metzman. 2022. On the reliability of coverage-based fuzzer benchmarking. In International Conference on Software Engineering (ICSE). ACM/IEEE, 1621–1633.
    [10] Miguel Castro, Manuel Costa, and Tim Harris. 2006. Securing software by enforcing data-flow integrity. In Symposium on Operating Systems Design and Implementation (OSDI). USENIX, 147–160.
    [11] Marcos Lordello Chaim, Kesina Baral, Jeff Offutt, Mario Concilio, and Roberto P. A. Araujo. 2021. Efficiently finding data flow subsumptions. In IEEE International Conference on Software Testing, Verification and Validation (ICST). 94–104.
    [12] Peng Chen and Hao Chen. 2018. Angora: Efficient fuzzing by principled search. In IEEE Symposium on Security and Privacy (SP). 711–725.
    [13] T. Y. Chen and C. K. Low. 1995. Dynamic data flow analysis for C++. In Asia Pacific Software Engineering Conference (APSEC). IEEE, 22–28.
    [14] Xi Chen, Asia Slowinska, and Herbert Bos. 2016. On the detection of custom memory allocators in C binaries. Empirical Software Engineering 21, 3 (2016), 753–777.
    [15] Zheng Leong Chua, Yanhao Wang, Teodora Baluta, Prateek Saxena, Zhenkai Liang, and Purui Su. 2019. One engine to serve ’em all: Inferring taint rules without architectural semantics. In Network and Distributed Systems Security Symposium (NDSS). The Internet Society, 15.
    [16] Nicolas Coppik, Oliver Schwahn, and Neeraj Suri. 2019. MemFuzz: Using memory accesses to guide fuzzing. In IEEE International Conference on Software Testing, Verification and Validation (ICST). 48–58.
    [17] Baozeng Ding, Yeping He, Yanjun Wu, Alex Miller, and John Criswell. 2012. Baggy bounds with accurate checking. In IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). 195–200.
    [18] Zhen Yu Ding and Claire Le Goues. 2021. An empirical study of OSS-Fuzz bugs. In IEEE/ACM International Conference on Mining Software Repositories (MSR). 131–142.
    [19] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. 1987. The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems 9, 3 (1987), 319–349.
    [20] Andrea Fioraldi, Daniele Cono D’Elia, and Davide Balzarotti. 2021. The use of likely invariants as feedback for fuzzers. In USENIX Security Symposium (SEC). 2829–2846. https://www.usenix.org/conference/usenixsecurity21/presentation/fioraldi.
    [21] Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. 2020. AFL++: Combining incremental steps of fuzzing research. In USENIX Workshop on Offensive Technologies (WOOT). 12. https://www.usenix.org/conference/woot20/presentation/fioraldi.
    [22] Phyllis G. Frankl and Stewart N. Weiss. 1993. An experimental comparison of the effectiveness of branch testing and data flow testing. IEEE Transactions on Software Engineering 19, 8 (1993), 774–787.
    [23] Shuitao Gan, Chao Zhang, Peng Chen, Bodong Zhao, Xiaojun Qin, Dong Wu, and Zuoning Chen. 2020. GREYONE: Data flow sensitive fuzzing. In USENIX Security Symposium (SEC). 2577–2594. https://www.usenix.org/conference/usenixsecurity20/presentation/gan.
    [24] Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. 2018. CollAFL: Path sensitive fuzzing. In IEEE Symposium on Security and Privacy (SP). 679–696.
    [25] Harrison Green and Thanassis Avgerinos. 2022. GraphFuzz: Library API fuzzing with lifetime-aware dataflow graphs. In International Conference on Software Engineering (ICSE). ACM/IEEE, 1070–1081.
    [26] Ahmad Hazimeh, Adrian Herrera, and Mathias Payer. 2020. Magma: A ground-truth fuzzing benchmark. Proceedings of the ACM on Measurement and Analysis of Computing Systems 4, 3, Article 49 (2020), 29 pages.
    [27] Hadi Hemmati. 2015. How effective are code coverage criteria? In IEEE International Conference on Software Quality, Reliability and Security (QRS). 151–156.
    [28] John L. Henning. 2006. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News 34, 4 (2006), 1–17.
    [29] Adrian Herrera, Hendra Gunadi, Shane Magrath, Michael Norrish, Mathias Payer, and Antony L. Hosking. 2021. Seed selection for successful fuzzing. In ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 230–243.
    [30] Adrian Herrera, Mathias Payer, and Antony L. Hosking. 2021. Registered report: datAFLow—towards a data-flow-guided fuzzer. In Fuzzing Workshop (FUZZING). The Internet Society, 11.
    [31] Joseph R. Horgan and Saul London. 1991. Data flow coverage and the C language. In Symposium on Testing, Analysis, and Verification (TAV). ACM, 87–97.
    [32] Joseph R. Horgan, Saul London, and Michael R. Lyu. 1994. Achieving software quality with testing coverage measures. Computer 27, 9 (1994), 60–69.
    [33] Chin-Chia Hsu, Che-Yu Wu, Hsu-Chun Hsiao, and Shih-Kun Huang. 2018. INSTRIM: Lightweight instrumentation for coverage-guided fuzzing. In Workshop on Binary Analysis Research (BAR). The Internet Society, 7.
    [34] Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand. 1994. Experiments of the effectiveness of dataflow- and controlflow-based test adequacy criteria. In International Conference on Software Engineering (ICSE). ACM/IEEE, 191–200.
    [35] Yuseok Jeon, Priyam Biswas, Scott Carr, Byoungyoung Lee, and Mathias Payer. 2017. HexType: Efficient detection of type confusion errors for C++. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 2373–2387.
    [36] Min Gyung Kang, Stephen McCamant, Pongsin Poosankam, and Dawn Song. 2011. DTA++: Dynamic taint analysis with targeted control-flow propagation. In Network and Distributed Systems Security Symposium (NDSS). The Internet Society, 14.
    [37] Se-Won Kim, Xavier Rival, and Sukyoung Ryu. 2018. A theoretical foundation of sensitivity in an abstract interpretation framework. ACM Transactions on Programming Languages and Systems 40, 3 (2018), 44.
    [38] George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 2123–2138.
    [39] James Kukucka, Luís Pina, Paul Ammann, and Jonathan Bell. 2022. CONFETTI: Amplifying concolic guidance for fuzzers. In International Conference on Software Engineering (ICSE). ACM/IEEE, 438–450.
    [40] Yuwei Li, Shouling Ji, Yuan Chen, Sizhuang Liang, Wei-Han Lee, Yueyao Chen, Chenyang Lyu, Chunming Wu, Raheem Beyah, Peng Cheng, Kangjie Lu, and Ting Wang. 2021. UNIFUZZ: A holistic and pragmatic metrics-driven platform for evaluating fuzzers. In USENIX Security Symposium (SEC). 2777–2794. https://www.usenix.org/conference/usenixsecurity21/presentation/li-yuwei.
    [41] Zhengyang Liu and John Criswell. 2017. Flexible and efficient memory object metadata. In ACM SIGPLAN International Symposium on Memory Management (ISMM). 36–46.
    [42] LLVM Project. 2022. libFuzzer—a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html.
    [43] Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. 2019. MOPT: Optimized mutation scheduling for fuzzers. In USENIX Security Symposium (SEC). 1949–1966. https://www.usenix.org/conference/usenixsecurity19/presentation/lyu.
    [44] Valentin M. Manes, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J. Schwartz, and Maverick Woo. 2021. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering 47, 11 (2021), 2312–2331.
    [45] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60.
    [46] Nathan Mantel. 1966. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports 50, 3 (1966), 163–170.
    [47] Alessandro Mantovani, Andrea Fioraldi, and Davide Balzarotti. 2022. Fuzzing with data dependency information. In IEEE European Symposium on Security and Privacy (EuroS&P). 286–302.
    [48] Jonathan Metzman, László Szekeres, Laurent Simon, Read Sprabery, and Abhishek Arya. 2021. FuzzBench: An open fuzzer benchmarking platform and service. In European Software Engineering Conference and Symposium on Foundations of Software Engineering (ESEC/FSE). ACM, 1393–1403.
    [49] Barton P. Miller, Louis Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Communications of the ACM 33, 12 (1990), 32–44.
    [50] Matt Miller. 2019. Trends and Challenges in the Vulnerability Mitigation Landscape.
    [51] George C. Necula, Jeremy Condit, Matthew Harren, Scott McPeak, and Westley Weimer. 2005. CCured: Type-safe retrofitting of legacy software. ACM Transactions on Programming Languages and Systems 27, 3 (2005), 477–526.
    [52] Neelofar, Kate Smith-Miles, Mario Andrés Muñoz, and Aldeida Aleti. 2022. Instance space analysis of search-based software testing. IEEE Transactions on Software Engineering (2022), 1–20.
    [53] Carlos Oliveira, Aldeida Aleti, Lars Grunske, and Kate Smith-Miles. 2018. Mapping the effectiveness of automated test suite generation techniques. IEEE Transactions on Reliability 67, 3 (2018), 771–785.
    [54] Sebastian Poeplau and Aurélien Francillon. 2020. Symbolic execution with SymCC: Don’t interpret, compile! In USENIX Security Symposium (SEC). 181–198. https://www.usenix.org/conference/usenixsecurity20/presentation/poeplau.
    [55] Sandra Rapps and Elaine J. Weyuker. 1985. Selecting software test data using data flow information. IEEE Transactions on Software Engineering 11, 4 (1985), 367–375.
    [56] Matt Ruhstaller and Oliver Chang. 2018. A New Chapter for OSS-Fuzz. https://security.googleblog.com/2018/11/a-new-chapter-for-oss-fuzz.html.
    [57] Christopher Salls, Aravind Machiry, Adam Doupe, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. 2020. Exploring abstraction functions in fuzzing. In IEEE Conference on Communications and Network Security (CNS). 1–9.
    [58] Kostya Serebryany. 2017. OSS-Fuzz—Google’s continuous fuzzing service for open source software. In USENIX Security Symposium (SEC). https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/serebryany.
    [59] Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitry Vyukov. 2012. AddressSanitizer: A fast address sanity checker. In USENIX Annual Technical Conference (ATC). 309–318. https://www.usenix.org/conference/atc12/technical-sessions/presentation/serebryany.
    [60] Dongdong She, Yizheng Chen, Baishakhi Ray, and Suman Jana. 2020. Neutaint: Efficient dynamic taint analysis with neural networks. In IEEE Symposium on Security and Privacy (SP). 1527–1543.
    [61] Dokyung Song, Julian Lettner, Prabhu Rajasekaran, Yeoul Na, Stijn Volckaert, Per Larsen, and Michael Franz. 2019. SoK: Sanitizing for security. In IEEE Symposium on Security and Privacy (SP). 1275–1295.
    [62] Ting Su, Ke Wu, Weikai Miao, Geguang Pu, Jifeng He, Yuting Chen, and Zhendong Su. 2017. A survey on data-flow testing. ACM Computing Surveys 50, 1, Article 5 (2017), 35 pages.
    [63] Yulei Sui and Jingling Xue. 2016. SVF: Interprocedural static value-flow analysis in LLVM. In Compiler Construction (CC). ACM, 265–266.
    [64] William N. Sumner and Xiangyu Zhang. 2010. Memory indexing: Canonicalizing addresses across executions. In ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE). 217–226.
    [65] Robert Swiecki. 2016. honggfuzz. http://honggfuzz.com/.
    [66]
    [67] The Clang Team. 2022. Source-based Code Coverage. https://clang.llvm.org/docs/SourceBasedCodeCoverage.html.
    [68] Jonas Benedict Wagner. 2017. Elastic Program Transformations: Automatically Optimizing the Reliability/Performance Trade-off in Systems Software. Ph.D. Dissertation. EPFL.
    [69] Jinghan Wang, Yue Duan, Wei Song, Heng Yin, and Chengyu Song. 2019. Be sensitive and collaborative: Analyzing impact of coverage metrics in greybox fuzzing. In USENIX International Symposium on Research in Attacks, Intrusions and Defenses (RAID). 1–15. https://www.usenix.org/conference/raid2019/presentation/wang.
    [70] Yanhao Wang, Xiangkun Jia, Yuwei Liu, Kyle Zeng, Tiffany Bao, Dinghao Wu, and Purui Su. 2020. Not all coverage measurements are equal: Fuzzing by coverage accounting for input prioritization. In Network and Distributed Systems Security Symposium (NDSS). The Internet Society, 17.
    [71] Bin Xin, William N. Sumner, and Xiangyu Zhang. 2008. Efficient program execution indexing. In ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI). 238–248.
    [72] Wen Xu, Sanidhya Kashyap, Changwook Min, and Taesoo Kim. 2017. Designing new operating primitives to improve fuzzing performance. In ACM SIGSAC Conference on Computer and Communications Security (CCS). 2313–2328.
    [73] Michał Zalewski. 2015. American Fuzzy Lop (AFL). http://lcamtuf.coredump.cx/afl/.
    [74] Zenong Zhang, Zach Patterson, Michael Hicks, and Shiyi Wei. 2022. FIXREVERTER: A realistic bug injection methodology for benchmarking fuzz testing. In USENIX Security Symposium (SEC). 3699–3715. https://www.usenix.org/conference/usenixsecurity22/presentation/zhang-zenong.
