RCABench: Open Benchmarking Platform for Root Cause Analysis
Keisuke Nishimura, Yuichi Sugiyama, Yuki Koike,
Masaya Motoda, Tomoya Kitagawa, Toshiki Takatera, and Yuma Kurogome
arXiv:2303.05029v2 [cs.CR] 10 Mar 2023
Ricerca Security, Inc.
Abstract—Fuzzing has contributed to automatically identifying bugs and vulnerabilities in the software testing field. Although
it can efficiently generate crashing inputs, these inputs are usually
analyzed manually. Several root cause analysis (RCA) techniques
have been proposed to automatically analyze the root causes of
crashes to mitigate this cost. However, outstanding challenges
for realizing more elaborate RCA techniques remain unknown
owing to the lack of extensive evaluation methods over existing
techniques. With this problem in mind, we developed an endto-end benchmarking platform, RCABench, that can evaluate
RCA techniques for various targeted programs in a detailed
and comprehensive manner. Our experiments with RCABench
indicated that the evaluations in previous studies were not enough
to fully support their claims. Moreover, this platform can be
leveraged to evaluate emerging RCA techniques by comparing
them with existing techniques.
I. INTRODUCTION
Fuzzing has contributed to automatically identifying bugs
and vulnerabilities in the software testing field; it is the
process of randomly generating inputs and providing them to a
program to find crashing inputs [1], [2]. Fuzzing is simple and
easy to deploy compared to other testing methods, and it can
also automatically and efficiently find crashing inputs, helping
an enormous number of software programs improve their
quality. In fact, OSS-Fuzz [3], a popular fuzzing infrastructure,
is used in more than 650 open-source software (OSS) projects
and has found more than 40,500 bugs and vulnerabilities [4].
However, crash analysis, which is required after fuzzing,
is difficult and can be a bottleneck in making software testing
scalable. We must manually analyze these inputs later because
fuzzers only generate crashing inputs. The cost of analyzing
the crashing inputs generated by fuzzers is very high for two
reasons. First, fuzzers sometimes generate numerous crashing
inputs, the root causes of which are the same. For example,
a fuzzer generated more than 254,000 crashing inputs for
39 unique bugs in an experiment conducted in a previous
study [5]. Second, fuzzers randomly generate crashing inputs,
which means that the inputs can contain a significant amount of
noise. In other words, these inputs can include byte sequences that are not essentially related to the crash causes and hence can be removed. During crash analysis, analysts must determine which parts of the inputs are related to the causes, a cost that grows as the noise increases.

Workshop on Binary Analysis Research (BAR) 2023
3 March 2023, San Diego, CA, USA
ISBN 1-891562-84-3
https://dx.doi.org/10.14722/bar.2023.23004
www.ndss-symposium.org
To reduce such costs in crash analysis, various automated
techniques have been proposed in the field of triage [2], [5],
[6], [7], [8]. Triage is the process of analyzing and reporting
inputs that cause crashes [2]. In triage research, root cause
analysis (RCA) has attracted particular attention in recent
years. RCA, also known as localization or fault localization,
is the process of identifying lines, basic blocks, or conditions
related to the root cause of a crash caused by a crashing input;
it provides developers with hints about the root cause. While
there are several different RCA approaches [9], [10], [11], this
study focuses on statistical fault localization, which infers the
program states correlated with the crash causes by contrasting
the execution of crashing and non-crashing inputs.
However, RCA is relatively underdeveloped among triage
topics. As discussed in detail later, state-of-the-art RCA techniques are not infallible. For example, in DeFault [8], the
average false positive rate was 9.2%. Furthermore, RCA is not widely applied in industry, whereas deduplication, another triage technique, is integrated into major fuzzers such as AFL [12] and honggfuzz [13]. Thus, the existing RCA techniques are neither accurate nor practical enough to be widely used in the real world. We argue that this is because RCA techniques have not been thoroughly improved, owing to undiscovered outstanding challenges, that is, the problems that must be solved to realize more elaborate RCA techniques. One cause keeping such challenges unknown is the lack of extensive evaluation methods for existing techniques. We found that the following three points have gone unnoticed and should be considered to realize extensive evaluation methods:
Non-uniqueness of root cause definition There can be
several possible patches to fix a complex bug. If we define (the
location in the source code of) the root cause as the location
that should be fixed, there can be multiple candidates for them.
It is not obvious to evaluators where the root causes lie, although defining them is necessary for evaluating RCA techniques.
Despite this vagueness, existing techniques [6], [7], [8] did not
fully disclose the ground truth of their evaluations. This makes
it difficult to reproduce the experiments in existing studies.
Decoupling RCA steps We found that the existing techniques
consist of two separable steps: data augmentation and feature
extraction. However, these techniques have not been evaluated
separately to determine the performance of each step.
Variance-aware evaluation for data augmentation The existing techniques augment data using various fuzzing methods,
particularly those that are altered for data augmentation. The
evaluations need to consider the random nature of fuzzing.
B. Feature Extraction
Feature extraction analyzes the root cause using a dataset
generated by data augmentation. This process consists of two
steps. First, an analyzer records the state of the targeted
program at runtime while executing it with each input in the
dataset. For example, in Aurora, executed instructions and
variable values are recorded. Next, the analyzer compares
traces between the crashing and non-crashing inputs and
statistically infers their differences. This difference is indicated
as the root cause. The existing techniques estimate the lines or
basic blocks related to the root cause. Aurora also estimates
predicates, that is, simple Boolean expressions that represent
the conditions to be met before a crash occurs. When actually
used, analyzers do not report only the most likely root cause
candidate but instead multiple candidates in descending order
of the level of confidence that analyzers assign to them. In this study, we denote the VulnLoc and Aurora analyzers as VulnLocFE and AuroraFE, respectively.
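The statistical comparison in this step can be sketched as follows. This is our own minimal illustration of statistics-based fault localization, not the actual VulnLocFE or AuroraFE implementation; the function name, the toy traces, and the scoring formula are all ours:

```python
from collections import Counter

def rank_root_cause_candidates(crash_traces, noncrash_traces):
    """Rank basic-block addresses by how strongly their presence
    correlates with crashing executions (higher = more suspicious)."""
    crash_hits, noncrash_hits = Counter(), Counter()
    for trace in crash_traces:        # each trace: iterable of executed block addresses
        crash_hits.update(set(trace))
    for trace in noncrash_traces:
        noncrash_hits.update(set(trace))
    n_crash, n_noncrash = len(crash_traces), len(noncrash_traces)
    scores = {
        block: crash_hits[block] / n_crash - noncrash_hits[block] / n_noncrash
        for block in set(crash_hits) | set(noncrash_hits)
    }
    # Report candidates in descending order of confidence, as RCA tools do.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: block 0x42 appears only in crashing traces, so it ranks first.
crashing = [[0x10, 0x42, 0x99], [0x10, 0x42]]
noncrashing = [[0x10, 0x99], [0x10]]
print(hex(rank_root_cause_candidates(crashing, noncrashing)[0]))  # 0x42
```

Real analyzers use richer statistics (e.g., Aurora also scores predicates over variable values), but contrasting crashing against non-crashing traces is the common core.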
Considering these three points, we developed RCABench,
an end-to-end benchmarking platform, to reveal the challenges
of RCA. We provide a detailed and comprehensive evaluation
of existing techniques for various targets and find cases in which existing techniques cannot analyze correctly. Moreover,
this platform can be used to evaluate the new RCA techniques
proposed in the future by comparing them with existing
techniques.
Overall, the main contributions of this work are as follows:
• We present three problems in the evaluation methods of existing RCA studies.

• We developed RCABench, an open-source benchmarking platform1; it provides a more standardized evaluation and helps to summarize the outstanding challenges in RCA.

• Through experiments with RCABench, we identified several insights into the pitfalls of existing techniques and provided examples to motivate further research.

To better illustrate this step, we take as an example CVE-2016-10094 in LibTIFF, an open-source library. As shown in Listing 1, this vulnerability causes a heap buffer overflow owing to an off-by-one error. Specifically, the program crashes when the variable count is four and the statements inside the patched if statement are executed. Generally, we refer to this if statement as the root cause location and "count == 4" as the root cause predicate. If this if statement appears frequently in the program traces for crashing inputs and infrequently in those for non-crashing inputs, the statement can be identified as the root cause location. Similarly, if there is a distinguishable difference in the value of the variable count between the two sets of traces, the predicate can be identified.
II. ROOT CAUSE ANALYSIS
Root cause analysis (RCA), also known as localization or
fault localization, is a process of automatically identifying
lines, basic blocks, or conditions related to the root cause of
a crash; it aids in debugging and reduces the cost of the crash
analysis. We analyzed state-of-the-art RCA techniques [6], [7],
[8] and identified two separable processes that are common to
all. In this study, we refer to these as data augmentation and
feature extraction2 . In this section, we describe these in detail.
Listing 1: Developer patch for CVE-2016-10094 in LibTIFF.

      if (TIFFGetField(input, TIFFTAG_JPEGTABLES,
                       &count, &jpt) != 0) {
    -     if (count >= 4) {
    +     if (count > 4) {
              int retTIFFReadRawTile;
              TIFFmemcpy(buffer, jpt, count - 2);

A. Data Augmentation

Data augmentation is the process of generating new crashing and non-crashing inputs from a given crashing input; this is the first process in RCA, and the generated inputs are used as the dataset for feature extraction. Therefore, the quality of the dataset affects the RCA results.
III. CHALLENGES IN RCA EVALUATION
In this section, we describe three previously unconsidered
points which are imperative to extensive RCA evaluations.
The existing techniques augment inputs using various
fuzzing methods that are specially altered for data augmentation. For example, Aurora [6] uses the crash exploration mode
provided by AFL [12], a typical coverage-guided fuzzer. In
this study, we refer to this as AFLcem. VulnLoc [7] proposed
ConcFuzz, a directed fuzzer for efficiently generating inputs
that exercise execution paths in the neighborhood of the path
taken by a given crashing input, aiming at higher-quality augmentation. These methods use a single crashing input as the
initial seed and automatically generate crashing/non-crashing
inputs by randomly mutating it. In fuzzing, an initial seed is
an input provided at the beginning of a fuzzing campaign.
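The shared idea behind these augmentation methods can be sketched as below. This is a hypothetical, simplified mutator, not AFLcem or ConcFuzz; `run_target` stands in for actually executing the instrumented target, and all names are ours:

```python
import random

def augment(seed: bytes, run_target, n_inputs: int, flips: int = 1):
    """Mutate a single crashing seed and label each mutant by executing
    the target; returns (crashing, non_crashing) datasets."""
    crashing, non_crashing = [seed], []
    for _ in range(n_inputs):
        data = bytearray(seed)
        for _ in range(flips):                     # flip a few random bytes
            pos = random.randrange(len(data))
            data[pos] ^= random.randrange(1, 256)  # never a no-op flip
        mutated = bytes(data)
        # run_target is an oracle returning True iff the input crashes the target
        (crashing if run_target(mutated) else non_crashing).append(mutated)
    return crashing, non_crashing

# Toy oracle standing in for the real target: "crashes" iff byte 0 equals 4
# (cf. the count == 4 root cause predicate from Listing 1).
oracle = lambda b: b[0] == 4
crash, nocrash = augment(bytes([4, 0, 0, 0]), oracle, n_inputs=200)
```

AFLcem and ConcFuzz differ in how mutations are guided (coverage feedback versus the neighborhood of the crashing path), but both produce a labeled dataset from one crashing seed in this manner.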
A. Non-uniqueness of Root Cause Definition
Sometimes, there are multiple ways to fix a bug; suppose
that function B triggers a bug when it processes the data
produced by function A because the produced data conform to
rule X, whereas B expects rule Y. In this case, we can make
A comply with Y or B comply with X.
In such cases, if we define the root cause locations as the
locations in the source code that should be fixed, there can be
multiple candidates for root cause locations. Therefore, it is
difficult to correctly include all of them as the ground truth in
RCA evaluations. Evaluators currently define the ground truth manually by enumerating all the possible patches; hence, they sometimes miss some root cause locations or use differing ground truths.
1 https://github.com/RICSecLab/RCABench
2 Although these processes were not explicitly defined in the RCA study,
inspired by similar efforts in the machine learning field, we refer to them as
data augmentation and feature extraction.
Data augmentation time The existing studies did not evaluate
RCA techniques with various values of the time spent in data
augmentation. For example, in Aurora, the data augmentation
time was fixed to only one value, either 2 or 12 h, depending
on targeted programs. However, the data augmentation time
can affect RCA results in multiple ways. We can generate a
dataset with a larger amount of crashing/non-crashing inputs
by spending more time on fuzzing; this may increase dataset diversity and improve the accuracy of feature extraction. However, overfitting may also occur, as in data augmentation for machine learning, making the accuracy worse.
To illustrate more simply that there are multiple root cause
locations for a bug, we take CVE-2017-15232 in Libjpeg as
an example; this vulnerability causes a null pointer dereference owing to missing code that checks for a null pointer. There are several possible fixes for this vulnerability, as shown in Listings 2 and 3. The first, shown in Listing 2, inserts a null pointer check before the for statement. Another, shown in Listing 3, does the same at the beginning of the for statement's body. Thus, the root
cause location is not uniquely determined, and identifying all
the candidates is difficult; this can occur frequently with bugs
whose root cause is the absence of code.
Initial seed The existing studies prepared only one specific
crashing input as an initial seed for fuzzing in data augmentation. For example, in VulnLoc, the initial seed is the input
used as a proof-of-concept when reporting vulnerabilities. In
fuzzing, the difference in the initial seeds is known to affect
performance, such as coverage and bug finding [14], [15].
Data augmentation using fuzzing may also affect the accuracy
of feature extraction; therefore, the evaluator should prepare
several initial seeds. In addition, the existing studies have not
focused on the characteristics of initial seeds. For example, the
initial seed generated by a fuzzer tends to be noisier and more
complex than that generated by an analyst manually.
The existence of multiple root cause locations makes it difficult to determine the ground truth and prevents the evaluation
results from being identical. In addition, the existing studies
did not fully disclose the ground truth, making it difficult for evaluators to reproduce the existing experiments accurately.
Listing 2: Developer patch for CVE-2017-15232 in Libjpeg.

    +     if (output_buf == NULL && num_rows)
    +       ERREXIT(cinfo, JERR_BAD_PARAM);
          for (row = 0; row < num_rows; row++) {
            jzero_far((void *)output_buf[row],
                      (size_t)(width * sizeof(JSAMPLE)));
Fuzzing randomness The existing studies have not considered the randomness of fuzzing in data augmentation. They
evaluated each method using only a dataset from a single
fuzzing run. However, fuzzing is a highly stochastic process.
Hence, the generated dataset changes with each run, which can
affect the results of RCA. RCA techniques must be evaluated
multiple times to address this problem. In fuzzing studies, it is already standard practice to run fuzzers multiple times
and evaluate the results statistically if possible [16]. The same
approach is required in RCA studies.
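Such a variance-aware evaluation could, for instance, aggregate the rank of the true root cause over repeated trials instead of reporting a single run. The helper below is our own sketch, and the trial data are invented:

```python
import statistics

def summarize_trials(ranks):
    """Summarize the rank of the true root cause across repeated RCA
    trials; None models "-" (location not reported at all)."""
    found = [r for r in ranks if r is not None]
    return {
        "trials": len(ranks),
        "found": len(found),
        "median_rank": statistics.median(found) if found else None,
        "stdev": statistics.stdev(found) if len(found) > 1 else 0.0,
    }

# Five hypothetical trials of one technique on one target: a single
# unusually good or bad trial should not dominate the reported accuracy.
print(summarize_trials([1, 1, 2, 15, None]))
```

Reporting a median with a spread (or, with enough trials, a statistical test) makes cherry-picking a lucky run visible.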
Listing 3: Another developer patch for CVE-2017-15232 in Libjpeg.

          for (row = 0; row < num_rows; row++) {
    +       if (output_buf == NULL)
    +         ERREXIT(cinfo, JERR_BAD_PARAM);
            jzero_far((void *)output_buf[row],
                      (size_t)(width * sizeof(JSAMPLE)));
B. Decoupling Data Augmentation and Feature Extraction
As described in Section II, we found that state-of-the-art
RCA methods [6], [7], [8] consist of two separable steps: data
augmentation and feature extraction. However, the evaluations
in these previous studies did not decouple the data augmentation and feature extraction. Evaluators should investigate
the performance of each process independently because these
are two separable steps. For example, VulnLoc [7] proposed
ConcFuzz as a data augmentation method but did not evaluate
its relative performance by replacing it with AFLcem, an
existing alternative algorithm. In other words, it has not been
fully confirmed that ConcFuzz generates datasets of higher
quality than AFLcem. Thus, the pure performance achieved
by each step of proposed methods was not measured.
IV. PROPOSAL: RCABENCH
We propose RCABench, an end-to-end benchmarking platform that can run RCA techniques on selected bugs and check
whether their results match the predefined locations of root
causes3. The design of RCABench was motivated by the
insights described in Section III. For each RCA technique,
the data augmentation and feature extraction steps were decoupled, which enabled the comparison and evaluation of the
augmentation and extraction methods separately.
Currently, RCABench supports two augmentation methods and two extraction methods. The available augmentation
methods are AFLcem, used in Aurora [6], and ConcFuzz4 ,
proposed in VulnLoc [7]. The available extraction methods are
the AuroraFE and VulnLocFE. We decoupled the augmentation
and extraction steps and abstracted their interfaces so that each
augmentation method could be connected to each extraction
method interchangeably since the original implementations
of AuroraFE and VulnLocFE are incompatible with ConcFuzz and AFLcem, respectively. Consequently, RCABench
C. Variance-aware Evaluation of Data Augmentation
In existing studies [6], [7], [8], evaluations did not consider
the variable characteristics of data augmentation. As described
in Section II, data augmentation generates a dataset for feature
extraction using fuzzing. Therefore, the quality of the dataset
may depend on the configuration of fuzzers, such as the initial
seeds and duration of a fuzzing campaign. The existing studies
have not dealt with this concern and have not been able to
evaluate the impact of data augmentation on RCA results
in a variance-aware manner. Specifically, the following three
variables should be considered:
3 Some RCA techniques (e.g., VulnLoc and Aurora) indicate the candidates
for root cause locations as pairs of addresses of an assembly instruction and
their corresponding source line numbers. RCABench uses line numbers for
the check because the addresses of instructions are too fine-grained to decide
whether the address is a root cause location.
4 For ConcFuzz, the time spent in saving its internal data at the end is not
included in the augmentation time.
can evaluate previously untested combinations of AFLcem ×
VulnLocFE and ConcFuzz × AuroraFE. Note that we used
only the root cause locations inferred by AuroraFE to compare
the performance of the techniques. Supporting and evaluating
the root cause predicates included in the outputs of AuroraFE
are left for future work.
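The decoupling can be pictured as two abstract interfaces that plug together. This is an illustrative sketch in the spirit of RCABench's design, not its actual API; all class and function names here are ours:

```python
from abc import ABC, abstractmethod

class Augmenter(ABC):
    """Data augmentation step: one crashing seed in, labeled dataset out."""
    @abstractmethod
    def augment(self, seed: bytes) -> tuple[list[bytes], list[bytes]]: ...

class FeatureExtractor(ABC):
    """Feature extraction step: labeled dataset in, ranked locations out."""
    @abstractmethod
    def extract(self, crashing, non_crashing) -> list[int]: ...

def run_rca(augmenter: Augmenter, extractor: FeatureExtractor, seed: bytes) -> list[int]:
    # Any augmenter (AFLcem-like, ConcFuzz-like) combines with any
    # extractor (AuroraFE-like, VulnLocFE-like), giving a 2x2 matrix.
    crashing, non_crashing = augmenter.augment(seed)
    return extractor.extract(crashing, non_crashing)

class DummyAug(Augmenter):
    def augment(self, seed):
        return [seed], [seed[:-1]]   # trivial stand-in dataset

class DummyFE(FeatureExtractor):
    def extract(self, crashing, non_crashing):
        return [0x401000]            # pretend-ranked candidate location

print(run_rca(DummyAug(), DummyFE(), b"\x04\x00"))
```

Abstracting the dataset hand-off this way is what makes the previously untested combinations evaluable at all.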
TABLE I: Results of four RCA techniques with different data augmentation times.
RCABench provides multiple popular real-world programs
containing actual bugs and vulnerabilities as targets of RCA.
Currently, seven targets have been prepared, all of which were
used in the evaluations of Aurora and VulnLoc [6], [7]. We
list them and summarize their root causes in Table II. Our criteria for selecting targets are the availability of the source code and the diversity of root and crash causes. As discussed in the previous section, the selection of root cause locations can vary and be biased. Therefore, we first registered with RCABench several reasonable candidates for the root cause location of each target as ground truth. For stable re-evaluation, RCABench
publicly exposes these root cause locations, along with their
brief explanations. RCABench also includes one or more initial
seeds for each target to support augmentation methods that
require a crashing input as an initial seed. For targets with
multiple seeds available, we selected a crashing input used in
bug disclosure or explanation to the developers as the baseline.
V. BENCHMARK RESULTS
This section describes the results of the proposed benchmark RCABench. Through benchmarking, we answered the
following questions:
• RQ1: Which RCA techniques can perform accurate analysis on each bug?
• RQ2: Does the increase in data augmentation time improve accuracy?
• RQ3: Do initial seeds affect accuracy?
• RQ4: Does the randomness of data augmentation affect accuracy?
#    Program   D.A. Time   A × A   C × A   A × V   C × V
#1   LibTIFF   15 m          15      9       2      13
               2 h            9     33       2      12
               4 h            9     47       2      12
#2   Libjpeg   15 m           –      –      23      32
               2 h            –     15      12      23
               4 h            –     14      12      17
#3   Libjpeg   15 m          16      –       2       1
               2 h            7      –       2       1
               4 h            6      –       2       1
#4   Libxml2   15 m          28      –      19      16
               2 h           28      –      23      16
               4 h           27      –      19      17
#5   mruby     15 m          41    105      59      11
               4 h           30     60       –      45
               12 h          31     60       –      45
#6   readelf   15 m           1      4       4       4
               2 h            1      1       4       4
               4 h            1      1       4       4
#7   Lua       15 m           –      –       1       1
               4 h            –    N/A       –     N/A
               12 h          32    N/A       –     N/A

"A × A" means AFLcem × AuroraFE, "C × A" means ConcFuzz × AuroraFE, "A × V" means AFLcem × VulnLocFE, and "C × V" means ConcFuzz × VulnLocFE.
"N/A": No data were obtained. "–": The root cause location did not appear in the candidates reported by an RCA technique.
candidates, in accordance with the original paper. "–" indicates that the correct answer was not included in the candidates produced by the RCA technique. "N/A" indicates that no data could be obtained because the technique tried to produce extremely large files or spent too long on file I/O, which our limited machine resources could not handle.
Table I indicates that no technique can predict the root
cause locations with high accuracy for all targets: while ConcFuzz × VulnLocFE successfully inferred the correct location
of the root cause in Target #3 with the highest accuracy, its
predictions for Targets #6 were less accurate than AFLcem
× AuroraFE. Our newly tested combination, AFLcem ×
VulnLocFE, outperformed the existing methods for Targets
#1 and #2. However, they failed to find the root causes of
some other targets. This result implies that the characteristics
of the targets, which can be analyzed with high accuracy,
would be different for each RCA technique. This up-anddown situation depending on the targeted programs is similar
to fuzzer benchmarking, as seen in some results [17], [18].
All results shown here were obtained on a 256-CPU (AMD
EPYC 7742) machine with 2TB memory and Ubuntu 20.04
operating system. To investigate the relationship between the
data augmentation time and the accuracy of RCA techniques,
we ran the data augmentation process up to an imposed time limit; at each of the first 5, 15, 30, and 45 minutes and every hour thereafter, RCABench saved the dataset produced by data augmentation at that point and analyzed the root cause with it. For Targets #1-4 and #6, we set the time limit to 4 h, and for Targets #5 and #7, we extended it to 12 h in accordance with the evaluation of Aurora [6].
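The snapshot schedule just described can be written down as a small helper; this is our own sketch of the timing, not RCABench code:

```python
def checkpoint_minutes(limit_hours: int) -> list[int]:
    """Minutes at which the augmented dataset is snapshotted:
    5, 15, 30, 45 min, then every hour up to the time limit."""
    return [5, 15, 30, 45] + [60 * h for h in range(1, limit_hours + 1)]

print(checkpoint_minutes(4))   # [5, 15, 30, 45, 60, 120, 180, 240]
```

The dense early checkpoints capture the phase where the dataset changes fastest; a 12-hour limit simply extends the hourly tail.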
Answer: The technique that gave the highest rank to the
correct root cause was different for each bug, and there was
no universal technique that was most accurate.
A. Which RCA techniques can perform accurate analysis on
each bug? (RQ1)
B. Does the increase in data augmentation time improve
accuracy? (RQ2)
Table I presents the overall comparisons of RCA techniques
for each target. The numbers in the table indicate the rank
of the correct answer (actual root cause location) among the
location candidates reported by each RCA technique, ordered
by the level of confidence assigned by the technique. A lower
rank indicates that the RCA technique can infer the root
cause location more accurately (“1” is the best score). In
this experiment, we set VulnLocFE to report up to Top-200
Next, we compared the results of each technique for each
target with different data augmentation times. Table I lists
the results in three data augmentation times (15m, 2h, 4h for
Targets #1-4,6 and 15m, 4h, 12h for Targets #5,7) for each
technique and target pair. We also show two detailed examples
of how accuracy changes with time in Figure 1. In 19 cases,
out of the 26 results excluding “N/A”, the accuracy remained
the same (e.g., Target #6 except ConcFuzz × AuroraFE)
[Fig. 1: Accuracy vs. data augmentation time. (a) Target #3, (b) Target #1. Lines: AFLcem × AuroraFE and ConcFuzz × AuroraFE.]

[Fig. 3: The transitions of the accuracy over time for different initial seeds. (a) Target #4, (b) Target #6. A × A and C × V denote AFLcem × AuroraFE and ConcFuzz × VulnLocFE, respectively. Seeds: 366, 803, and 3000 bytes for Target #4; 324, 1247, and 4825 bytes for Target #6. In Targets #4 and #6, the 803-byte and 324-byte seeds are their original seeds, respectively.]

#6 even in 15m, considering that a large number of samples usually leads to high accuracy in classifying tasks.

C. Do initial seeds affect accuracy? (RQ3)
To answer RQ3, we investigated whether the accuracy changes depending on the initial seed. For this purpose, we first ran AFLcem against Targets #4 and #6 to produce various crashing inputs. Then, for each target, we randomly selected two inputs of different lengths from the produced inputs as initial seeds and evaluated the RCA techniques with them. We
selected these two targets because we could find the crashing
inputs that were much smaller or larger than the original
crashing input. Thus, we believe that the newly produced initial
seeds were very different from the original ones, in terms of
seed size and the method of producing them. Note that most
of the original seeds were created manually, which may make
a significant difference between the original and new seeds
with regard to whether noise exists in them, as described in
Section I.
[Fig. 2: The number of generated inputs over time. (a) Target #1, (b) Target #6. A × A and C × A denote AFLcem × AuroraFE and ConcFuzz × AuroraFE, respectively.]
or improved as the data augmentation time increased (e.g.,
AFLcem × AuroraFE on Target #3 in Figure 1a).
However, the increase in data augmentation time worsened
the accuracy in some pairs of RCA techniques and targets,
such as ConcFuzz × VulnLocFE on Target #5. Figure 1b
shows an example of the deteriorating trend of ConcFuzz ×
AuroraFE on Target #1. In this example, the highest accuracy
was achieved up to 15m, and its accuracy declined thereafter.
Answer: While the accuracy improved or did not change
over time in many cases, there were a few cases in which the
accuracy was degraded.
Figure 3 shows the accuracy versus data augmentation time for different initial seeds. In Target #6, the accuracy was little affected by the initial seed for both AFLcem × AuroraFE and ConcFuzz × VulnLocFE, whereas the two added initial seeds for Target #4 affected the accuracy of ConcFuzz × VulnLocFE, one of them having a particularly significant impact. This result is consistent with the fact
that the performance of a fuzzer can be affected by the initial
seeds [14], [15].
Thus, the data augmentation time sometimes affects the accuracy, which indicates that it affects the quality of the datasets produced by data augmentation. To analyze how, we inspected how the number and ratio of samples (i.e., crashing/non-crashing inputs) in a dataset change as the data augmentation time increases. Figure 2 plots the number of samples versus data augmentation time for Targets #1 and #6.
Answer: The difference in initial seeds sometimes affects
accuracy. This implies that evaluators should make the initial
seeds public to avoid cherry-picking and for reproducibility.
Looking at the ratio of samples produced by ConcFuzz × AuroraFE in Figure 2a, we see that the number of non-crashing inputs starts to exceed that of crashing inputs considerably within one hour. This forces feature extraction methods to identify root cause locations from imbalanced datasets of crashing/non-crashing inputs, a situation very similar to imbalanced data classification in the machine learning field [19]. Generally, imbalanced data can cause poor accuracy in such classification tasks. Indeed, in Table I, the accuracy of ConcFuzz × AuroraFE perceptibly decreases as the ratio becomes imbalanced.
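A toy calculation illustrates why imbalance hurts: with few crashing samples, the crash-side hit rate moves in coarse steps, so unrelated blocks can overtake the true root cause in a frequency-difference score. The numbers and the scoring function below are our own illustration, not the AuroraFE computation:

```python
def hit_rate_gap(crash_hits, n_crash, noncrash_hits, n_noncrash):
    """Difference between a block's hit rate in crashing vs non-crashing runs."""
    return crash_hits / n_crash - noncrash_hits / n_noncrash

# Balanced dataset: 100 crashing vs 100 non-crashing runs.
root_cause  = hit_rate_gap(100, 100, 5, 100)   # 0.95: clear signal
noise_block = hit_rate_gap(60, 100, 40, 100)   # 0.20: clearly weaker

# Imbalanced dataset: 5 crashing vs 1000 non-crashing runs. Crash-side
# rates now move in steps of 1/5 = 0.2, so any block that happens to
# appear in all 5 crashing runs can overtake the true root cause.
root_cause_imb  = hit_rate_gap(5, 5, 50, 1000)   # 0.95
noise_block_imb = hit_rate_gap(5, 5, 40, 1000)   # 0.96: now ranked higher
print(noise_block_imb > root_cause_imb)  # True
```

The minority class, not the total sample count, limits the resolution of the statistic.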
D. Does the randomness of data augmentation affect accuracy? (RQ4)
We observed the effect of randomness on the accuracy by evaluating each technique five times. For ConcFuzz, different
seeds of its random number generator were set in each trial.
Consequently, we observed some non-negligible variances
as predicted in Section III, while the accuracy was very stable
in some targets. Figure 4 shows the results of AFLcem ×
AuroraFE and ConcFuzz × VulnLocFE in Target #1 and #6.
While both RCA techniques in Target #6 and ConcFuzz ×
VulnLocFE in Target #1 had little divergence in their accuracy,
Another suggestive fact is that the numbers of samples for Target #6, shown in Figure 2b, are large from the beginning compared with those of Target #1. This would perhaps have enabled all the RCA techniques to achieve high accuracy in Target
programs. To prepare more target programs with diverse root causes, the programs constituting benchmarks known in the fuzzing field [20], [21], [17], [22] would be reasonable
candidates. In particular, Magma [21] and FuzzBench [22]
provide a suite of programs that contain bugs found in the real
world and are widely used. Evaluating techniques with these
programs would further clarify their practical effectiveness.
B. Possible Improvement of Existing Techniques
In our experiments, we found some cases where RCA techniques failed to analyze the root causes with high accuracy owing to their nature. A striking example is that ConcFuzz × AuroraFE ranked the root cause location of Target #1 (CVE-2016-10094) very low in RQ1 at data augmentation times of 2 h and 4 h; this is probably because the ideal difference between its generated crashing and non-crashing inputs is whether a certain variable holds a certain value, whereas ConcFuzz focuses on control flow, so the generated inputs did not have enough diversity of values. This suggests the possibility of further improving data augmentation methods by considering features other than control flow.
[Fig. 4: The effect of the randomness in data augmentation on the accuracy. Five attempts were performed. (a) Target #1, (b) Target #6.]
AFLcem × AuroraFE showed significant divergence in the
accuracy in Target #1. Specifically, one of the five trials outperformed the others. If one cherry-picked only this trial, it could
be concluded that AFLcem × AuroraFE was more accurate
than ConcFuzz × VulnLocFE, although the two techniques
achieved similar accuracies on average; this suggests that it is
important to evaluate RCA techniques multiple times and, if
possible, perform statistical analysis of the results.
Answer: For some combinations of techniques and targets, randomness in data augmentation leads to non-negligible
variances in accuracy. This result suggests that experiments to
evaluate RCA techniques should be conducted multiple times
to reduce the effect of randomness as much as possible.
VI. DISCUSSION AND FUTURE WORK
Another concern about the existing techniques is that
implementation designs differ among them. For example, to
trace a program, Aurora [6] uses Intel PIN [23] and VulnLoc
uses DynamoRIO [24]. In addition, the data augmentation
method of VulnLoc is written in Python, whereas that of Aurora is written in C/C++. Thus, there is a
possibility that they have different performances owing to their
implementation methods. If VulnLoc were written in C/C++,
VulnLoc would be able to run faster. In the fuzzing field,
some frameworks have already been proposed so that different
algorithms can be implemented in a uniform way to solve
such a problem [25], [26]. For example, LibAFL [25] is a
framework for building fuzzers in a modular manner. LibAFL
can reduce the cost of combining multiple fuzzing algorithms
into a single fuzzer and can fairly and objectively evaluate the
algorithms within the common implementation. If a similar
modular framework is presented for RCA, researchers would
be able to take fairer evaluations. Also, it would allow them
to implement and evaluate a new algorithm more easily.
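To make the idea concrete, here is a minimal sketch of what such a modular RCA framework might look like; the interfaces and names below are hypothetical, invented for illustration rather than taken from LibAFL or any existing tool.

```python
from abc import ABC, abstractmethod

class Tracer(ABC):
    """Trace-collection backend (e.g., one subclass per Pin or DynamoRIO wrapper)."""
    @abstractmethod
    def trace(self, input_bytes: bytes) -> list[str]: ...

class Augmenter(ABC):
    """Data augmentation strategy deriving new inputs from a crashing one."""
    @abstractmethod
    def augment(self, crashing_input: bytes) -> list[bytes]: ...

class Localizer(ABC):
    """Localization algorithm that ranks candidate root cause locations."""
    @abstractmethod
    def localize(self, traces: list[list[str]]) -> list[str]: ...

class RCAPipeline:
    """Compose any tracer, augmenter, and localizer into one RCA run, so
    different algorithm combinations share a single implementation base."""
    def __init__(self, tracer: Tracer, augmenter: Augmenter, localizer: Localizer):
        self.tracer, self.augmenter, self.localizer = tracer, augmenter, localizer

    def run(self, crashing_input: bytes) -> list[str]:
        inputs = [crashing_input] + self.augmenter.augment(crashing_input)
        return self.localizer.localize([self.tracer.trace(i) for i in inputs])
```

Under a design like this, swapping Pin for DynamoRIO, or one augmentation strategy for another, changes a constructor argument rather than the whole toolchain, which is precisely what would make head-to-head comparisons fair.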
A. Threats to Validity
Although the evaluations on RCABench and our research
questions provided thought-provoking claims in Section V, we
admit that two major threats may weaken some of them.

The first is the non-uniqueness of the root cause definition.
As previously noted, we are aware that multiple root
cause locations can serve as the ground truth, and we mitigated this risk by
making our definition public and upgradable. However, this is
only a temporary countermeasure in the sense that particular
techniques can still be underestimated; some
techniques may report valid root cause locations that differ from our
definition while others report the ones we chose. A more robust
definition of the root cause and a better method for evaluating accuracy are desired.
The second is fuzzing randomness. While RQ4 revealed that it undoubtedly threatens the validity of the evaluations in
previous studies, it also threatens the validity of our own evaluation,
especially for RQ1 and RQ3 (note that RQ2 must be considered
even within a single trial, and hence its claim remains valid
regardless of this threat). We acknowledge that our
results lack statistical significance owing to our limited
computational resources and should be reviewed carefully
by others. Nevertheless, it is clear that the effectiveness and
superiority of the existing techniques are at least not as obvious
as the evaluations in the existing studies claimed. Moreover,
the lack of statistical significance can eventually be resolved
because we release RCABench as an OSS platform,
allowing other researchers to run further benchmarks.

While the above two points threaten the internal validity,
the external validity is another concern because seven programs are not enough to fully understand the behavior and performance of RCA techniques against a wide variety of targeted
programs.

VII. CONCLUSION

Although fuzzing is a mature method for automatically
finding bugs, root cause analysis (RCA) techniques for the discovered bugs are not yet mature. One cause is that the
environment for comprehensively evaluating existing RCA
techniques has been inadequate, making it difficult to discover the
outstanding problems. Therefore, we developed a benchmarking
platform, RCABench, for automatic and extensive evaluation.
Our experiments indicated that the evaluations in previous
studies were not enough to fully support their claims and
revealed cases in which representative techniques failed
to analyze crashes with high accuracy.

We believe that this initiative will foster future RCA research
by assisting researchers in proposing and evaluating emerging RCA
techniques, as this study has begun to demonstrate. To shed light on
and help resolve the hidden challenges of RCA, we would like
to continue adding various targets and techniques, making
RCABench an even more insightful platform.
ACKNOWLEDGMENT

We would like to thank the anonymous reviewers for their
helpful feedback. We also gratefully acknowledge the authors
of Aurora and VulnLoc, who made their implementations
and experiment configurations publicly available. This work
was supported by the Acquisition, Technology & Logistics
Agency (ATLA) under the Innovative Science and Technology
Initiative for Security 2020 (JPJ004596).

REFERENCES
[1] H. Liang, X. Pei, X. Jia, W. Shen, and J. Zhang, "Fuzzing: State of the art," IEEE Trans. Reliab., vol. 67, no. 3, pp. 1199–1218, 2018. [Online]. Available: https://doi.org/10.1109/TR.2018.2834476
[2] V. J. M. Manès, H. Han, C. Han, S. K. Cha, M. Egele, E. J. Schwartz, and M. Woo, "The art, science, and engineering of fuzzing: A survey," IEEE Trans. Software Eng., vol. 47, no. 11, pp. 2312–2331, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2946563
[3] K. Serebryany, "OSS-Fuzz - Google's continuous fuzzing service for open source software," Talk at the 26th USENIX Security Symposium, 2017.
[4] Google, "OSS-Fuzz," 2022. [Online]. Available: https://github.com/google/oss-fuzz#trophies
[5] Z. Jiang, X. Jiang, A. Hazimeh, C. Tang, C. Zhang, and M. Payer, "Igor: Crash deduplication through root-cause clustering," in CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15-19, 2021. ACM, 2021, pp. 3318–3336. [Online]. Available: https://doi.org/10.1145/3460120.3485364
[6] T. Blazytko, M. Schlögel, C. Aschermann, A. Abbasi, J. Frank, S. Wörner, and T. Holz, "AURORA: Statistical crash analysis for automated root cause explanation," in 29th USENIX Security Symposium, USENIX Security 2020, August 12-14, 2020. USENIX Association, 2020, pp. 235–252. [Online]. Available: https://www.usenix.org/conference/usenixsecurity20/presentation/blazytko
[7] S. Shen, A. Kolluri, Z. Dong, P. Saxena, and A. Roychoudhury, "Localizing vulnerabilities statistically from one exploit," in ASIA CCS '21: ACM Asia Conference on Computer and Communications Security, Virtual Event, Hong Kong, June 7-11, 2021. ACM, 2021, pp. 537–549. [Online]. Available: https://doi.org/10.1145/3433210.3437528
[8] X. Zhang, J. Chen, C. Feng, R. Li, W. Diao, K. Zhang, J. Lei, and C. Tang, "DeFault: Mutual information-based crash triage for massive crashes," in 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2022, pp. 635–646. [Online]. Available: https://doi.org/10.1145/3510003.3512760
[9] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa, "A survey on software fault localization," IEEE Trans. Software Eng., vol. 42, no. 8, pp. 707–740, 2016. [Online]. Available: https://doi.org/10.1109/TSE.2016.2521368
[10] H. A. de Souza, M. L. Chaim, and F. Kon, "Spectrum-based software fault localization: A survey of techniques, advances, and challenges," CoRR, vol. abs/1607.04347, 2016. [Online]. Available: http://arxiv.org/abs/1607.04347
[11] P. Agarwal and A. P. Agrawal, "Fault-localization techniques for software systems: A literature review," ACM SIGSOFT Softw. Eng. Notes, vol. 39, no. 5, pp. 5:1–5:8, 2014. [Online]. Available: https://doi.org/10.1145/2659118.2659125
[12] "American fuzzy lop (AFL)," https://lcamtuf.coredump.cx/afl/.
[13] "honggfuzz," https://honggfuzz.dev/.
[14] A. Herrera, H. Gunadi, S. Magrath, M. Norrish, M. Payer, and A. L. Hosking, "Seed selection for successful fuzzing," in ISSTA '21: 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Denmark, July 11-17, 2021. ACM, 2021, pp. 230–243. [Online]. Available: https://doi.org/10.1145/3460319.3464795
[15] D. Wolff, M. Böhme, and A. Roychoudhury, "Explainable fuzzer evaluation," CoRR, vol. abs/2212.09519, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2212.09519
[16] M. Böhme, L. Szekeres, and J. Metzman, "On the reliability of coverage-based fuzzer benchmarking," in 44th IEEE/ACM International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 2022, pp. 1621–1633. [Online]. Available: https://doi.org/10.1145/3510003.3510230
[17] Y. Li, S. Ji, Y. Chen, S. Liang, W. Lee, Y. Chen, C. Lyu, C. Wu, R. Beyah, P. Cheng, K. Lu, and T. Wang, "UNIFUZZ: A holistic and pragmatic metrics-driven platform for evaluating fuzzers," in 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021. USENIX Association, 2021, pp. 2777–2794. [Online]. Available: https://www.usenix.org/conference/usenixsecurity21/presentation/li-yuwei
[18] Google, "FuzzBench reports," 2020. [Online]. Available: https://www.fuzzbench.com/reports/
[19] H. Kaur, H. S. Pannu, and A. K. Malhi, "A systematic review on imbalanced data challenges in machine learning: Applications and solutions," ACM Comput. Surv., vol. 52, no. 4, pp. 79:1–79:36, 2019. [Online]. Available: https://doi.org/10.1145/3343440
[20] B. Dolan-Gavitt, P. Hulin, E. Kirda, T. Leek, A. Mambretti, W. K. Robertson, F. Ulrich, and R. Whelan, "LAVA: Large-scale automated vulnerability addition," in IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22-26, 2016. IEEE Computer Society, 2016, pp. 110–121. [Online]. Available: https://doi.org/10.1109/SP.2016.15
[21] A. Hazimeh, A. Herrera, and M. Payer, "Magma: A ground-truth fuzzing benchmark," Proc. ACM Meas. Anal. Comput. Syst., vol. 4, no. 3, pp. 49:1–49:29, 2020. [Online]. Available: https://doi.org/10.1145/3428334
[22] J. Metzman, L. Szekeres, L. Simon, R. Sprabery, and A. Arya, "FuzzBench: An open fuzzer benchmarking platform and service," in ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021. ACM, 2021, pp. 1393–1403. [Online]. Available: https://doi.org/10.1145/3468264.3473932
[23] C. Luk, R. S. Cohn, R. Muth, H. Patil, A. Klauser, P. G. Lowney, S. Wallace, V. J. Reddi, and K. M. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation, Chicago, IL, USA, June 12-15, 2005. ACM, 2005, pp. 190–200. [Online]. Available: https://doi.org/10.1145/1065010.1065034
[24] D. Bruening, Q. Zhao, and S. P. Amarasinghe, "Transparent dynamic instrumentation," in Proceedings of the 8th International Conference on Virtual Execution Environments, VEE 2012, London, UK, March 3-4, 2012. ACM, 2012, pp. 133–144. [Online]. Available: https://doi.org/10.1145/2151024.2151043
[25] A. Fioraldi, D. C. Maier, D. Zhang, and D. Balzarotti, "LibAFL: A framework to build modular and reusable fuzzers," in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS 2022, Los Angeles, CA, USA, November 7-11, 2022. ACM, 2022, pp. 1051–1065. [Online]. Available: https://doi.org/10.1145/3548606.3560602
[26] "Fuzzing unification framework (fuzzuf)," https://github.com/fuzzuf/fuzzuf.
APPENDIX

TABLE II: Details of targeted vulnerabilities.

     Program   CVE ID           Root Cause        Crash Cause
#1   LibTIFF   CVE-2016-10094   off-by-one error  heap buffer overflow
#2   Libjpeg   CVE-2018-19664   incomplete check  heap buffer overflow
#3   Libjpeg   CVE-2017-15232   missing check     null pointer dereference
#4   Libxml2   CVE-2017-5969    incomplete check  null pointer dereference
#5   mruby     None             missing check     type confusion
#6   readelf   CVE-2019-9077    missing check     heap buffer overflow
#7   Lua       CVE-2019-6706    missing check     use-after-free

Target #5 was not assigned a CVE ID but was assigned ID 185041 on the HackerOne platform.