5.2.1 RQ1: Does ConfigFuzz Outperform Baselines?
Figures
5 to
7 show the line coverage growth of each fuzzer over time on all six target programs. The results for the default configuration (Baseline-def), ConfigFuzz-1, ConfigFuzz-2, ConfigFuzz-max, and covering arrays (Baseline-2-way) are represented by lines in black, blue, red, orange, and green, respectively.
Overall, the performance of ConfigFuzz varied across the target programs. ConfigFuzz clearly outperformed the baselines on xmllint, gif2png, cxxfilt, and FFmpeg, because it prioritizes configurations that may lead to higher code coverage. However, Baseline-2-way and/or Baseline-def achieved higher coverage on nm and objdump. In general, ConfigFuzz performed better than the baseline settings on programs that can be well explored within the 24-hour timeout (i.e., xmllint, gif2png, and cxxfilt) by effectively allocating resources across the whole configuration space. ConfigFuzz can also perform well on more complicated programs (i.e., FFmpeg) for the same reason. On other complicated programs (i.e., nm and objdump), however, ConfigFuzz may spend a lot of resources fuzzing configurations that continuously generate new coverage, without noticing that exploring other configurations may lead to even more coverage. We now analyze the performance on each program in detail.
For xmllint (Figures
5(a) and
5(b)), every setting of ConfigFuzz outperformed the two baselines by a large margin for both AFL and AFL++. ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max not only achieved higher coverage than Baseline-def and Baseline-2-way at a very early stage of the fuzzing campaign, but their coverage also grew faster. This is a strong indication that ConfigFuzz outperformed the baselines by exploring more command-line options. For both AFL and AFL++, Baseline-def covered fewer than 7,000 lines, while every setting of ConfigFuzz covered more than 15,000 lines. This result indicates that a large portion of xmllint's code may not be reachable through its default configuration. Using more preset configurations (Baseline-2-way) did increase the coverage of the fuzzers, to 13,000 lines, but was still less effective than ConfigFuzz.
To provide more insights on how ConfigFuzz performed during the fuzzing campaign, we collected all the generated seeds and extracted their configurations. We analyzed the distribution of the options that appeared in these configurations. Columns 1–3 in Table
6 show the five most frequent options in ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max results. Most options rarely appeared (less than 5%) in the generated configurations. The three most frequent options in ConfigFuzz-1 results were
-html,
-recover, and
-repeat; they appeared 15,523, 7,622, and 4,004 times, respectively. These options were also frequently fuzzed by ConfigFuzz-2 and ConfigFuzz-max. This shows that ConfigFuzz was able to frequently fuzz these non-default options that led to higher coverage. By investigating the source code, we observed that enabling
-html and/or
-recover allows ConfigFuzz to reach many unique lines. However, enabling the
-repeat option makes xmllint execute its main functionality (parsing and printing the input XML file) 100 times. This may mislead the fuzzers about the option's potential to generate new coverage.
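The option-frequency analysis described above can be sketched as follows. This is a minimal illustration rather than the tooling used in the evaluation: the representation of a configuration as a list of option strings, the function name `option_frequencies`, and the sample data are all assumptions, and the sample does not reproduce the numbers in Table 6.

```python
from collections import Counter


def option_frequencies(configurations):
    """Map each option to (count, share) over a list of configurations.

    `configurations` is a hypothetical representation: one list of option
    strings per generated seed.
    """
    if not configurations:
        return {}
    counts = Counter()
    for config in configurations:
        counts.update(set(config))  # count an option once per configuration
    total = len(configurations)
    return {opt: (n, n / total) for opt, n in counts.most_common()}


# Made-up configurations, for illustration only.
sample = [["-html"], ["-html", "-recover"], ["-recover"], ["-html", "-repeat"], []]
freqs = option_frequencies(sample)

# Options whose share falls below a 5% threshold.
rare = [opt for opt, (_, share) in freqs.items() if share < 0.05]
```

Sorting by `most_common()` puts the most frequently fuzzed options first, which is the ordering a top-five table needs.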
Results on cxxfilt (Figures
6(a) and
6(b)), FFmpeg (Figures
6(c) and
6(d)), and gif2png (Figure
6(e)) show a similar trend: all settings of ConfigFuzz outperformed the two baselines, while Baseline-def always performed the worst. ConfigFuzz also consistently fuzzed a small set of options more frequently across the three ConfigFuzz settings. Unlike in xmllint, these frequently fuzzed options in gif2png did not contribute much to ConfigFuzz’s performance. In fact, the number of unique lines exposed by enabling each individual option in gif2png is usually small. Nevertheless, fuzzing all options led to ConfigFuzz outperforming Baseline-def. The option
-w caused the poor performance of Baseline-2-way. By enabling
-w, gif2png lists images without animation or transparency and exits earlier compared to other options. Baseline-2-way has about half of its 2-way covering arrays enabling
-w. ConfigFuzz, however, allocated far fewer resources to
-w. Unlike on xmllint, the baselines produced better coverage at some stages of the fuzzing campaigns. For example, for AFL-cxxfilt, Baseline-def achieved higher coverage than ConfigFuzz in the first two hours but was surpassed within a few hours. This result suggests that ConfigFuzz may not always find the most effective configurations to fuzz at the beginning but is capable of finding such configurations given time. The options most frequently fuzzed by ConfigFuzz in cxxfilt,
--format and
-t, are both important. The main functionality of cxxfilt is to demangle a string, and the settings of
--format decide the demangling style. Similar to the
-w option in gif2png, disabling
-t terminates cxxfilt’s execution earlier compared to other options.
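To illustrate why a single option can occupy a large share of a 2-way covering array, the sketch below checks pairwise coverage for boolean options and measures how often one option is enabled. It is only an illustration: how the covering arrays for Baseline-2-way were generated is not described here, and the hand-written array, the function names, and the three unnamed options are assumptions.

```python
from itertools import combinations, product


def is_two_way_covering(rows, num_options):
    """True if every pair of boolean options takes all four on/off combinations."""
    required = set(product((0, 1), repeat=2))
    for i, j in combinations(range(num_options), 2):
        seen = {(row[i], row[j]) for row in rows}
        if seen != required:
            return False
    return True


def enabled_share(rows, index):
    """Fraction of rows in which the option at `index` is enabled."""
    return sum(row[index] for row in rows) / len(rows)


# Hand-written 2-way covering array over three hypothetical boolean options.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
covers_all_pairs = is_two_way_covering(rows, 3)
share_of_first_option = enabled_share(rows, 0)
```

In this small array every option is enabled in half of the rows, so an option that makes the program exit early would affect half of the preset configurations.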
The results on nm (Figures
7(a) and
7(b)), however, show that ConfigFuzz performed worse than both baselines. Baseline-2-way was significantly better than other approaches, reaching 15,000 and 17,500 lines using AFL and AFL++, respectively. Even Baseline-def outperformed all ConfigFuzz settings. To understand this behavior, in Table
7, we extracted the 10 most frequently fuzzed options in ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max using AFL. We observe that different options were fuzzed more frequently in these settings. Over the 24 hours of fuzzing, the frequently fuzzed options such as
--target,
--defined-only, and
-a continuously generated new coverage, which may explain why ConfigFuzz did not allocate more resources to other options. This indicates that ConfigFuzz may not be the most efficient approach for the large program and configuration space of nm.
AFL and AFL++ produced different results on objdump (Figures
7(c) and
7(d)). With AFL, Baseline-def performed best and ConfigFuzz-max second; with AFL++, ConfigFuzz-2 performed best and Baseline-def second. First, unlike on most other programs, the coverage growth on objdump did not flatten after a few hours of fuzzing. This indicates that the search space of this program is large for all fuzzers: even when only the default configuration is fuzzed, the fuzzers continue to discover new coverage steadily over the 24 hours. Second, the results in Figures
7(c) and
7(d) show that the effectiveness of ConfigFuzz also depends on the fuzzer. ConfigFuzz-2 performed significantly differently with AFL and AFL++, covering about 20,000 and 90,000 lines, respectively.
5.2.2 RQ2: How Do ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max Compare?
Comparing the performance of ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max in Figures
5 to
7, ConfigFuzz-max did not always produce the highest coverage among these ConfigFuzz settings, and ConfigFuzz-2 outperformed ConfigFuzz-1 in most cases. The performance of these ConfigFuzz settings was affected by two aspects of the configuration space of the target programs. First, some pairs of options interact: some source code lines can only be reached by either enabling both options or enabling one and disabling the other. Second, some options make the program run longer, so when ConfigFuzz generates configurations that include such option(s), the fuzzing process slows down.
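The effect of option interactions on reachable code can be illustrated with a toy target. This is a sketch under stated assumptions, not a model of any evaluated program: the options `a`, `b`, and `c`, the line labels, and both function names are hypothetical, and all options are assumed to be disabled by default.

```python
from itertools import combinations


def covered_lines(enabled):
    """Toy target: return the labels of the lines reached for a set of enabled options."""
    lines = {"main"}
    if "a" in enabled:
        lines.add("a_enabled")
    if "a" in enabled and "b" in enabled:
        lines.add("a_and_b")      # reachable only when both options are enabled
    if "c" in enabled and "a" not in enabled:
        lines.add("c_without_a")  # needs c enabled and a disabled
    return lines


def union_coverage(options, max_enabled):
    """Lines reached by all configurations that enable at most `max_enabled` options."""
    covered = set()
    for k in range(max_enabled + 1):
        for combo in combinations(options, k):
            covered |= covered_lines(set(combo))
    return covered


options = ["a", "b", "c"]
one_option = union_coverage(options, 1)
two_options = union_coverage(options, 2)
only_with_pairs = two_options - one_option
```

Here `only_with_pairs` contains just the line guarded by both `a` and `b`: configurations with a single enabled option can never reach it, while configurations with two enabled options can.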
On xmllint (Figures
5(a) and
5(b)), ConfigFuzz-2 covered 18% more lines than ConfigFuzz-1 in 24 hours (statistically significant according to the Mann-Whitney U test [
15,
20]), which can be attributed to option interactions. One of the options generated frequently by ConfigFuzz was
--html. It interacts with
--push,
--memory,
--insert,
--xmlout, and
--debugent and was the main reason why ConfigFuzz-2 outperformed ConfigFuzz-1. Although ConfigFuzz-max outperformed ConfigFuzz-1, it was worse than ConfigFuzz-2. This result indicates that while covering option interactions may help improve the coverage, considering too many option interactions may make ConfigFuzz less effective.
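For readers unfamiliar with the Mann-Whitney U test mentioned above, the sketch below computes its U statistic for two groups of runs. It is a minimal illustration only: the coverage values are made-up placeholders rather than measurements from this evaluation, the function name is an assumption, and no p-value is computed (a statistics library such as SciPy's `scipy.stats.mannwhitneyu` would normally be used for that).

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counted as 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Placeholder final-coverage values for repeated runs of two settings.
setting_a_runs = [105, 110, 108, 112, 107]
setting_b_runs = [95, 99, 101, 97, 100]

u_a = mann_whitney_u(setting_a_runs, setting_b_runs)
u_b = mann_whitney_u(setting_b_runs, setting_a_runs)
```

The two statistics always sum to the number of pairs; a value of `u_a` equal to that maximum means every run of the first setting beat every run of the second.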
In Figure
6(e), the coverage of ConfigFuzz-1 and ConfigFuzz-2 on gif2png is close. This is because gif2png’s options have only one interaction, triggered by enabling
-O and disabling
-r, and the interaction is associated with only a few unique lines.
On objdump, as discussed above, due to the large size of this program, the performance of ConfigFuzz-1, ConfigFuzz-2, and ConfigFuzz-max varied significantly between AFL and AFL++. With AFL, ConfigFuzz-max covered more than three times as many lines as ConfigFuzz-1 or ConfigFuzz-2, while ConfigFuzz-2 was the best with AFL++. On the other programs, the performance of ConfigFuzz-1 and ConfigFuzz-2 was in most cases not significantly different. Interestingly, ConfigFuzz-max in a few cases performed worse than both ConfigFuzz-1 and ConfigFuzz-2. This is because ConfigFuzz-max was slowed down by including too many options in its generated configurations. For example, on gif2png, AFL ran about 11,000,000 executions with ConfigFuzz-2 but only about 6,800,000 executions with ConfigFuzz-max.
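The execution counts above can be turned into average throughput with a short calculation. This sketch assumes that both counts cover the full 24-hour campaign; the function and variable names are illustrative.

```python
SECONDS_PER_CAMPAIGN = 24 * 60 * 60  # assumes the counts span the whole 24 hours


def executions_per_second(total_executions, seconds=SECONDS_PER_CAMPAIGN):
    """Average executions per second over the campaign."""
    return total_executions / seconds


cf2_rate = executions_per_second(11_000_000)   # ConfigFuzz-2 on gif2png with AFL
cfmax_rate = executions_per_second(6_800_000)  # ConfigFuzz-max on gif2png with AFL
relative_throughput = cfmax_rate / cf2_rate
```

Under this assumption, ConfigFuzz-2 averaged roughly 127 executions per second and ConfigFuzz-max roughly 79, so ConfigFuzz-max ran at about 62% of ConfigFuzz-2's throughput.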
5.2.3 RQ3: How Does ConfigFuzz Perform on String Options?
For the four programs with string options (FFmpeg, nm, objdump, and xmllint), we additionally fuzzed them with the string options included. Figures
8 and
9 show the coverage growth plots comparing ConfigFuzz-1 and ConfigFuzz-2 with ConfigFuzz-str-1 and ConfigFuzz-str-2 for these programs. Results of ConfigFuzz-1, ConfigFuzz-2, ConfigFuzz-str-1, and ConfigFuzz-str-2 are represented by lines in blue, red, yellow, and purple, respectively.
For xmllint (Figures
8(a) and
8(b)), ConfigFuzz performed better with string options included in the fuzzing space. ConfigFuzz-str-1 and ConfigFuzz-str-2 achieved higher coverage than ConfigFuzz-1 and ConfigFuzz-2 from an early stage. At the end of 24 hours, fuzzing configurations with string options on xmllint resulted in a more than 50% increase in line coverage. The string option
-xpath was frequently fuzzed and enabled ConfigFuzz to reach many unique lines. Also, this string option was more frequently fuzzed by ConfigFuzz-str-2 than ConfigFuzz-str-1, as it was more likely to be generated in a configuration with two options, compared to a configuration with a single option.
Adding the string options did not help ConfigFuzz improve the performance on nm. As shown in Figures
8(c) and
8(d), the coverage of ConfigFuzz-str-1 and ConfigFuzz-str-2 is similar to that of ConfigFuzz-1 and ConfigFuzz-2. The only string option in nm,
--ifunc-chars, was not frequently fuzzed by ConfigFuzz-str-1 and ConfigFuzz-str-2. This is likely because
--ifunc-chars only controls 4 unique lines in nm and was quickly exploited by ConfigFuzz.
String options in objdump are mostly structured strings. For example, the
--ctf and
--ctf-parent options take section names, which are hard for ConfigFuzz to generate. An exception is
--source-comment, which prefixes its setting to the source code lines. ConfigFuzz spent a lot of resources on this option but did not achieve higher coverage. In Figures
9(a) and
9(b), ConfigFuzz-str-1 and ConfigFuzz-str-2 did not outperform ConfigFuzz-1 and ConfigFuzz-2.
FFmpeg has 12 string options, and we set the maximum size of a string setting to 19 in this evaluation. The implementation of ConfigFuzz forces all options that are not of bool type (23 in FFmpeg, as shown in Table
5) to use the same number of setting bytes, so 460 setting bytes are needed to represent these options. When the input data size is smaller than the configuration bytes, ConfigFuzz will run the target program with the default configuration. ConfigFuzz-str-1 and ConfigFuzz-str-2 were not able to generate large enough seeds during most of the fuzzing time, resulting in their worse performance than ConfigFuzz-1 and ConfigFuzz-2.
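The size constraint described above can be sketched as follows. This is an illustration under stated assumptions and not the actual ConfigFuzz implementation: the counts 23 and 460 come from the text, which implies 20 bytes per non-bool option (one more than the maximum string size of 19; the role of the extra byte is not stated here). The placement of the setting bytes at the front of the input, the treatment of the remainder as program input, the omission of any bytes used for bool options, and the name `split_input` are assumptions.

```python
NON_BOOL_OPTIONS = 23   # from the text (FFmpeg)
CONFIG_BYTES = 460      # from the text
BYTES_PER_SETTING = CONFIG_BYTES // NON_BOOL_OPTIONS  # 20 bytes per option


def split_input(data: bytes):
    """Split fuzzer input into per-option setting chunks and the remaining payload.

    Returns (None, data) when the input is too short, signalling that the
    target should run with its default configuration.
    """
    if len(data) < CONFIG_BYTES:
        return None, data
    settings = [
        data[i * BYTES_PER_SETTING:(i + 1) * BYTES_PER_SETTING]
        for i in range(NON_BOOL_OPTIONS)
    ]
    return settings, data[CONFIG_BYTES:]


short_settings, short_payload = split_input(b"x" * 100)
long_settings, long_payload = split_input(bytes(CONFIG_BYTES) + b"abc")
```

Any seed shorter than 460 bytes takes the first branch, which is why small seeds end up exercising only the default configuration.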