4.1 Case Study
To carry out experiments in this work, we used a collection of 19 RESTful APIs plus one artificial case study named
rest-faults for the experiment on fault detection. Table
1 shows the statistics of these 20 APIs.
As we need to measure code coverage and analyze the source code to check which parts are not covered, we needed open source APIs that we could run on a local machine. Furthermore, to simplify the collection of code coverage results, it is easier to use APIs written in the same programming language (e.g., Java), or at least in not too many different languages, as each programming language requires its own code coverage tool to be configured to analyze the test results. Considering that we also wanted to do comparisons with white-box testing, which currently only
EvoMaster supports, and that requires some manual configurations (e.g., to set up bytecode instrumentation for the white-box heuristics), we decided to use the same case study that we maintain for
EvoMaster. In particular, we maintain a repository of RESTful APIs called
EMB [
7], which is stored as well on Zenodo [
36]. Note that one of these APIs comes from one of our industrial partners and, of course, we are not allowed to store it on EMB. These 19 APIs are written in four different languages: Java, Kotlin, JavaScript, and TypeScript. They run on two different environments/runtimes: JVM and NodeJS. For each SUT, we report the total number of source files (i.e., ending in either
.java,
.kt,
.js, or
.ts), and their number of lines (LOCs). As these also include import statements, empty lines, comments, and tests, for the APIs running on JVM we also report the number of actual line targets for code coverage (measured with the tool JaCoCo [
10]).
For the APIs running on NodeJS, the code coverage is measured with the tool c8, which uses native V8 coverage. By default, the tool c8 will count code coverage only for the files that are loaded by the engine [
5]. For instance, regarding
cyclotron, the different LOCs between the Files and c8 columns are due to unreachable files (i.e.,
api.analyticselasticsearch.js,
api.analytics.js,
api.statistics-elasticsearch.js). However, these files can only be reached by manually modifying a configuration option in
config.js (i.e., the default value is
false). Therefore, we report the number of line targets measured by c8 that could be loaded with the default SUT settings.
For the experiments on fault detection for the black-box fuzzers (Section
4.5), we created a small artificial API, written in Java, with 10 seeded faults. This API is open source, currently available on GitHub.
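As an illustration of what a seeded fault may look like (this is a hypothetical example in the style of a Spring REST controller, not one of the actual 10 faults), consider an endpoint that crashes with a 500 response for a specific input:

```java
// Hypothetical example of a seeded fault; NOT the actual code of the artificial API.
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class DiscountController {

    @GetMapping("/api/discount/{percentage}")
    public int applyDiscount(@PathVariable("percentage") int percentage) {
        int price = 100;
        // Seeded fault: division by zero when percentage == 100, leading to an
        // unhandled exception and hence a 500 status code in the HTTP response.
        return price / (100 - percentage);
    }
}
```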
4.2 Black-Box Testing Experiment Settings: Code Coverage
For each SUT, we created Bash scripts to start them, including any needed dependency (e.g., as ocvn-rest uses a MongoDB database, this is automatically started with Docker). Each SUT is started with either JaCoCo (for JVM) or c8 (for NodeJS) instrumentation, to be able to collect code coverage results at the end of each tool execution.
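For example, for a JVM SUT, attaching the JaCoCo agent at startup could look as follows (a minimal Java sketch; the actual experiments used Bash scripts, and the jar names and port option are hypothetical):

```java
import java.io.File;

public class StartSutWithJaCoCo {

    public static void main(String[] args) throws Exception {
        int port = 40100; // each script uses a different TCP port

        // Start the SUT as a background process with the JaCoCo agent attached,
        // so that line coverage is written to a .exec file when the SUT terminates.
        Process sut = new ProcessBuilder(
                "java",
                "-javaagent:jacocoagent.jar=destfile=jacoco-" + port + ".exec,output=file",
                "-jar", "sut.jar",            // hypothetical SUT jar
                "--server.port=" + port)      // hypothetical port option of the SUT
                .redirectErrorStream(true)
                .redirectOutput(new File("sut-" + port + ".log"))
                .start();

        System.out.println("SUT started with PID " + sut.pid());
    }
}
```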
Each of the seven compared tools was run on each of the 19 SUTs, repeated with different seeds a certain number of times (e.g., repeated 10 times for a budget setting of 1 hour), to take into account the randomness of these tools. Each script starts a SUT as a background process and then one of the tools. Each script runs the SUT on a different TCP port, to enable running any of these scripts in parallel on the same machine.
The code coverage is computed based on all HTTP calls made during the fuzzing process and not on the output of the generated test files (if any). This was done for several reasons: not all tools generate test cases in JUnit or JavaScript format, the generated tests might not compile (e.g., due to bugs in the tools), and setting up the compilation of the tests and running them to collect coverage would be quite challenging to automate (as each tool behaves differently). This also means that if a tool crashes, we are still measuring what code coverage it achieves. If a tool crashes immediately at startup (e.g., due to failures in parsing the OpenAPI/Swagger schemas), we are still measuring the code coverage achieved by the booting up of the SUT.
All experiments for this article were run on a Windows 10 machine with a 2.40-GHz, 24-core Intel Xeon processor with 192 GB of RAM. To avoid potential issues in running jobs in parallel (e.g., exhausting the resources of the operating system, which could lead to possible timeout issues), we only ran eight jobs at a time. This means that each experiment had 3 cores (six virtual threads) and 24 GB of RAM.
Regarding the seven selected fuzzers, some do not provide an option to configure a global time budget after which fuzzing terminates (e.g.,
Schemathesis [
19] and
RESTest [
16]). Additionally, although some fuzzers provide the option of a timeout (e.g.,
RestTestGen), they might terminate much earlier than the specified timeout value [
To make the comparison of the fuzzers fairer by applying the same time budget
\(X\) (e.g., 1 hour), we ran each fuzzer in a loop bounded by
\(X\): if a fuzzer exceeded the budget, it was terminated; if it completed while time was still remaining, it was restarted to generate more tests. Thus, all collected coverage is based on the same time budget for all fuzzers.
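A minimal sketch of this restart loop (written here in Java with a hypothetical fuzzer command; the actual experiments implemented it in Bash) could look as follows:

```java
import java.util.concurrent.TimeUnit;

public class FuzzerBudgetLoop {

    // Hypothetical fuzzer command; the actual experiments used Bash scripts.
    private static final String[] FUZZER_CMD =
            {"schemathesis", "run", "http://localhost:8080/v2/api-docs"};

    public static void main(String[] args) throws Exception {
        long budgetMillis = TimeUnit.HOURS.toMillis(1); // the global time budget X
        long deadline = System.currentTimeMillis() + budgetMillis;

        while (System.currentTimeMillis() < deadline) {
            long remaining = deadline - System.currentTimeMillis();
            Process fuzzer = new ProcessBuilder(FUZZER_CMD).inheritIO().start();
            // If the fuzzer is still running when the budget expires, kill it;
            // if it terminates earlier, the loop restarts it for the remaining time.
            if (!fuzzer.waitFor(remaining, TimeUnit.MILLISECONDS)) {
                fuzzer.destroyForcibly();
            }
        }
    }
}
```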
To compare these tools, we use the line coverage reported by JaCoCo and c8 as the metric. Another important metric would be fault detection. However, how to compute fault detection in an unbiased way, applicable to all compared tools, is far from trivial. Each tool can self-report how many faults it finds, but how such fault numbers are calculated might differ considerably from tool to tool, making any comparison nearly meaningless. Manually checking (possibly tens of) thousands of test cases is not a viable option either. Therefore, for this type of experiment, line coverage was the most suitable metric for the comparisons. Still, fault detection is a very important metric needed to properly compare fuzzers. For the black-box fuzzers, this will be evaluated with an ad hoc API, as explained in more detail in Section
4.5. For the white-box experiments (Section
4.6), this is not a problem, as we use the same tool (i.e.,
EvoMaster).
Regarding experiment setup, all of these black-box fuzzing tools for REST APIs need to be configured with the location of the API's schema. In all of the SUTs used in our case study, the schemas are provided by the SUTs directly via an HTTP endpoint. However, we found that most of the tools do not support fetching the schema from a given URL, such as
http://localhost:8080/v2/api-docs. To conduct the experiments with these tools, we developed a Bash script that, after the SUT starts, fetches the schema, saves it to the local file system, and then passes the resulting file path to the tools so they can access the schema.
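For example, downloading the schema to a local file takes only a few lines (a minimal Java sketch; the actual experiments used a Bash script, and the URL and file name are just examples):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class DownloadSchema {

    public static void main(String[] args) throws Exception {
        // Example schema location, exposed by the running SUT.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/v2/api-docs"))
                .GET()
                .build();

        // Save the schema to the local file system, so that tools that cannot
        // fetch it from a URL can read it from a file path instead.
        HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(Path.of("schema.json")));
    }
}
```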
Regarding additional setups to execute the tools,
EvoMaster,
RestCT, and
Schemathesis were the simplest to configure because they require only one setup step, as all of their options can be specified with command line arguments. However,
RestCT currently does not work directly on Windows [
15]. Therefore, for these experiments, we simply ran it via the Windows Subsystem for Linux. This might introduce some time delays compared to running it directly on a Linux machine. However,
RestCT is the only tool that has constraints on the type of operating system on which it can run, and practitioners in industry who use Windows would have to run it with the Windows Subsystem for Linux (or Docker) as well.
RestTestGen requires having a JSON file (i.e.,
rtg_config.json) to configure the tool with available options [
18].
RESTler requires multiple setup steps—for example,
RESTler needs to generate grammar files first and then employ such grammars for fuzzing. However, based on its available online documentation, we could write a Python script to run the tool.
RESTest requires a pre-step to generate a test configuration, which can be performed automatically with the utility
CreateTestConf provided by the tool. To use
bBOXRT, a Java class file is required to load and set up the API specification. At the time of writing this article, there is no specific documentation on how to write such a Java class. However, its open source repository contains many examples, which helped us create these Java classes for the SUTs in our case study. Note that for the experiments in this article, all of the preceding setups were performed automatically with our Bash scripts.
All manual steps to configure the tools that we automated in our Bash scripts take some time. However, such time was not taken into account in the time budget (e.g., 1 hour) used in the experiments. In other words, each tool was run for the same amount of time regardless of the time it took to set it up. There are two main reasons for this. First, evaluating the cost of each manual step in a sound way would require empirical studies with human subjects, taking into account several different properties (e.g., experience/seniority of the participants and familiarity with the existing fuzzers). But this kind of empirical study would be beyond the scope of this work. Second, tool setups on an API are a one-time cost. The same API would be tested continuously throughout its lifecycle (e.g., on a Continuous Integration server), each time new changes/features are introduced. In this context, such one-time setup cost would be negligible. Furthermore, when dealing with several APIs to test (e.g., in a microservice architecture), it can well be that the setups are quite similar, and it could be just a trivial matter of copy&paste when setting up the fuzzers for a new API (e.g., this is what we have experienced when applying
EvoMaster in large companies like Meituan [
73]). As those manual costs were anyway small in terms of time (in the order of minutes), excluding the time needed to study the documentation of these tools, we do not consider this a major threat to the validity of our study.
The first time we ran the experiments, we could collect results only for
EvoMaster,
RESTler, and
Schemathesis. All of the other tools failed to make any HTTP calls. This was due, for example, to a mismatched schema format or to missing/misconfigured information in the schemas. More specifically,
bBOXRT only allows a schema with YAML format. In the SUTs used in this study, there is only 1 specified with YAML (i.e.,
restcountries) out of the 19 schemas (the remaining ones use JSON).
RestTestGen only accepts a schema with OpenAPI v3 in JSON format [
18]. There are only 2 (i.e.,
cwa-verification and
spacex-api) out of the 19 SUTs that expose OpenAPI v3 in JSON format. In addition,
RESTest and
RestTestGen need the protocol information (e.g.,
http or
https with the
servers/
schemes tag) in the OpenAPI/Swagger schema. But since the
servers/
schemes tag is not mandatory, such information might not always be available in the schema. For example, 7 (i.e.,
cyclotron,
disease-sh-api,
js-rest-ncs,
js-rest-scs,
realworld-app,
features-service, and
restcountries) out of the 19 SUTs have such protocol information specified in their schemas. Additionally, to create HTTP requests,
RestCT,
RESTest, and
RestTestGen require information specified in
host (for schema version 2) and
servers (for schema version 3), but such information (typically related to TCP port numbers) might not be fully correct (e.g., the host and TCP port might refer to the production settings of the API and not reflect the case in which the API is running on the local host on an ephemeral port for testing purposes). Those three tools do not seem to provide ways to override such information. For instance, 10 out of the 19 SUTs (i.e.,
cyclotron,
disease-sh-api,
js-rest-ncs,
js-rest-scs,
realworld-app,
spacex-api,
cwa-verification,
features-service,
languagetool, and
restcountries) are specified with a hard-coded TCP port, and in 1 SUT (i.e.,
scout-api), the TCP port is unspecified. To avoid these issues in accessing the SUTs, we developed a utility using the
swagger-parser library that facilitates converting formats of schemas between JSON and YAML (only applied for
bBOXRT and
RestTestGen), converting OpenAPI v2 to OpenAPI v3 (only applied for
RestTestGen), adding missing
schemes information, and correcting/adding
host and
servers information in the schemas.
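As an illustration, a minimal sketch of how such a utility could look, using the swagger-parser and swagger-core libraries, is shown below (the schema URL, local server URL, and output file names are just examples; the actual utility we used differs in its details):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import io.swagger.parser.OpenAPIParser;
import io.swagger.v3.core.util.Json;
import io.swagger.v3.core.util.Yaml;
import io.swagger.v3.oas.models.OpenAPI;
import io.swagger.v3.oas.models.servers.Server;
import io.swagger.v3.parser.core.models.ParseOptions;
import io.swagger.v3.parser.core.models.SwaggerParseResult;

public class SchemaFixer {

    public static void main(String[] args) throws Exception {
        String schemaUrl = args[0];      // e.g., http://localhost:8080/v2/api-docs
        String localServerUrl = args[1]; // e.g., http://localhost:12345

        // OpenAPIParser reads both Swagger v2 and OpenAPI v3 documents,
        // converting v2 into the v3 object model.
        ParseOptions options = new ParseOptions();
        options.setResolve(true);
        SwaggerParseResult result = new OpenAPIParser().readLocation(schemaUrl, null, options);
        OpenAPI openAPI = result.getOpenAPI();

        // Override the 'servers' entry so that fuzzers target the locally running
        // SUT instead of hard-coded production hosts/ports.
        openAPI.setServers(List.of(new Server().url(localServerUrl)));

        // Write the fixed schema in both JSON and YAML, as different fuzzers
        // accept different formats.
        Files.writeString(Path.of("schema.json"), Json.pretty(openAPI));
        Files.writeString(Path.of("schema.yaml"), Yaml.pretty(openAPI));
    }
}
```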
Once these changes in the schemas were applied, we repeated the experiments to collect data from all seven tools. Ideally, these issues should be handled directly by the tools. However, as they are rather minor and only required changes in the OpenAPI/Swagger schemas, we decided to address them, to be able to collect data from all seven tools and not just from three of them.
In addition, we needed to configure authentication information for five APIs, namely proxyprint, scout-api, ocvn-rest, realworld-app, and spacex-api. The APIs proxyprint, scout-api, and spacex-api need static tokens sent via HTTP headers. This was easy to set up in RestCT, EvoMaster, and Schemathesis, just by calling these tools with a --header input parameter. RESTest required rewriting the test configuration file to append the authentication configuration. RESTler and RestTestGen required writing some script configurations. bBOXRT has no documentation on how to set up authentication, but we managed to do so by studying the examples provided in its repository.
Regarding ocvn-rest and realworld-app, authentication requires making a POST call to a form-login endpoint and then using the received cookie in all following HTTP requests. Out of the seven compared tools, it seems that RESTler, bBOXRT, Schemathesis, and RestTestGen could directly support this kind of authentication by setting it up with an executable script. However, given the provided documentation, we did not manage to configure it, as it requires writing different scripts for different fuzzers to manually make such HTTP login calls and then handle the responses. Technically, by writing manual scripts, it could be possible to use EvoMaster, RestCT, and RESTest as well, by passing the obtained cookie with the --header option or the test configuration file. As doing all of this was rather cumbersome, and considering that for this API the authentication is needed only for admin endpoints, we decided not to spend significant time trying to set up this kind of dynamic authentication token.
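For reference, a minimal sketch of this kind of cookie-based authentication flow (with a hypothetical login endpoint, credentials, and admin endpoint) could look as follows in Java:

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CookieLoginExample {

    public static void main(String[] args) throws Exception {
        // The CookieManager stores the session cookie returned by the login
        // endpoint and attaches it to all following requests automatically.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        // Hypothetical form-login endpoint and credentials.
        HttpRequest login = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/login"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("username=admin&password=admin"))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // The session cookie is now sent with every subsequent call,
        // e.g., to a hypothetical admin-only endpoint.
        HttpRequest adminCall = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/api/admin/resource"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(adminCall, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```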
4.3 RQ1: Black-Box Testing Experiments
To answer RQ1, each of the seven compared tools was run on each of the 19 SUTs, repeated 10 times to take into account the randomness of these tools. This resulted in a total of \(7 \times 19 \times 10 = 1330\) Bash scripts.
In each script, each tool was run for 1 hour, for a total of 1,330 hours (i.e., 55.4 days of computation effort). Note that the choice of the runtime for each experiment might impact the results of the comparisons. The choice of 1 hour is somewhat arbitrary, but it reflects how practitioners might want to use these fuzzers in practice, while not being so long as to make running all of these experiments unviable in a reasonable time.
Table
2 shows the results of these experiments. For each tool, we report the average (i.e., arithmetic mean) line coverage, as well as the min and max values out of the 10 runs. Each tool is then ranked (from 1 to 7) based on their average performance on each SUT (where 1 is the best rank). A Friedman test is conducted to analyze the variance of these techniques, based on ranks of their performance on the SUTs.
From these results, we can infer a few interesting observations. First, regarding the ranking,
EvoMaster seems to be the best black-box fuzzer (best in 11 out of 19 SUTs, with an average coverage of 56.8%), closely followed by
Schemathesis (best in 7 SUTs, with an average coverage of 54.5%). The variance of the techniques is statistically significant at the significance level 0.05 (i.e.,
\(p \le 0.05\) ) with the Friedman test.
EvoMaster BB achieves the best average rank (i.e., 1.6), whereas
Schemathesis has the second best average rank (i.e., 1.9). Then, the remaining tools can be divided into two groups:
bBOXRT (average rank 4.6) and
RestTestGen (average rank 3.9) with similar coverage of 41.6% to 45.4%, then
RESTler (average rank 5.1),
RESTest (average rank 5.0), and
RestCT (average rank 5.8) with similar coverage of 33.6% to 35.4%. These results confirm a previous study [
43] showing that
RestTestGen gives better results (in terms of black-box criteria) than
RESTler and
RESTest, as well as
RestCT being better than
RESTler [
67] (although in this case, the difference in average coverage is small, only 0.8%). Compared to the analyses in the work of Kim et al. [
53], interestingly the ranking of the tools is exactly the same (recall that out of the combined 29 APIs between our study and their study, only 10 APIs used in these empirical studies are the same).
On all but one of the APIs, either
EvoMaster or
Schemathesis gives the best results. The exception is the industrial API, where five tools achieve the same coverage of 8.2% on all 10 runs. We will discuss this interesting case in more detail in Section
4.7. In 12 APIs, either
EvoMaster is the best followed by
Schemathesis or the other way round. There is no single API in which
EvoMaster and
Schemathesis were not at least the third best.
The other seven APIs (including the industrial one) show some interesting behavior for the other five tools. For example, for js-rest-scs, RestTestGen gives the second best results, with an average of 86.4%, compared to 86.1% for Schemathesis. The interesting aspect here is that out of the 10 runs, RestTestGen has worse minimum coverage (85.4% vs. 85.9%) and worse maximum coverage (87.1% vs. 87.4%), although the average is higher (86.4% vs. 86.1%). This can happen when randomized algorithms are used. In gestaohospital-rest, RestTestGen and Schemathesis have similar performance (i.e., 57.2% and 58.7%), whereas EvoMaster is quite behind (i.e., 50.8%). Similarly, the performance of RestTestGen and Schemathesis is quite similar on rest-scs (65.3% and 64.8%) and restcountries (75.4% and 73.9%), where EvoMaster is better only by a small amount (66.9% on rest-scs and 76.1% on restcountries). In languagetool, RESTest is better than Schemathesis, but the difference is minimal (only 0.3%). In rest-ncs, there is a large gap in performance between EvoMaster (64.5%) and Schemathesis (94.1%), where the second best results are given by RestCT (85.5%). Finally, on scout-api, RESTler is better than Schemathesis (26.5% vs. 23.0%), although it is way behind EvoMaster (36.7%).
Another interesting observation is that there is quite a bit of variability in the results of these fuzzers, as they use randomized algorithms to generate test cases. We highlight some of the most extreme cases in Table
2, for
EvoMaster and
Schemathesis. On
languagetool, out of the 10 runs,
EvoMaster has a gap of 9.1% between the best (35.1%) and worst (26.0%) runs. On
scout-api, the gap is 8.3%. For
Schemathesis, the gap on
features-service is 10.5%, and 6.4% on
gestaohospital-rest. This is yet another reminder of the peculiar nature of randomized algorithms, as well as the importance of how to properly analyze them. For example, doing comparisons based on a single run is unwise.
Statistical tests [
31] are needed when claiming with high confidence that one algorithm/tool is better than another one. In this particular case,
EvoMaster BB achieved the overall best performance based on the significant
\(p\) -value (i.e., <0.05) with the Friedman test and the best average rank over all of the 19 APIs (i.e., 1.6) as shown in Table
2. Additionally, we compare
EvoMaster’s performance with all of the other tools, one at a time on each SUT (so
\(6 \times 19=114\) comparisons), and report the
\(p\) -values of the Mann-Whitney-Wilcoxon U-Test in Table
3. Apart from very few cases, the large majority of comparisons are statistically significant at level
\(\alpha =0.05\) . Often, 10 repetitions might not be enough to detect statistically significant differences, and higher repetition values like 30 and 100 are recommended in the literature [
31]. However, here the performance gaps are large enough that 10 repetitions were more than enough in most cases. Note that, as explained in more detail in the work of Arcuri and Galeotti [
31], we have not applied any
\(p\) -value correction on these multiple comparisons (as they are controversial). We rather report the raw values (rounded up to 0.001 for readability) in case readers still want to apply such corrections when analyzing such data. For completeness, as
Schemathesis achieves the best results on a few APIs, Table
4 reports the same kind of analysis, in which
Schemathesis is pairwise compared with all of the other tools.
When looking at the obtained coverage values, all of these tools achieve at least 30% coverage on average. Only two of them (i.e.,
EvoMaster and
Schemathesis) achieve more than 50%. But no tool goes above 60% coverage. This means that although these tools might be useful for practitioners, there are still several research challenges that need to be addressed (we will go into more detail about this in Section
4.7). However, what level of coverage can be reasonably expected from black-box tools (which have no information on the source code of the SUTs) is a question that is hard to answer.
4.6 RQ4: White-Box Testing Experiments
Out of the seven compared tools, only EvoMaster supports white-box testing. EvoMaster uses evolutionary computation techniques, where the bytecode of the SUTs is instrumented to compute different kinds of heuristics. Due to possible conflicts with JaCoCo and c8 instrumentation, and due to the fact that EvoMaster uses its own driver classes (which need to be written manually) to start and stop the instrumented SUTs, this set of experiments was run differently compared to the black-box ones.
We ran
EvoMaster on each SUT for 10 repetitions (so, 190 runs), each one for 1 hour (like for the black-box experiments). However, JaCoCo and c8 are not activated. After each run,
EvoMaster generates statistics files, including information on code coverage and fault detection. However, as there can be differences in how line coverage is computed between
EvoMaster and JaCoCo/c8, it would be difficult to reliably compare with the results in Table
2. Therefore, for the comparisons, we ran
EvoMaster as well in gray-box mode (for another 190 runs). This mode generates test cases in exactly the same way as black-box mode, with the difference that the SUT is instrumented, and coverage metrics are computed at each test execution. A further benefit is that besides code coverage, we can reliably compare fault detection as well, as this metric is computed in exactly the same way (as it is the same tool). As
EvoMaster is the black-box fuzzer that gives the highest code coverage (Section
4.3), it is not a major validity threat to compare white-box results only with
EvoMaster. These 380 runs added a further 15.8 days of computational effort.
Table
8 shows these results. A few things appear quite clearly. First, on average, line coverage goes up by 7.5% (from 45.4% to 52.9%). Even with just 10 runs per API, results are statistically significant in most cases.
For APIs like rest-ncs, average coverage can go even higher than 90%. For other APIs like features-service, improvements are more than 13% (e.g., from 68.8% to a maximum of 81.1%). In the case of ind0, although the achieved coverage is relatively low (i.e., <20%), it still more than doubles (i.e., from 8.4% to 18.8%).
Although results are significantly improved compared to black-box testing, for nearly half of these SUTs it was still not possible to achieve more than 50% line coverage. Additionally, there are two interesting cases in which results with white-box testing are actually significantly
worse (i.e., for
realworld-app and
ocvn-rest). In the former case, the difference is minimal, just 0.2%. In the latter case, however, the difference is substantial, as it is 10.5% (from 35.3% down to 24.8%). This is not an unusual behavior for search algorithms, as it all depends on the quality of the
fitness function and the properties of the
search landscape [
42]. If a fitness function gives no gradient to the search algorithm, it can easily get stuck in local optima. In such cases, a random search can give better results.
EvoMaster uses several heuristics in its fitness function to try to maximize code coverage, but those do not work for
ocvn-rest, as we will discuss in more detail in Section
4.7. However, improving the fitness function in this case can be done (which we will address in future versions of
EvoMaster).
Regarding fault detection, white-box testing leads to detection of more faults. In the context of REST API testing, a 500 status code is regarded as a potential fault occurring in the SUT, which can be identified by both black-box and white-box approaches, by extracting information from the HTTP responses. It is not unexpected that a white-box approach achieves better performance, as usually there is a strong correlation between code coverage and fault detection [
41]. For example, you cannot detect a fault if the code it hides in is never executed. However, it also relies on the employed fitness functions and heuristics, such as whether they focus more on fault detection than on other test criteria (e.g., code coverage and schema-related coverage [
48]). The interesting aspect here is that the improvement is not much, just 3.4 more faults on average. In these experiments, random testing can find 36.5 faults on average, and so the relative improvement is less than 10%. For the problem domain of fuzzing RESTful APIs (and likely web APIs in general), this is not surprising [
57]. Many of these APIs do crash as soon as an invalid input is provided, instead of returning a proper user error response (e.g., with the HTTP status code in the 4xx family). Additionally, random search is quite good at generating invalid input values. Therefore, many faults can be easily found this way with a fuzzer at the very first layer of input validation in the SUT’s code, even when the achieved code coverage is relatively low.
4.7 RQ5: Open Problems
To identify open problems, we performed a detailed analysis based on the results of the existing fuzzers on real APIs (18 open source and 1 industrial). We ran seven tools/configurations on 19 RESTful APIs, with a budget of 1 hour per experiment. However, only in one single case was it possible to achieve 100% line coverage (i.e.,
Schemathesis on
js-rest-ncs, recall Table
2). In general, it might not be possible to achieve 100% line coverage, as some code can be unreachable. For example, this is a common case for constructors in static-method-only classes, as well as catch blocks for exceptions that cannot be triggered with a test. However, for several of these SUTs, it was not even possible to reach 50% line coverage.
It is not within the scope of this work to define a good coverage target to aim for (80%? 90%?). However, clearly, the higher the code coverage, the better it would be for practitioners using these fuzzers. Therefore, to improve those fuzzers, it is of paramount importance to understand their current limitations. To answer this question, we studied in detail the logs of the tools in Section
4.7.1 and large parts of the source code of the SUTs in Section
4.7.2 (recall that those are more than 280,000 LOCs; see Table
1). An in-depth analysis of the current open problems in fuzzing RESTful APIs is an essential scientific step to gain more insight into this problem. This is needed to be able to design novel and more effective techniques.
Here, we do two different types of analyses. First, we look at the logs generated by the tools (Section
4.7.1), which helps in pointing out possible major issues in these fuzzers (e.g., when they crash, they might generate log messages with stack traces). Second, we run the generated tests and manually check the code they execute (Section
4.7.2), to analyze what was not covered. This helps provide hypotheses to explain why that was the case, which is needed to be able to design new techniques to achieve higher coverage. Finally, we summarize all of these findings (Section
4.7.3).
It is important to stress that the goal of these analyses is to identify general problems, or instances of problems that are likely going to be present in other SUTs as well. Designing novel techniques that just overfit for a specific benchmark is of little to no use. Ultimately, what is important will be the results that the practitioners can obtain when using these fuzzers on their APIs.
4.7.1 Analysis of the Logs.
From the tool logs, at least four common issues are worthy of discussion. First, OpenAPI/Swagger schemas might have some errors (e.g., this is the case for cyclotron, disease-sh-api, cwa-verification, features-service, proxyprint, and ocvn-rest). This might happen when schemas are manually written, as well as when they are automatically derived from the code with some tools/libraries (as those tools might have faults). In these cases, most fuzzers just crash, without making any HTTP call or generating any test case. It is important to warn the users about these issues with their schemas, but fuzzers should likely be more robust and not crash (e.g., endpoints with schema issues could simply be skipped).
The second issue can be seen in
languagetool. Most fuzzers for RESTful APIs support HTTP body payloads only in JSON format. However, in HTTP, any kind of payload type can be sent. JSON is the most common format for RESTful APIs [
63], but there are others as well, like XML and the
application/x-www-form-urlencoded used in
languagetool. From the results in Table
2, it looks like only
EvoMaster supports this format. On this API,
EvoMaster achieves between 26% and 35.1% code coverage, whereas no other fuzzer achieves more than 2.5%.
The third issue is specific to scout-api, which presents a special case of JSON payloads. Most tools assume a JSON payload to be a tree—for example, a root object A that can have fields that are objects themselves (e.g., A.B), and so on recursively (e.g., A.B.C.D and A.F.C), in a tree-like structure. However, an OpenAPI/Swagger schema can define objects that form graphs, as is the case for scout-api. For example, an object A can have a field of type B, but B itself can have a field of type A. This recursive relation creates a graph that needs to be handled carefully when instantiating A (e.g., optional field entries can be skipped to avoid an infinite recursion, which otherwise would lead the fuzzers to crash).
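A minimal, hypothetical sketch of such a recursive structure, and of a bounded instantiation that skips deeper optional fields to avoid infinite recursion, could look as follows:

```java
// Hypothetical classes mirroring a recursive OpenAPI schema: A references B,
// and B references A, so a naive recursive instantiation would never terminate.
class A {
    String name;
    B child;   // optional field
}

class B {
    int value;
    A parent;  // optional field, closing the cycle A -> B -> A
}

public class RecursiveSchemaExample {

    // Instantiate A, cutting the recursion by leaving optional fields null
    // once a maximum depth is reached.
    static A buildA(int depth, int maxDepth) {
        A a = new A();
        a.name = "generated";
        if (depth < maxDepth) {
            a.child = buildB(depth + 1, maxDepth);
        }
        return a;
    }

    static B buildB(int depth, int maxDepth) {
        B b = new B();
        b.value = 42;
        if (depth < maxDepth) {
            b.parent = buildA(depth + 1, maxDepth);
        }
        return b;
    }

    public static void main(String[] args) {
        A root = buildA(0, 2); // stops after two levels instead of recursing forever
        System.out.println(root.child != null && root.child.parent != null);
    }
}
```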
The fourth issue is related to the execution of HTTP requests toward the SUTs. To test a REST API, the fuzzers build HTTP requests based on the schema and then use different HTTP libraries to send the requests toward the SUT—for example,
JerseyClient is used in
EvoMaster. By analyzing the logs, we found that for several case studies, some fuzzers seem to not have problems in parsing schemas, but they fail to execute the HTTP requests. For instance,
javax.net.ssl.SSLException: Unsupported or unrecognized SSL message was thrown when
RestTestGen processed
realworld-app with
OkHttpClient. For
js-rest-ncs,
js-rest-scs, and
ind0, we found that
404 Not Found responses were always returned when fuzzing them with
RESTler v8.5.0. By checking the logs obtained by
RESTler, we found that it might be due to a problem in generating the right URLs for making the HTTP requests. For the SUT whose
basePath is
/,
RESTler seems to generate double slash (i.e.,
//) in the URL of the requests. However, whether to accept the double slash to match a real path depends on the SUTs (i.e.,
js-rest-ncs,
js-rest-scs, and
ind0 do not allow it). In Figure
2, we provide the logs obtained by
RESTler, representing the processed requests that contain the double slash and responses returned by
js-rest-ncs (Figure
2(a)) and
rest-ncs (Figure
2(b)).
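The following is a small, hypothetical illustration of how naively joining a basePath of / with an endpoint path produces a double slash, and how the URL could be normalized (host, port, and paths are just examples):

```java
public class UrlJoinExample {

    // Naive concatenation: when basePath is just "/", this produces a
    // double slash (e.g., "http://localhost:8080//api/resource").
    static String naiveJoin(String host, String basePath, String endpointPath) {
        return host + basePath + endpointPath;
    }

    // Normalized join that collapses duplicated separators in the path part.
    static String normalizedJoin(String host, String basePath, String endpointPath) {
        String path = (basePath + "/" + endpointPath).replaceAll("/+", "/");
        return host + path;
    }

    public static void main(String[] args) {
        System.out.println(naiveJoin("http://localhost:8080", "/", "/api/resource"));
        System.out.println(normalizedJoin("http://localhost:8080", "/", "/api/resource"));
    }
}
```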
4.7.2 Analysis of the Source Code.
For each SUT, we manually ran the best (i.e., highest code coverage) test suites (which are the ones generated by EvoMaster) with code coverage activated. Then, in an IDE, we manually looked at which branches (e.g., if statements) were reached by the test case execution but not covered, as well as at the cases in which the test execution was halted in the middle of a code block due to a thrown exception. This was done to understand the current issues and challenges that these fuzzers need to overcome to achieve better results.
To be useful for researchers, these analyses need to be of a “low level”, with concrete discussions about the source code of these APIs. No general theories can be derived without first looking at and analyzing single instances of a scientific/engineering phenomenon. As different groups of readers might be interested in different levels of detail in the software, and some readers might not be currently active in the development of a fuzzer, here we provide only a summary. Full analyses for each SUT are provided in Appendix
A.
Table
9 summarizes all identified
main issues for each SUT. There can be many reasons why a higher coverage was not achieved. Here, based on our analyses, we have categorized six main issues:
Authentication:
In some cases, there is the need to provide authentication information with specific roles, using specific types of authentication (e.g., an API can provide different ways to authenticate).
Databases:
The execution might depend on the data returned by database queries, but those queries might return nothing, as their constraints (e.g., the content of WHERE clauses in SQL SELECT queries) are not satisfied. It might be hard to generate the right data to satisfy those constraints (a concrete sketch of this issue is shown after this list).
External services:
The API depends on specific interactions with external services (e.g., other APIs on the internet). However, from the point of view of the fuzzers, there is no way to control what those external services return.
Schema inconsistencies:
The schema of the API might be underspecified (e.g., some data constraints are missing, and some query parameters used in the API might not be described), or it might even provide wrong information (e.g., wrong types for input parameters).
String constraints:
String input data might need to match specific formats (e.g., based on regular expressions), which is not simple to infer.
Unreachable code:
Some code can be simply unreachable via the entry points of the API. It might be dead code, or code executed by other scripts or functionality (e.g., a GUI and background threads).
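To make the Databases issue more concrete, consider the following hypothetical JDBC-based handler (not taken from any of the SUTs in the case study): its main branch is covered only if the database already contains data satisfying the WHERE clause, which a fuzzer must either create through other endpoints or insert directly via SQL.

```java
// Hypothetical JDBC-based handler illustrating the database-constraint issue.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ProductLookup {

    private final Connection connection;

    public ProductLookup(Connection connection) {
        this.connection = connection;
    }

    public String findActiveProduct(String name) throws Exception {
        // The branch below is covered only if the database already contains a row
        // matching both constraints; generating such data is the hard part for a fuzzer.
        PreparedStatement stmt = connection.prepareStatement(
                "SELECT id FROM products WHERE name = ? AND status = 'ACTIVE'");
        stmt.setString(1, name);
        ResultSet rs = stmt.executeQuery();
        if (rs.next()) {
            return "found product with id " + rs.getLong("id");
        }
        // With an empty database, a fuzzer always ends up here.
        return "no active product named " + name;
    }
}
```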
Note that these are only the main issues we
currently face. Solving them does not imply maximizing code coverage. As more code is executed, more issues could be identified. A concrete example is
cwa-verification. Such an API does connect to an external service, but the code doing this is currently not executed by any of the fuzzers. Dealing with external services would hence currently give no improvement on
cwa-verification. However, when the other issues in
cwa-verification are solved (recall Table
9), then dealing with external services will be needed to improve coverage results further.
4.7.3 Discussion.
Based on the analyses of the logs and tests generated for the 19 APIs, some general observations can be made:
•
Many research prototypes are not particularly robust and can crash when applied on new SUTs. For example, we have faced this issue with
EvoMaster many times, such as when adding a new API to EMB [
7]. Although we add new APIs to EMB each year, EMB has been available as open source since 2017, and anyone can use it for their empirical studies and make sure their tools do not crash on it. However, it is important to stress that in this work, we have compared
tool implementations rather than
techniques. For example, a tool with low performance (e.g., due to crashes) could still feature novel techniques that could be quite useful (e.g., if integrated or re-implemented in a more mature tool).
•
Like software, schemas can also have faults and/or omissions (e.g., constraints on some inputs might be missing). This issue seems quite common, especially when schemas are written manually. Although this problem could be addressed by white-box testing (currently supported only by EvoMaster) by analyzing the source code of the SUTs, it looks like a major issue for black-box testing, which might not have a viable solution.
•
Interactions with databases are common in RESTful APIs. To execute the code to fetch some data, such data should first be present in the database. The data could be created with endpoints of the API itself (e.g., POST requests), as well as inserted directly into the database with SQL commands. Both approaches have challenges, such as how to properly link different endpoints that work on the same resources, and how to deal with data constraints that are specified in the code of the API and not in the SQL schema of the database. The compared fuzzers provide different solutions to address this problem, but it is clear that more still need to be done.
•
Currently, no fuzzer deals with the mocking of external web services (e.g., using libraries like WireMock; a minimal sketch of such mocking is shown after this list). Testing with external live services has many shortcomings (e.g., the generated tests can become flaky), but it might be the only option for black-box strategies. For white-box strategies, mocking of web services will be essential when testing industrial enterprise systems developed with a microservice architecture (as in this class of software, interactions between web services are quite common).
•
Authentication information needs to be provided when fuzzing APIs that require different users to login. Even with a white-box fuzzer, it might not be feasible to automatically create different users with different roles. For example, typically passwords in databases are hashed, and reverse engineering on how to create valid hashed passwords (e.g., when creating data into the database as part of the tests [
32]) is out of reach from current fuzzers. To reach specific parts of the code, there might be the need of specific users with specific roles. If those are not provided/set up before the fuzzing is done, then such code cannot be executed. A tester might provide different user profiles for fuzzing, but some important roles/setups might be missing from these manual configurations.
•
Constraints on string inputs seem quite common in RESTful APIs. Those constraints might be missing in the schema (e.g., a specific string input needs to satisfy a given regular expression). White-box fuzzers can analyze the code to see how such strings are manipulated, but this is not trivial. Black-box fuzzers could try to infer these constraints with natural language processing (e.g., based on the names of these inputs, and possibly their descriptions, if any are provided).
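A minimal sketch of mocking an external web service with WireMock (hypothetical endpoint and payload; requires the WireMock library as a dependency) could look as follows:

```java
// Minimal WireMock sketch; the stubbed endpoint and payload are hypothetical.
import com.github.tomakehurst.wiremock.WireMockServer;

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
import static com.github.tomakehurst.wiremock.core.WireMockConfiguration.options;

public class ExternalServiceMock {

    public static void main(String[] args) {
        // Start a local mock server standing in for the external web service.
        WireMockServer wireMock = new WireMockServer(options().port(8090));
        wireMock.start();

        // Stub a response for the external endpoint the SUT depends on, so that
        // generated tests do not become flaky due to live external services.
        wireMock.stubFor(get(urlEqualTo("/external/status"))
                .willReturn(aResponse()
                        .withStatus(200)
                        .withHeader("Content-Type", "application/json")
                        .withBody("{\"up\": true}")));

        // The SUT would then be configured to call http://localhost:8090
        // instead of the real external service.
    }
}
```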
Many of the challenges we have identified (e.g., dealing with databases and underspecified schemas) are specific to RESTful APIs, although they would likely apply to the fuzzing of other types of web services as well (e.g., GraphQL [
39,
40] and RPC [
72,
73]). These challenges would not be meaningful in different testing contexts (e.g.,
unit test generation or fuzzing of parser libraries).
Issues with providing manual auth configurations would likely apply to any kind of software that needs authentication information. Dealing with constraints on string data is likely the most general challenge, as it applies as well, for example, to unit testing.
This means that solving these challenges would not only help in the testing of RESTful APIs but would likely have positive effects on other testing domains as well.