4.2 API Schema
At a high level, a GraphQL API can be seen as a server opening a TCP socket and processing HTTP requests whose body payloads are written in a specific format (i.e., using the GraphQL language) for the different functionalities provided by the API. Sending random bytes over such a TCP connection is unlikely to produce any meaningful message, and such input would be immediately discarded by the API. Likewise, sending properly formatted GraphQL messages would result only in errors if those messages are not based on the actual entry points and expected input types of the API.
To send meaningful GraphQL messages that would execute the business logic of the API, such messages must be based on the schema of the API (recall Section 2.1). Each GraphQL API must have a schema definition, which can be retrieved online from the API itself (unless this option is disabled for security reasons).
To fetch the whole schema from a GraphQL API, an introspective query is used. Given an entry point to the GraphQL API (e.g., typically a /graphql HTTP endpoint), GraphQL enables a standard way to fetch a schema description of the API itself. The schema specifies all the information about the available operation types, such as queries, mutations, and all available data types on each of them. As a result, the GraphQL schema is returned in JSON format.
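As a rough illustration of such a request, the following Python sketch builds (without sending) an HTTP POST that carries a reduced introspection query. The query text uses only standard introspection fields; the endpoint URL and the function name are assumptions made for this example, and the query is much smaller than what a fuzzer would need in practice.

```python
import json
import urllib.request

# Reduced introspection query: root operation types plus, for every type,
# its fields, their arguments, and shallowly unwrapped type references.
INTROSPECTION_QUERY = """
query IntrospectionSketch {
  __schema {
    queryType { name }
    mutationType { name }
    types {
      kind
      name
      fields {
        name
        args { name type { kind name ofType { kind name } } }
        type { kind name ofType { kind name } }
      }
    }
  }
}
"""


def build_introspection_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP POST carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local deployment; the URL is an assumption for illustration.
request = build_introspection_request("http://localhost:8080/graphql")
```

Passing the request to urllib.request.urlopen would return the schema description as a JSON document, provided introspection is enabled on the server.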
Let us consider the following example of the introspective query, in which we query one of the SUTs used in our empirical study in Section 5 (i.e., petclinic [10]) to obtain the resources that are available.
In this introspective query, we query the field __schema, which provides all information about the schema of a GraphQL service. It is a meta-field used by GraphQL for the introspection system. This field is accessible from the root type of a query operation, and its type is defined next.
By querying the fields queryType and mutationType, the GraphQL petclinic server will return all queries and mutations available from the schema. In this case, both query and mutation operations are available.
The field named types (of kind __Type) is at the core of the introspection system. It represents all types in the system: both named types (e.g., OBJECT kind) and type modifiers (e.g., NON_NULL kind). A reduced subset (due to its length) of the returned result of the introspective query for this example is shown next.
Here, the Query type is an object that has a field called owner. The owner field takes as argument a non-nullable integer named id and returns an object named Owner.
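To make the structure of such a result concrete, the sketch below walks a hand-written fragment shaped like an introspection response and prints each field in schema notation. The fragment only mirrors what is described in the text (a Query type with an owner field taking a non-nullable integer id and returning Owner); it is not the actual petclinic output, and the helper names are assumptions.

```python
def render_type(type_ref: dict) -> str:
    """Render an introspection type reference in schema notation (e.g., Int!)."""
    kind = type_ref["kind"]
    if kind == "NON_NULL":  # type modifier wrapping another type
        return render_type(type_ref["ofType"]) + "!"
    if kind == "LIST":  # type modifier wrapping the element type
        return "[" + render_type(type_ref["ofType"]) + "]"
    return type_ref["name"]  # named type (SCALAR, OBJECT, ENUM, ...)


def describe_fields(type_def: dict) -> list:
    """Return one 'name(args): Type' line per field of an OBJECT type."""
    lines = []
    for field in type_def["fields"]:
        args = ", ".join(
            f"{arg['name']}: {render_type(arg['type'])}" for arg in field["args"]
        )
        signature = f"{field['name']}({args})" if args else field["name"]
        lines.append(f"{signature}: {render_type(field['type'])}")
    return lines


# Hand-written fragment (an assumption, not real server output).
QUERY_TYPE = {
    "kind": "OBJECT",
    "name": "Query",
    "fields": [
        {
            "name": "owner",
            "args": [
                {
                    "name": "id",
                    "type": {
                        "kind": "NON_NULL",
                        "name": None,
                        "ofType": {"kind": "SCALAR", "name": "Int", "ofType": None},
                    },
                }
            ],
            "type": {"kind": "OBJECT", "name": "Owner", "ofType": None},
        }
    ],
}

print(describe_fields(QUERY_TYPE))  # ['owner(id: Int!): Owner']
```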
Like the software of the API itself, GraphQL schemas can also have faults. This, for example, is a common issue among RESTful APIs [61]. As the schema is the main source of information on how to prepare syntactically valid requests, issues in the schema can have negative impacts on the performance of the fuzzing sessions [80].
Most GraphQL frameworks (e.g., Apollo [3]) do validate the syntax of the API endpoints based on the defined schema; for example, they check the presence and format of each endpoint (i.e., query and mutation methods). In case of errors and mismatches, they would return a response with an error message at each incoming request or simply crash the server at start-up time. This kind of issue can be quickly identified and corrected. However, a schema could be underspecified. For example, the API could have implementations of endpoints that are not declared in the schema. But it would not be possible to call any of these endpoints, as the GraphQL frameworks running the API would not be aware of those endpoints. Therefore, even in these cases, such issues would be easily detected by users without needing any fuzzer.
This means that for GraphQL APIs, in contrast to RESTful APIs (where typically the framework servers do not validate the schemas), problems in the schemas do not seem to be as serious for testing purposes. However, more research will be needed to evaluate this potential issue in more detail.
4.3 Problem Representation
Once the schema of the tested API is fetched, it is parsed in our EvoMaster extension and used to create a set of action templates, one for each query and mutation operation. Each action will contain information on the fields related to input arguments (if any are present) and return values. A chromosome template is defined for each action, which is composed of non-mutable information (e.g., the field names) and a set of mutable genes. In this context, each gene characterizes either an argument or a return value in the GraphQL query/mutation.
For objects as return values, a query/mutation must specify which fields should be returned (at least one must be selected), and so on recursively if any of the selected fields are objects as well. To represent the fact that each field is optional in a query, a return gene is modeled by an object gene in which all fields are optional. However, we had to extend the mutation operator in EvoMaster with a post-processing phase to guarantee that at least one field gene is selected during the search. In other words, if a mutation of a gene representing a returned object value in the GraphQL query/mutation leaves all fields deselected, then the post-processing forces the selection of one of them (and so on recursively if the selected field is an object itself). If a return value is a primitive type, there is no need to create any gene for it, as there is no selection to make. Furthermore, similar to function calls, fields in the returned value can have input arguments themselves. When a parent field is resolved, the input arguments and the returned value of each child field are handled recursively, until scalar values are reached in both the input arguments and the returned values. To model those function calls, we introduced a new special type of gene called Tuple, discussed next.
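The post-processing step described in this paragraph can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the EvoMaster implementation: the class names, the deterministic choice of the first field, and the example field names are all hypothetical, and cyclic object references are not handled here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FieldGene:
    """Optional field of a returned object; `nested` is set for object-typed fields."""
    selected: bool = False
    nested: Optional["ObjectGene"] = None


@dataclass
class ObjectGene:
    """Returned object whose fields are all individually optional."""
    fields: Dict[str, FieldGene] = field(default_factory=dict)


def repair_selection(obj: ObjectGene) -> None:
    """Force at least one selected field, recursing into selected object fields."""
    if obj.fields and not any(f.selected for f in obj.fields.values()):
        # Deterministic choice for the sketch; a search could pick at random.
        next(iter(obj.fields.values())).selected = True
    for gene in obj.fields.values():
        if gene.selected and gene.nested is not None:
            repair_selection(gene.nested)


# Hypothetical return object in which a mutation has deselected every field.
owner = ObjectGene({
    "pets": FieldGene(nested=ObjectGene({"name": FieldGene()})),
    "firstName": FieldGene(),
})
repair_selection(owner)
```

After the call, pets and its nested name field are selected, while firstName stays deselected, so the resulting query has a non-empty selection at every level.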
To fully represent what is available from the GraphQL specification, the following kinds of gene types from EvoMaster have been reused and adapted:
(1) String: This gene contains string variables, each defined by an array of characters. The minimum length of a string is zero, which represents the empty string. A string gene cannot exceed a predefined maximum number of characters (e.g., 100).
(2) Enum: This gene represents the enumeration type, where a set of possible values is defined, and only one value is activated at a given time. The elements in the set can be in different formats (e.g., enumerations of numbers or enumerations of strings).
(3) Float/Integer/Boolean: These are genes representing variables with simple data types. Boolean genes represent variables with true or false values. Integer and float genes represent integer and real-value variables, respectively.
(4) Array: This gene represents a sequence of genes with the same type. This gene has variable length, where elements can be added and removed throughout the search. To avoid creating overly large test cases (e.g., with millions of genes), the size of an array gene should not exceed a given threshold.
(5) Object: This gene defines an object with a specific set of internal fields. Unlike the array gene, whose elements must all have the same type, an object gene may contain elements with different types. To do so, this gene is represented by a map, where each key in the map is the name of a field in the object.
(6) Optional: This is a gene containing another gene, whose presence in the phenotype is controlled by a Boolean value. This is needed, for example, to represent nullable types in arguments and selection of fields in returned objects.
(7) CycleObject: This special gene is used as a placeholder to avoid infinite cycles when selecting object fields that are objects themselves, which could be references back to the starting queried object. Once a test case is sampled, its gene tree structure is scanned, and all CycleObject genes are forced to be excluded from the phenotype (e.g., if a CycleObject is inside an Optional gene, that gene is marked as non-selected, and the mutation operator is prevented from selecting it; if the CycleObject is the element type of an array gene, the array gets a fixed size of 0, and the mutation operator is prevented from adding new elements to it).
(8) LimitObject: GraphQL schemas are often very large and complex, and the levels of nesting fields can be potentially huge. We use this special gene as a placeholder when a customized depth limit is reached. The depth is the number of nesting levels of the object fields.
(9) Tuple: This gene is needed, for example, when representing the inputs of function calls. It is composed of a list of elements of possibly different types, where the last element can be treated specially. For example, this is the case of function calls when the return type is an object, on which we need to select what to retrieve (and these selected elements could be function calls as well, and therefore this is handled recursively).
To make this discussion clearer, let us consider a small, simplified portion of the schema of GitLab (one of the SUTs used in our empirical study in Section 5).
To send the query to the server, the user must follow the preceding representation of the schema. For instance, to query the field fullName, the user might send the following query.
This query is syntactically valid, conforming to the schema represented previously. But it is not the only possible query conforming to such schema. A user could instead send a permissionScope with value TRANSFER_PROJECTS, or such optional input could simply be omitted altogether. So, a genotype representation needs to be able to express all possible queries that are valid for the given schema. The tree in Figure 3 shows a genotype structure (seen as a tree of genes) for such schema, using the previously discussed gene system used in our framework. For example, the action representing the GraphQL query currentUser has the object gene UserCore, which contains an optional tuple field groups. When calling currentUser, one needs to specify which fields of the returned object UserCore to include in the response. In this particular example, there is only one field called groups. Because returning any given field is optional, each of these fields is placed inside an optional gene. If an optional gene is deactivated, none of its internal genes is used in the phenotype of the test case. The field groups is itself a remote function call. It is represented with a tuple gene, having an input argument permissionScope and return value GroupConnection. The argument is represented as an optional gene containing an enum gene (for the two possible values defined in the schema). The return value GroupConnection is represented with an object gene. For each field of an object, we need a gene to represent it. A field can be yet another object, or an array of them, like the case of nodes. So, this process is applied recursively. The non-method/non-object fields are represented with a Boolean gene (to check whether it will be part of the returned object or not), like the case of fullName. In this simplified example, the choices that need to be made are, for example, whether the optional genes should be active or not, the Boolean values of the Boolean genes, and values for the enum genes. The evolutionary process will make modifications to these values throughout the search. The phenotype derived from this genotype will then always represent a syntactically valid query.
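A genotype-to-phenotype mapping of this kind can be sketched in a few lines of Python. The nested-dictionary encoding below is purely an assumption for illustration (it is not how EvoMaster stores genes), and the exact placement of optional genes around nodes is a guess; only the names currentUser, groups, permissionScope, TRANSFER_PROJECTS, nodes, and fullName come from the discussion above.

```python
def render(name: str, gene: dict):
    """Return the query text for one field gene, or None if it is not selected."""
    kind = gene["kind"]
    if kind == "optional":  # inactive optional genes vanish from the phenotype
        return render(name, gene["gene"]) if gene["active"] else None
    if kind == "boolean":  # scalar field: either included or not
        return name if gene["value"] else None
    if kind == "array":  # the selection applies to the element type
        return render(name, gene["element"])
    if kind == "object":
        parts = [render(n, g) for n, g in gene["fields"].items()]
        parts = [p for p in parts if p is not None]
        if not parts:
            raise ValueError(f"no field selected for {name}")
        return f"{name} {{ {' '.join(parts)} }}"
    if kind == "tuple":  # function call: arguments first, return value last
        args = [
            f"{n}: {a['gene']['value']}"
            for n, a in gene["args"].items()
            if a["active"]
        ]
        call = f"{name}({', '.join(args)})" if args else name
        return render(call, gene["last"])
    raise ValueError(f"unknown gene kind: {kind}")


# Hypothetical encoding of the gene tree for the currentUser action.
current_user = {
    "kind": "object",
    "fields": {
        "groups": {"kind": "optional", "active": True, "gene": {
            "kind": "tuple",
            "args": {"permissionScope": {
                "kind": "optional", "active": True,
                "gene": {"kind": "enum", "value": "TRANSFER_PROJECTS"}}},
            "last": {"kind": "object", "fields": {
                "nodes": {"kind": "optional", "active": True, "gene": {
                    "kind": "array", "element": {"kind": "object", "fields": {
                        "fullName": {"kind": "boolean", "value": True}}}}}}},
        }},
    },
}

query = "{ " + render("currentUser", current_user) + " }"
print(query)
```

Deactivating the optional gene around permissionScope yields the same query without the argument, which is how a single genotype template covers several valid phenotypes.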
For this subset of the schema, the search space of all possible queries is small. However, it would increase exponentially when dealing with more inputs, particularly for strings.
To fully support the whole specification of GraphQL, there are several special cases that need to be handled, like the use of interfaces. To deal with GraphQL interfaces, we use an optional object gene for each type that implements the interface, together with an extra optional object gene (labeled with BASE) to specify the interface fields themselves. Consider the following portion of the schema from digitransit (one of the SUTs used in our empirical study in Section 5).
Here, the interface Node has two possible implementations: Trip and TicketType. A user can query different fields based on the concrete types of the returned objects. For example, assume querying the fields id, tripHeadsign, and price, as described next.
The tree in Figure 4 shows the genotype for this user query containing interfaces, based on the gene system used in our framework. Here, whether to query the different concrete types of the interfaces, and their fields, is optional. Indeed, a genotype representation needs to be able to express all these possible kinds of valid phenotypes.
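In GraphQL, fields of concrete types behind an interface are selected with inline fragments. The sketch below shows how selected BASE fields and per-type fields could be turned into such a query. The entry point node(id: "1") is an assumption (the text does not name the query field), as is the assignment of tripHeadsign to Trip and price to TicketType; the function name is hypothetical.

```python
def render_interface_query(call: str, base_fields: list, type_fields: dict) -> str:
    """Render a query on an interface-typed field using inline fragments.

    base_fields: selected fields declared on the interface itself (BASE gene).
    type_fields: selected fields per implementing type; an empty list means the
                 optional object gene for that type is deactivated.
    """
    selections = list(base_fields)
    for type_name, fields in type_fields.items():
        if fields:
            selections.append(f"... on {type_name} {{ {' '.join(fields)} }}")
    if not selections:
        raise ValueError("at least one field must be selected")
    return f"{{ {call} {{ {' '.join(selections)} }} }}"


query = render_interface_query(
    'node(id: "1")',  # hypothetical entry point returning the Node interface
    base_fields=["id"],
    type_fields={"Trip": ["tripHeadsign"], "TicketType": ["price"]},
)
print(query)
```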
After defining the possible types of genes supported by the proposed framework, we consider the solution space, where each solution is a set of test cases. A test case is composed of one or more HTTP requests. To represent an HTTP request, we typically need to deal with its components: HTTP verb, path and query parameters, body payloads (if any), and headers.
A GraphQL request can be sent via HTTP GET (used only for queries) or HTTP POST methods with a JSON body (used for queries and mutations). For simplicity, we only use the verb POST for both queries and mutations. A GraphQL server uses a single URL endpoint (typically /graphql), where the HTTP requests with the GraphQL queries/mutations will be sent. In the context of test generation for a GraphQL API, the main decisions to make are on how to create JSON body payloads to send. The genotype will contain genes (from the set defined previously) to represent and evolve such JSON objects.
In Figure 5, there is an example of a test case generated automatically by EvoMaster for the petclinic API, output in JUnit format. It is composed of two HTTP POST requests: the first has a body payload querying the entry point specialities, and the second requests the entry point owner. When a test case is generated and evaluated, we also provide assertions on the returned responses.
The test cases are generated in a random way, but they are still syntactically valid. For instance, if we consider the example illustrated in Figure 1(a), the test cases are generated by exploring the fields of the pets node. Consider the field “id”, an integer represented in 32 bits. The number of possible test cases for the field “id” is \(2^{32}\). We also explore different combinations of two or more fields in each node. For instance, considering the same example illustrated in Figure 1(a), the test cases might be generated from both fields “id” and “name” of the node pets. If we consider the length of the string is limited to 10, the number of possible test cases for the field “name” is \(2^{160}\) (assuming each character is 2 bytes). Therefore, the number of possible test cases obtained by exploring only the fields “id” and “name” is \(2^{32} \times 2^{160} = 2^{192}\), which results in an immense search space. Therefore, in our implementation, to mitigate the combinatorial explosion, we use a threshold to limit the number of generated test cases that can be evaluated (i.e., we limit the number of test cases we sample during RS).
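The arithmetic behind these figures can be checked with a few lines of Python. The constants restate the assumptions in the text (a 32-bit integer, 10 characters, 2 bytes per character); the last quantity, which also counts shorter strings, is an addition for illustration only.

```python
ID_BITS = 32          # "id" is a 32-bit integer
NAME_LENGTH = 10      # characters in "name"
BITS_PER_CHAR = 16    # 2 bytes per character

id_values = 2 ** ID_BITS
# Strings of exactly NAME_LENGTH characters: 2^(10 * 16) = 2^160.
name_values = 2 ** (NAME_LENGTH * BITS_PER_CHAR)
# Independent choices multiply, so the exponents add: 2^32 * 2^160 = 2^192.
combined = id_values * name_values

# Counting every length from 0 up to NAME_LENGTH gives a slightly larger number.
names_up_to_limit = sum(2 ** (BITS_PER_CHAR * k) for k in range(NAME_LENGTH + 1))

print(combined == 2 ** 192)
```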
4.4 Search Operators
Once a chromosome representation is defined based on the GraphQL schema, test cases are evolved and evaluated in the same way as done for RESTful APIs in EvoMaster (recall Section 2.2), including testability transformations [25] and SQL database handling [24]. Internally, the search algorithms in EvoMaster are implemented in a generic way, independently of the addressed problem (e.g., REST and GraphQL APIs), and it is only a matter of defining an appropriate phenotype mapping function (e.g., how to create a valid HTTP request for a GraphQL API based on the evolved chromosome genotype).
As stated previously, an evolving individual will be a set of actions (i.e., calls to query and mutation endpoints) on the tested API. Each action is represented with a gene tree template (e.g., recall examples in Figures 3 and 4), which needs to be instantiated (i.e., set the values of the genes). As part of the search, there are three main search operators: (1) random sampling, (2) mutation on the structure of the tests, and (3) mutation on the content of an action. Note that the term mutation in the context of GraphQL APIs (used to represent an endpoint in the API that can modify its state) has nothing to do with the term mutation used in the evolutionary computation literature (used to represent search operators that make small changes to the evolving individuals).
When sampling a new individual at random (e.g., needed for RS, as well as for evolutionary algorithms when they need to initialize their first population of individuals to evolve), first there is the need to choose how many actions K it contains. For example, it can be randomly chosen between 1 and N (e.g., where \(N=10\)). Given A, the set of possible action templates (one for each query/mutation endpoint in the API), each of these K actions in the sampled test will be chosen randomly from A. Then, the content of each gene in such trees is set at random (considering their types and constraints).
Mutation operators are used to make small changes to an evolving individual. The structure of the test can be modified by removing an action from the current K (if \(K\gt 1\)) or by adding a new random action from A (if \(K\lt N\)). This can be applied with a given probability P.
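Random sampling and the structure mutation can be sketched as follows. This is a simplified illustration, not the EvoMaster code: actions are represented only by their template names (gene initialization is omitted), and the even split between removing and adding, as well as the random insertion position, are assumptions.

```python
import random


def sample_individual(templates: list, max_actions: int, rng: random.Random) -> list:
    """Sample K in [1, N] and pick each of the K actions at random from A."""
    k = rng.randint(1, max_actions)
    return [rng.choice(templates) for _ in range(k)]


def mutate_structure(individual: list, templates: list, max_actions: int,
                     rng: random.Random) -> list:
    """Return a copy with one action removed (if K > 1) or one added (if K < N)."""
    mutated = list(individual)
    can_remove = len(mutated) > 1
    can_add = len(mutated) < max_actions
    if can_remove and (not can_add or rng.random() < 0.5):
        del mutated[rng.randrange(len(mutated))]
    elif can_add:
        position = rng.randrange(len(mutated) + 1)
        mutated.insert(position, rng.choice(templates))
    return mutated


rng = random.Random(42)                 # fixed seed only for reproducibility
templates = ["owner", "pets", "visits"]  # hypothetical action template names
individual = sample_individual(templates, max_actions=10, rng=rng)
mutated = mutate_structure(individual, templates, max_actions=10, rng=rng)
```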
The content of an action a can be modified by selecting one of the K actions and then randomly selecting genes from its tree. Given \(G_a\), the set of genes in the selected action a, each gene could be mutated with probability \(1/|G_a|\). The type of mutation depends on the type of the gene. For example, a numeric gene could have its phenotype value increased or decreased by a certain small delta. A Boolean gene could be flipped from true to false and vice versa. A string gene could have some of its characters modified randomly. And so on (full details can be found in the source code of EvoMaster [31]).
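The content mutation can be illustrated with the following sketch, which flattens the gene tree of one action into a dictionary from gene name to value. This flattening, the function name, the fixed delta, and the replacement alphabet are assumptions for the example; the actual operators in EvoMaster are more elaborate.

```python
import random
import string


def mutate_action(genes: dict, rng: random.Random, delta: int = 1) -> dict:
    """Mutate each gene of one action independently with probability 1/|G_a|."""
    if not genes:
        return {}
    probability = 1.0 / len(genes)
    mutated = dict(genes)
    for name, value in genes.items():
        if rng.random() >= probability:
            continue  # this gene is left untouched
        if isinstance(value, bool):  # checked before int: bool is a subclass of int
            mutated[name] = not value
        elif isinstance(value, int):
            mutated[name] = value + rng.choice((-delta, delta))
        elif isinstance(value, str) and value:
            index = rng.randrange(len(value))
            new_char = rng.choice(string.ascii_lowercase)
            mutated[name] = value[:index] + new_char + value[index + 1:]
    return mutated


rng = random.Random(3)  # fixed seed only for reproducibility
# Hypothetical flattened genes of an "owner" action.
owner_genes = {"id": 8, "telephone": True, "name": False, "city": "oslo"}
print(mutate_action(owner_genes, rng))
```

With a probability of \(1/|G_a|\) per gene, one gene is mutated on average per application, although zero or several mutations are possible in a single call.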
Consider the last example in Figure 5, where the mutation operators are applied as shown next.
Here, the test case is shrunk to only one action owner (from the original two). The owner’s input argument id is mutated to 1 (from the original 8). The action is mutated by dropping the phenotype of the genes telephone, id, and visits (i.e., their Boolean genes were mutated from true to false), and it is extended by adding to the phenotype the field name (i.e., its Boolean gene is mutated from false to true).
4.5 Fitness Function
The fitness function plays a critical role in an evolutionary algorithm, as it specifies which individuals will survive and reproduce. The main goal of our testing is to find faults in the tested API. A fault cannot manifest if the code in which it lies is not executed. Therefore, an indirect approach to try to find more faults is to maximize the code coverage achieved by the generated tests. However, generating high code coverage tests is a complex task, as the execution flow in the API might depend on complex constraints (e.g., complex predicates in if statements), which could be satisfied only with very specific inputs.
There is a large body of research literature on the topic of maximizing code coverage for software testing. In the case of search-based software testing [13], there are common techniques like branch distance [59] and testability transformations [51]. For the work in this article, we do not define any new white-box heuristics. We rather rely on the state-of-the-art white-box heuristics for system test generation provided by EvoMaster. This includes advanced testability transformations [26], as well as different types of branch distance heuristics. All evolutionary algorithms compared in this work use the same fitness function.
Besides testing targets based on the source code (e.g., lines, statements, and branches), there are other metrics of interest for practitioners. For example, for Web APIs using HTTP, covering different returned HTTP status codes can provide a better coverage of the API. A correct query would typically receive a 200 status code in the HTTP response. For the same endpoint, an invalid input (e.g., a number outside a specified range) could lead the server to return a 400 status code (user error), although this depends on the server implementation (some GraphQL servers return 200 even in the case of errors). A request with no authentication information could return a 401. A request with authentication but no authorization (i.e., no right permissions) could return a 403. An input that leads to a crash (e.g., an exception thrown in the business logic of the API) could result in a 500 status code. And so on.
For each GraphQL endpoint, we create a different testing target for each returned HTTP status code. This prevents EvoMaster from discarding newly generated tests that cover endpoints returning different status codes (and thus showing different behaviors of the API).
When evaluating the fitness of an evolved test, besides considering testing targets related to code coverage and HTTP status coverage (for each different query/mutation operation), we also create new testing targets based on the returned responses. As discussed in Section 2.1, each response could contain either a data field or an errors field. For each query and mutation in the GraphQL schema, we consider two additional testing targets for those two possible outcomes. Note that a trivial way to get a response with errors is to send a syntactically invalid query. As such evolved test cases would be of little use, we explicitly avoid generating such test cases (unless there are faults in EvoMaster).
As automated oracles to detect faults once a test is executed, we consider two properties: returned HTTP status code 500 and responses with errors fields. The former is a common oracle used in fuzzing HTTP-based APIs (e.g., [58, 61]). However, it is important to keep in mind that not all 500 responses are necessarily related to software faults. For example, an API could return a 500 when unable to communicate with its database because the database is down or not reachable for some technical reason. As errors fields might be due to user errors besides server errors, the users would still need to check those generated tests to see if actual faults are detected.
As a given query/mutation might fail for different reasons, we keep track of the last executed line in the business logic of the SUT. We further create a separate testing target for each combination of errored query/mutation and last executed line. Having explicit testing targets for those cases enables the search algorithms to keep those test cases, although the fitness function (currently) provides no gradient to guide the generation of such test cases in the first place.
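The response-based testing targets and the two automated oracles discussed in this section can be summarized in a short sketch. The target naming scheme, the function names, and the example line identifier are assumptions; in particular, the last executed line would come from code instrumentation, which is only modeled here as an optional parameter.

```python
from typing import Optional, Set


def covered_targets(operation: str, status_code: int, body: dict,
                    last_line: Optional[str] = None) -> Set[str]:
    """Derive the testing targets covered by one response to a query/mutation."""
    targets = {f"{operation}:HTTP_{status_code}"}  # one target per status code
    if body.get("data") is not None:
        targets.add(f"{operation}:DATA")
    if body.get("errors"):
        targets.add(f"{operation}:ERRORS")
        if last_line is not None:
            # Separate target per (errored operation, last executed line).
            targets.add(f"{operation}:ERRORS:{last_line}")
    return targets


def is_potential_fault(status_code: int, body: dict) -> bool:
    """Oracle: HTTP 500 or a response carrying an errors field."""
    return status_code == 500 or bool(body.get("errors"))


ok = covered_targets("owner", 200, {"data": {"owner": {"id": 1}}})
failed = covered_targets(
    "owner", 200,
    {"data": None, "errors": [{"message": "boom"}]},
    last_line="OwnerResolver:42",  # hypothetical instrumentation output
)
```

A test flagged by is_potential_fault still needs manual inspection, since an errors field may simply reflect a user error.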
4.7 Black-Box Testing
We use black-box testing when we have no knowledge of the source code of the GraphQL API or it is not available for instrumentation (e.g., to calculate search-based heuristics like the branch distance). It is not straightforward to achieve high coverage with such tests [20], as little information from the SUT can be exploited. However, in some cases (e.g., when testing remote services), a black-box approach might be the only option available for automated testing. In addition, as no code analysis is performed, black-box testing can be applied regardless of the programming language the API is written in (e.g., Python or Ruby). In contrast, white-box testing with EvoMaster is currently limited to languages running on the JVM (e.g., Java and Kotlin) and NodeJS (e.g., JavaScript and TypeScript).
We use the RS algorithm developed in EvoMaster. The main idea behind RS is to generate test cases through a randomized process, where no code-based fitness function is employed. Search-based heuristics are not used because the source code of the GraphQL APIs is not available.
From a practical standpoint, our black-box testing is the same as RS but without code-based heuristics. Like for white-box testing, we start by fetching the schema (Section 4.2) and create a problem representation (Section 4.3) from which new test cases are randomly sampled (Section 4.4). No evolutionary mutation operator is applied here. We use the same fitness function to reward found faults (Section 4.5) but without any code metrics. At the end of the search, the final test suite is minimized, to contain only the test cases that contribute to the fitness (Section 4.6). In other words, for each query/mutation, we retain test cases that lead to different HTTP status codes, and at least one with a correct data response and at least one with an errors response.
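The effect of this minimization in the black-box setting can be approximated with a simple greedy filter, sketched below. This is a stand-in for illustration, not the algorithm used in EvoMaster; the function name, test identifiers, and target names are assumptions.

```python
def minimize(suite: list) -> list:
    """Keep only tests that cover at least one target not covered so far.

    suite: list of (test_id, set_of_covered_targets) pairs, in evaluation order.
    """
    covered = set()
    kept = []
    for test_id, targets in suite:
        if not targets <= covered:  # the test contributes something new
            kept.append(test_id)
            covered |= targets
    return kept


suite = [
    ("test_0", {"owner:HTTP_200", "owner:DATA"}),
    ("test_1", {"owner:HTTP_200", "owner:DATA"}),    # redundant with test_0
    ("test_2", {"owner:HTTP_200", "owner:ERRORS"}),  # adds the errors outcome
]
print(minimize(suite))  # ['test_0', 'test_2']
```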
Both black-box and white-box testing share the same goal of detecting faults in the tested APIs. They use the same automated oracles to detect faults. Both testing approaches are important, as they have their own strengths and weaknesses. For example, black-box testing is easier to use (e.g., it requires no setup to specify how to start the application with automated instrumentation), and it is of wider applicability (e.g., it is not restricted to any specific programming language). However, white-box testing can achieve better results (i.e., code coverage and fault finding), as it can exploit information about the source code of the API. Furthermore, its generated tests can be used for regression testing (as the generated tests can start, stop, and reset the API).
In this article, we provide and empirically evaluate both approaches, as both of them are useful for practitioners in the industry. Considering that, to the best of our knowledge, this is the first work in the literature addressing this problem, more can be done in future research. For example, our black-box approach is very basic, simply an RS on syntactically valid queries based on the schema.
4.8 Tool Support
All the novel techniques presented in this work have been implemented as part of our existing tool EvoMaster. EvoMaster is open source on GitHub, with each new release automatically uploaded to Zenodo for long-term storage (e.g., [31]).
When a practitioner uses EvoMaster, they need to specify with command-line options whether they are testing a REST or GraphQL API. For example, black-box testing of an online API such as GitLab can be done on the command line as shown next.
Here, one needs to specify that we are fuzzing a GraphQL API (using --problemType) and not, for example, a RESTful one; where the API is located (--bbTargetUrl); the type of testing (--blackBox); the format of the output tests (--outputFormat); for how long to run the fuzzing session (--maxTime); and a rate limiter (--ratePerMinute) to avoid overloading the tested API with requests (needed when testing APIs on the Internet to avoid denial of service). For white-box testing, some manual effort is needed, as a driver class must be implemented to specify how to start and stop the API.
Extending an existing fuzzer for a new problem domain not only requires scientific research but also significant engineering effort. What is presented in this article took 2 years of work. Considering the complexity of EvoMaster (which is currently more than 200,000 LOCs, not including tests), providing precise code metrics is not viable. Although modules specific for GraphQL can be identified (e.g., org.evomaster.core.problem.graphql with more than 4,000 lines of Kotlin code), changes were needed throughout the whole code base of EvoMaster to be able to support GraphQL. For example, the gene system of the evolutionary engine of EvoMaster needed to be extended with new genes like TupleGene. We can estimate around 10,000 to 15,000 LOCs needed to support GraphQL API testing.
To reduce the risk of publishing wrong results based on faulty software, this work has been carefully tested. For example, in unit tests (e.g., GraphQLUtilsTest), we parse 75 GraphQL schemas (having more than 860,000 lines), to make sure that our schema analysis algorithms do not crash and give the correct results (at least for those 75 schemas). Furthermore, EvoMaster has a sophisticated system of end-to-end tests [30]. We create several artificial APIs, run EvoMaster on them, compile the generated tests, run them, and verify properties on those tests. This is all done automatically from JUnit tests (including the compilation and dynamic loading and execution of the newly generated tests on the fly), and run in a Continuous Integration system (i.e., GitHub Actions) at each new Git commit (more details can be found in the work of Arcuri et al. [30]). Due to all these end-to-end tests, the current EvoMaster build takes more than 2 hours. For GraphQL, we currently have end-to-end tests for 39 artificial APIs (in the module spring-graphql), covering different aspects of GraphQL, for a total of more than 6,000 LOCs.