1 Introduction
It is a common practice in industry to develop large enterprise systems with microservice architectures [
59,
68,
79]. For example, Meituan is a large e-commerce enterprise with more than 630 million customers in China, with microservice systems like Meituan Select comprising more than 1,000 different web services. Testing these kinds of systems is very complex, due to their distributed nature and their access to external services such as databases. There is a dire need in industry for test automation for such systems.
Although in recent years there has been an interest in the research community on fuzzing REST web services [
45] (e.g., with tools like Restler [
29], RestTestGen [
67], Restest [
58], RestCT [
69], bBOXRT [
51], and Schemathesis [
47]), to the best of our knowledge there is no work in the literature on the testing of modern
Remote Procedure Call (
RPC) web services. There is a large body of knowledge in the scientific literature on the topic of software testing automation, with several successful stories in many different software testing domains [
14,
35,
44]. Addressing an important industrial testing problem for the first time does not start from scratch, especially when aiming at providing useful results for engineers in industry. It rather builds on top of decades of scientific research on the topic. On the one hand, some research challenges are similar to other domains (e.g., how to deal with SQL databases when fuzzing a web service [
24], regardless of whether it is a REST, GraphQL, or RPC-based API). On the other hand, scientific research and empirical evaluations are needed to address the specific peculiarities of each different software testing problem. For example, to the best of our knowledge, none of the existing fuzzers in the scientific literature can be directly applied on fuzzing RPC systems without major engineering and scientific effort, as, for example, the API schemas and communication protocols are different.
As part of an industry-driven collaboration [
18,
40,
41,
42,
43], when we first tried to use our
EvoMaster fuzzer [
17] for RESTful APIs on the web services developed at Meituan, we could not apply it directly [
74]. We had to manually write REST APIs as wrappers for the RPC systems (which use Apache Thrift). Not only is this time-consuming, but the generated tests are also more difficult to use for debugging any found fault. Two web services were used as a case study. That study (with interviews and questionnaires among the developers at Meituan) pointed out a few research challenges, including the need for native support for RPC systems in web service fuzzers. Such support not only requires non-trivial engineering effort (our extension to the existing fuzzer
EvoMaster required more than 10,000 lines of code, not including test cases), but also raises several scientific research challenges that must be addressed to best handle RPC-based APIs (as we will discuss in more detail later in the article).
In this article, we provide a novel approach
to automatically fuzz RPC-based APIs, built on top of
EvoMaster. To adapt to various RPC frameworks, in this approach a RPC schema is defined to formulate the API specification, documenting all the necessary info to make a RPC call and the possible responses (e.g., a thrown exception or a failure). With our approach, the schema of a RPC-based service can be automatically extracted from the source code. This allows one to test services developed with different RPC frameworks. With the extracted schema, a test for a RPC-based API can be reformulated as an
individual, i.e., a sequence of RPC calls under a certain state of the API (e.g., its database, if it has one). Thus, search-based techniques (such as the MIO algorithm [
19]) can be employed to evolve tests (e.g., seek various values of input parameters of RPC calls in order to cover more code and find more faults). To better solve our testing problems with search, we define new heuristics specific to the RPC domain. The approach was implemented as an extension of
EvoMaster and has been integrated into an industrial pipeline. To assess the effectiveness of our novel approach and its application in an industrial context, we empirically compared it with a gray-box technique on two artificial and four industrial RPC-based APIs, and further reported its performance on 50 industrial APIs in real industrial settings. The main contributions of the article include:
(1)
the first approach in the literature for fuzzing RPC-based APIs;
(2)
an open source tool support (i.e., an extension to the existing fuzzer EvoMaster);
(3)
an empirical study carried out in industrial settings that involves in total 54 industrial RPC-based APIs comprising 1,489,959 lines of code (computed with JaCoCo) for business logic;
(4)
an in-depth analysis on four selected industrial APIs with our industrial partner; and
(5)
identifying lessons learned and research challenges that must be addressed before better results can be obtained.
The article is organized as follows. Section
2 provides the needed background information to better understand the rest of the article. Section
3 analyzes related work. The details of our novel approach are presented in Section
4. Our empirical study is discussed in Section
5, followed by lessons learned in Section
6. Threats to validity are discussed in Section
7. Finally, we conclude the article in Section
8.
3 Related Work
To the best of our knowledge, there does not exist any technique for fuzzing modern RPC-based APIs (e.g., using frameworks like Apache Thrift, gRPC, and SOFARPC). In addition,
EvoMaster seems the only open source tool that supports white-box testing for Web APIs, and it gives the overall best results in recent empirical studies comparing existing fuzzers for REST APIs [
49,
73]. However, currently
EvoMaster only supports fuzzing RESTful APIs [
20] and GraphQL APIs [
33].
In the literature, there has been work on the fuzzing of other kinds of web services. The oldest approaches deal with black-box fuzzing of SOAP [
36] web services, such as, for example, [
30,
32,
46,
52,
53,
56,
60,
64,
65,
70]. SOAP is a type of RPC protocol. However, SOAP’s reliance on the XML format for schema definitions and message encoding has led this protocol to lose most of its market share in industry (i.e., apart from maintaining legacy systems, it is not used much anymore for new projects).
In recent years, there has been a large interest from the research community in testing RESTful APIs [
37,
45], which are arguably the most common type of web services. Several tools for fuzzing RESTful APIs have been developed in the research community, such as, for example (in alphabetic order),
bBOXRT [
51],
EvoMaster [
16],
RESTest [
57],
RestCT [
69],
RESTler [
29], and
RestTestGen [
67]. Another recently introduced type of web services is GraphQL [
5], which is gaining momentum in industry. However, there is little work in academia on the automated testing of this kind of web services [
33,
34,
48,
66].
The automated testing of different kinds of web services (e.g., modern RPC, SOAP, REST, and GraphQL) shares some common challenges (e.g., how to define white-box heuristics on the source code of the SUT, and how to deal with databases and interactions with external services). However, there are also specific research challenges for each type of web service, as we will show later in the article. A fuzzer for SOAP or REST APIs would not be directly applicable to a RPC web service, and vice versa.
In the literature, there are many applications of scientific research on the automation of software testing [
14,
35,
44]. Popular examples are AFL [
1] for parsers and Sapienz for mobile applications [
54]. Albeit possible, extending these kinds of tools from other testing domains for fuzzing RPC APIs would likely require major engineering and scientific effort. Other domains like fuzzing network protocols (e.g., AFLNet [
62]) and network devices (e.g., NDFuzz [
78]) are closer to the fuzzing of Web APIs. Still, a non-trivial amount of work would be needed to adapt them to the white-box fuzzing of RPC-based APIs. This could also explain why, to the best of our knowledge, none of these existing tools has been used so far to fuzz RESTful APIs, despite the recent popularity of such APIs in academia.
Given a client library for a RPC-based API, a unit test generator could be used directly on it, such as, for example, the popular
EvoSuite [
38] for Java classes. This might work if the SUT and the client library run in the same JVM. However, all the issues of dealing with the system testing of web services would still be there, e.g., how to deal with databases and what to use as a test oracle. Also, such a unit testing tool would likely need some modifications (e.g., to collect coverage from all the classes and not just the RPC-client one). Therefore, how a unit test generator could be adapted, and how it would fare in such a system testing scenario, is an open research question.
4 Fuzzing RPC-based APIs
When addressing a new testing problem like the fuzzing of RPC-based APIs, several design decisions need to be made, especially when using search-based techniques. There is the need to specify the
search space (Section
4.1), how to represent the
genotype of an evolving individual (i.e., a test case in this context) (Section
4.2), how to define the
fitness function to guide its evolution (Section
4.3), which
search operators to employ to modify the evolving individuals (Section
4.4), and how to
output the final results to the user (Section
4.5).
Building a fuzzer that can scale and be used on tens of industrial systems requires major engineering effort over several years. To evaluate the novel techniques presented in this article, we did not start from scratch, but rather re-used and extended an existing open source fuzzer. In particular, our novel approach is built on top of
EvoMaster (recall Section
2.3).
Figure
5 represents an overview of our novel approach. In order to fuzz RPC-based APIs, we propose
RPC Schema specification, which formulates the necessary info to allow the execution of RPC function calls and the analysis of execution results. With this specification, as shown in the figure, the approach is composed of six steps distributed between the
driver and
core of
EvoMaster, plus initial settings manually provided by the user, for enabling the automated fuzzing of RPC-based APIs with search techniques. We briefly summarize these steps, whose details will be provided in the rest of this section.
To employ
EvoMaster, a
SUTdriver must be specified, implementing how to start/stop/reset the SUT (recall Section
2.3). In the context of RPC-based API testing, in the
SUTdriver, we further need the user to specify (1)
RPCInterfaces: which interfaces define the API in the SUT, given by their class names, and (2)
RPC clients: the corresponding client instances used to make RPC calls during test generation (Step 0). Then, with the specified interface info,
RPC Schema Parser will extract and identify the API schema based on proposed
RPC Schema specification, in order to access the RPC functions (Step 1). At the
core side, the extracted schemas will be further reformulated (Step 2) into components (i.e.,
RPC Actions and
Genes) of the search for producing tests (Step 3). In our approach, a generated test is evaluated by its execution on the SUT (Step 4) performed on the
driver side. Then, the responses, SBST heuristics (e.g., code coverage computed with code instrumentation), and potential faults identified during the execution will be returned to the
Fitness Function (Step 5) for calculating the fitness value of the executed test. Producing and evaluating tests is performed iteratively (i.e., Steps 3–5) within a given search budget. At the end of the search, a set of the best tests (in terms of code coverage and fault detection) for the RPC-based SUT will be output (Step 6) in a given format (e.g., JUnit 5).
4.1 Search Space
At a high level, a RPC-based API can be seen as a process that opens a TCP/UDP port on a given host, and then replies to incoming messages formatted with a given application-layer protocol. Such a protocol can vary among the different RPC implementations. Furthermore, the API replies only to requests for its defined methods, requiring the right number and types of input parameters. This means that sending random bytes over the TCP/UDP connections would unlikely result in any meaningful response from the API, and possibly in no execution of the code of its business logic.
To address this issue, it is important for a fuzzer to send well-formatted messages for the different remote methods exposed by the web service. Given a schema that specifies which methods can be called, a fuzzer can then generate calls with the right input parameters. Considering that these methods can take as inputs complex data such as strings, objects, and arrays, the search space of possible inputs is huge, even when using a schema to constrain what will be sent. Only some specific inputs will reveal faults and optimize code coverage. Furthermore, to test a specific endpoint, there might be the need to call a previous one to set the state (e.g., a database) of the API. A test case is hence a sequence of one or more remote calls toward the API, which increases the search space even further. To complicate things even more, achieving higher code coverage might require setting up the environment in which the API operates. For example, advanced fuzzers can also add data directly into SQL databases as part of an initialization phase, based on what queries the API executes on the database. This further extends the search space of possible test cases that the fuzzer needs to explore.
Nowadays, there exist various RPC frameworks for building modern RPC-based APIs, e.g., Thrift [
13], gRPC [
6], Dubbo [
2], and SOFARPC [
10]. As discussed in Section
2.1, most of these frameworks result in
RPCInterfaces (e.g., implemented as
interface or
abstract class) in their API implementations representing how the services can be accessed, together with a client stub to make the actual RPC calls. Considering all the possible types of communication protocols supported by the different RPC frameworks, calling a RPC API directly from a fuzzer would be a major technical endeavor. Furthermore, it would require one to support the different schema languages for each framework, such as, for example
.thrift (see Figure
2(a)) and
.proto (see Figure
3(a)) formats; furthermore, there would be limitations when the schema file is not available, as for APIs implemented with SOFARPC and Dubbo.
In order to enable the automated testing of RPC-based APIs in a more generic way, in this article we propose a schema specification specific to the RPC domain that formulates the main concepts needed to facilitate the invocation of RPC function calls and the analysis of their results. Such a specification can be automatically extracted based on the RPCInterfaces, regardless of which RPC framework is employed by the API. This schema defines the search space for the fuzzing, as we evolve test cases complying with such a schema. Then, we employ the actual client libraries of the APIs to make the RPC calls.
4.1.1 RPC Schema Specification.
Our RPC Schema is defined with a
Data Transfer Object (
DTO), which can then be instantiated in different formats, such as, for example, JSON. Figure
6 shows our RPC schema specification with a UML class diagram.
To extract info for enabling invocations of RPC function calls, there exist five main concepts to define
RPCInterfaces (denoted as classes with white background in Figure
6):
—
RPCInterfaceSchemaDto: it represents the
RPCInterface, such as the
Interface with Thrift (see Figure
2(b)) and
abstract class with gRPC (see Figure
3(b)). A
RPCInterfaceSchemaDto comprises one or more
RPCActionDto (see
1..* functions), a set of functions for authentication handling (see
* authFunctions) and a set of specifications of data types (see
* types). For instance,
NcsService.Iface interface has a
bessj function and employs
Dto data structure (as shown in Figure
2(b)). Note that a RPC-based API might have multiple interfaces, as is the case for the industrial APIs we studied in this article.
—
RPCActionDto: it captures the information needed to make a RPC function call, i.e., the input parameters if they exist (see * requestParams) and the additional authentication setup (see 0..1 authSetup). Each RPCActionDto also has interfaceId, clientInfo, and actionName properties to identify the RPC function to call. In addition, we define a property isAuthorized representing whether the RPCActionDto is restricted by authentication in its implementation.
—
ParamDto: it is used to describe the values of input parameters and of the return value. A ParamDto links to an explicit datatype (see type) and might be composed of a set of ParamDtos for representing complex data types, such as objects, collections, and maps (see * innerContent). A ParamDto might be specified with a default value (see 0..1 default), e.g., a field in a DTO can be assigned a default value. In addition, we define stringValue to assign a value to the input parameter or to represent the actual value of the return. Note that stringValue is applicable only if there are no internal elements. To capture constraints on the input parameters, if they exist, we define a set of properties in ParamDto as follows:
—
isNullable represents whether the parameter is nullable to make the call.
—
isMutable indicates whether the parameter is mutable. The value of this property is derived based on whether the parameter is assigned a fixed value. For instance, a parameter specified with @AssertTrue must be true, and is thus considered immutable.
—
minSize and maxSize represent boundaries on the size, if specified. This constraint is applicable to data types such as collections, maps, arrays, and char sequences (e.g., strings).
—
minValue and minInclusive are used to represent a minimum value; the value of the parameter must be higher than, or equal to (if minInclusive is true), the minimum.
—
maxValue and maxInclusive are used to represent a maximum value; the value of the parameter must be lower than, or equal to (if maxInclusive is true), the maximum.
—
precision and scale capture constraints on numeric values regarding their precision and scale (e.g., the number of digits in their decimal part).
—
pattern represents a regular expression that a string value must match.
Such captured constraints contribute to test data generation when fuzzing Web APIs, by sampling values within the boundaries of these constraints. The values of all of the constraint properties can be derived automatically based on the
RPCInterface, which is explained in the
RPC Schema extraction (see Section
4.1.2).
—
TypeDto and RPCSupportedDataType identify the data type info of the ParamDto. The list of data types we support is defined in an enumeration RPCSupportedDataType that covers the most commonly used data types, i.e., array, byte buffer, date, enumeration, list, map, set, string, integer, Boolean, double, float, long, character, byte, short, big integer, big decimal, and any customized DTO object. In TypeDto, an example (see 0..1 example) can be specified for representing the generic type of a collection, array, or map. Note that this list of supported data types is not meant to be complete for all RPC frameworks; if needed, it can be extended.
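As an illustration, a ParamDto for a constrained integer parameter might be instantiated in JSON as follows. This is a hypothetical sketch: the field names follow the concepts defined above, but the exact serialization used by the tool may differ.

```json
{
  "name": "dayOfMonth",
  "type": { "type": "INT" },
  "isNullable": false,
  "isMutable": true,
  "minValue": "1",
  "minInclusive": true,
  "maxValue": "31",
  "maxInclusive": true
}
```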
4.1.2 RPC Schema Extraction and Execution Support.
As a white-box fuzzer, besides the source code of SUT, a
SUTdriver is the only input that
EvoMaster needs a user to specify (recall Section
2.3). Then, the
SUTdriver is employed at the
driver side for, e.g., starting/stopping/resetting the SUT. In the context of RPC-based API fuzzing, we further need the user to provide info of
RPCInterfaces and corresponding client instances for extracting the API schema and accessing the SUT. As shown in Figure
5, in the
driver, with the provided
SUTdriver (Step 0), we developed a
RPC Schema Parser, by directly extracting the interface definitions (which do represent the API schema) from the source code using a
reflection technique, such as Java Reflection.
Thus, with any RPC framework, if the available RPC functions are defined as an
interface/
abstract class (which is usually the case), our approach could be applicable. The extracted information is further formulated as a generic
RPC Schema (see Section
4.1), i.e., a
RPCInterface will be formulated as a
RPCInterfaceSchemaDto that contains specifications to invoke RPC function calls (i.e.,
RPCActionDto (Step 1
\(\rightarrow\) Step 2)). In addition, we developed
RPC Test Client, which allows one to make a RPC function call against the SUT with
RPCActionDto, then return
ActionResponseDto (Step 5
\(\leftrightarrow\) Step 4) using specified RPC client instances. The
driver is implemented as a service using REST API, and the two components (i.e.,
RPC Schema Parser and
RPC Test Client) are exposed as two HTTP endpoints, i.e.,
/infoSUT for extracting RPC API schema and
/newAction for executing RPC function calls. Thus, with a provided
SUTdriver, our
driver, employing the proposed
RPC Schema, provides a single, uniform interface of our tool to support the invocation of RPC functions and the analysis of their results. This is an essential prerequisite for fuzzing RPC-based APIs.
Note that, instead of enabling RPC function execution at the driver side, an alternative approach would have been to include the two components and the RPC API client library directly into the core process, which might be more efficient (as calls from the core would not need to go through the driver with HTTP requests). But that would introduce a lot of usability issues to configure (e.g., how to dynamically load a library at runtime, and how to deal with different JVM versions and different programming languages). When introducing a novel approach, it is important to take into account how complex it is for practitioners to set up. For this, industry collaborations, where actual engineers use these techniques on their systems (as we do for this article), are paramount.
4.1.3 SUTdriver Implementation.
Figure
7 represents an example of a
SUTdriver for manipulating the SUT and specifying the info of a RPC-based API. For instance, a
startSut method at lines 9–24 represents how to start a RPC-based SUT, which is implemented with the Thrift framework and SpringBoot. The method also instantiates needed clients to access the SUT after it starts (see lines 18–20). To provide info specific to the RPC problem, lines 29–31 specify the
RPCInterface (i.e.,
NcsService.Iface; see Figure
2(b)) and corresponding client instance. Note that the info is specified with a map since an API might have multiple
RPCInterfaces as we observed in our industrial case studies.
In addition, each framework or each company might define its own rules to represent results. For instance, we found that, in our industrial case study, in most cases a failed function call would not result in any exception being thrown, to avoid the propagation of exceptions in the distributed system, since the services are connected with each other. Thus, inside the response, our industrial partner has its own customized specification to reflect the results of RPC function calls, linked to their business logic. Without a thrown exception, a response representing an error might be falsely identified as a success if no further info is provided. To address this concrete issue in industrial APIs, in our approach we provide an extensible method (i.e.,
getCustomizedValueInRequests at line 35) to enable customized categorization of responses with the three levels as
CustomizedCallResultCode defined in our
RPC Schema (Section
4.1). By extending this method, the user can directly link their own rules into our testing context. Note that such a setup can be easily reused by multiple SUTs if they use the same customized specification (as was the case for all web services developed by our industrial partner).
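The idea of such a customized categorization can be sketched as follows. All names here (the enum constants, the response wrapper, and the classification rule) are illustrative inventions for this sketch, not the actual definitions of CustomizedCallResultCode or of our industrial partner's convention:

```java
// Hypothetical three-level categorization of RPC responses, mirroring the
// idea behind CustomizedCallResultCode (names here are illustrative).
enum CallResultCode { SUCCESS, SERVICE_ERROR, OTHER_ERROR }

// Hypothetical response wrapper used by a company-specific convention:
// a status code inside the payload, instead of a thrown exception.
class RpcResponse {
    final int code;
    RpcResponse(int code) { this.code = code; }
}

public class ResponseCategorizationSketch {

    // An example user-provided rule: code 0 means success, negative codes
    // are business errors, anything else is an unexpected error.
    public static CallResultCode categorize(RpcResponse response) {
        if (response.code == 0) return CallResultCode.SUCCESS;
        if (response.code < 0) return CallResultCode.SERVICE_ERROR;
        return CallResultCode.OTHER_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(categorize(new RpcResponse(0)));   // SUCCESS
        System.out.println(categorize(new RpcResponse(-42))); // SERVICE_ERROR
    }
}
```

With such a rule in place, a response whose payload encodes a business error is no longer misclassified as a success just because no exception was thrown.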
In an enterprise system, authentication typically needs to be handled. However, there are many different ways to implement an authentication system in a RPC API, as it is usually not supported natively (at least not in Thrift). For this article, we mainly support the authentication systems used by our industrial partner. Authentication tokens need to be sent as a field in the payloads of the messages (similarly to HTTP authentication headers in RESTful APIs). An authentication token can be either
static (i.e., pre-fixed) or
dynamic. The latter requires one to get the token from an endpoint (e.g., a
login RPC endpoint where valid username/password info must be provided), and then add it to all following RPC calls. In our implementation, we support both approaches, which must be configured in the
driver, i.e., by extending the method
getInfoForAuthentication at line 38 and
getCustomizedValueInRequests at line 41 as shown in Figure
7. To support a more fine-tuned setup for authentication, we enable options to specify whether (1) the authentication is applied to all API functions or (2) only to some functions in the SUT, which can be filtered by name or by special annotations applied on these functions. More details about how to configure this option can be found in two DTOs, i.e.,
JsonAuthRPCEndpointDto and
CustomizedRequestValueDto, in our implementation.
4.1.4 RPC Schema Parser.
Regarding the extraction of RPC interface definitions, currently, we target JVM RPC-based APIs using Java Reflection. As for the examples shown in Figures
2 and
3, a client-stub
RPCInterface is composed of a set of available RPC functions to be extracted. Each operation in the interface depicts a RPC function that can be called on this service. Then, with reflection, for each interface, we identify all such public methods, and further extract info on their input parameters, return types, and declared exception types.
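The extraction step can be sketched with plain Java Reflection as follows. GreetingService is a hypothetical interface standing in for a real client stub, and the output format is just for illustration:

```java
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical RPC interface, standing in for a Thrift/gRPC client stub.
interface GreetingService {
    String greet(String name, int times) throws IllegalArgumentException;
}

public class SchemaParserSketch {

    // Enumerate the methods of a RPC interface, collecting the info a
    // schema parser would record: name, parameter types, return type,
    // and declared exception types.
    public static String describe(Class<?> rpcInterface) {
        StringBuilder sb = new StringBuilder();
        for (Method m : rpcInterface.getDeclaredMethods()) {
            sb.append(m.getName())
              .append(Arrays.toString(m.getParameterTypes()))
              .append(" -> ").append(m.getReturnType().getSimpleName())
              .append(" throws ").append(Arrays.toString(m.getExceptionTypes()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(GreetingService.class));
    }
}
```

In the actual tool, the collected info would be stored into a RPCInterfaceSchemaDto rather than printed.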
Regarding datatypes, as we currently target JVM projects, we support the most commonly used data types, i.e., Array, ByteBuffer, Date, Enum, List, Map, Set, String, Integer, int, Boolean, boolean, Double, double, Float, float, Long, long, Character, char, Byte, byte, Short, short, BigInteger, BigDecimal, and any customized DTO object. For the handling of generics, we support their instantiations for any of these common data types. Note that all of these datatypes can be mapped to an item defined in RPCSupportedDataType.
Regarding the parameters, besides their datatype, we also need to extract info such as accessibility and constraints, if any. Extracting accessibility is needed for parameters typed with a DTO, since their fields might or might not be publicly accessible, i.e., declared as
public or not in Java. If a field is not publicly accessible, there is a need to further extract its existing getter and setter, which in our context are used in assertion generation (with the getter) and parameter construction (with the setter). Note that the accessibility info for each parameter is maintained inside the
RPC Test Client and is not exposed in the DTO, since the user does not need to care about how the data instance for a parameter is constructed nor about assertion generation. More details about how this info is constructed can be found in the class
AccessibleSchema.
Regarding the constraints, a parameter might be specified with constraints in its implementation. For example, an integer representing the day of the month could be constrained between the values 1 and 31. To make RPC function calls that do not fail due to input validation, we need to handle such constraints when generating input data for the calls. Therefore, for each parameter, with the proposed schema, we define possible constraints as properties of
ParamDto (see Section
4.1.1 and Figure
6). With the extraction, we further identify the properties based on the data types. For instance, all parameters are defined with a property named
isNullable representing whether a parameter object can be
null (the value of this property for all primitive types is always false). Parameters with numeric data types are defined with
min and
max properties. For parameters representing collections (e.g., maps and lists) and string types, properties for constraining their size/length are defined, i.e.,
minSize and
maxSize. For strings, we define
pattern for supporting a constraint specified with regular expressions. If a string has to represent a numeric value, we use
minValue and
maxValue for supporting a possible range constraint for it.
To identify constraints defined in the interface definitions (typically with annotations), we enable constraint extraction on
javax.validation.constraints [
8], which is the standard library for defining built-in constraints for Java objects. We support 16 commonly used constraints, i.e.,
AssertFalse,
AssertTrue,
DecimalMax,
DecimalMin,
Digits,
Max,
Min,
Negative,
NegativeOrZero,
NotBlank,
NotEmpty,
NotNull,
Pattern,
Positive,
PositiveOrZero, and
Size. Besides standard
javax annotations, constraints could be defined in other ways as well. For instance, in Thrift, whether a field is
required is represented by a
requirementType property of the
FieldMetaData class. Thus, in order to deal with constraints in the Thrift framework, we further extract and analyze the
metaDataMap object in the interface for obtaining such constraints.
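The annotation-based part of this extraction can be sketched as follows. To keep the sketch free of external dependencies, it defines its own Min annotation mirroring javax.validation.constraints.Min; the real parser reads the javax annotations instead, and OrderDto is a hypothetical DTO:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Stand-in for javax.validation.constraints.Min, so this sketch has no
// external dependency; the real parser reads the javax annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Min { long value(); }

// Hypothetical DTO with a constrained field.
class OrderDto {
    @Min(1) public int quantity;
}

public class ConstraintExtractionSketch {

    // Return the minimum constraint on a field, or null if none: this
    // mirrors how the constraint properties of a ParamDto get filled in.
    public static Long extractMin(Class<?> dto, String fieldName) throws Exception {
        Field f = dto.getDeclaredField(fieldName);
        Min min = f.getAnnotation(Min.class);
        return (min == null) ? null : min.value();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractMin(OrderDto.class, "quantity")); // prints 1
    }
}
```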
In addition, since there is no general standard restricting such interface implementations (as long as they compile), the methods and the data types might use Java generics (as we found in our industrial case study). Therefore, we further handle such generic types when processing the RPC function extraction, e.g., by analyzing getParameterizedType for each parameter.
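Resolving generics boils down to inspecting the ParameterizedType of each parameter, as in the following sketch; InventoryService is a hypothetical interface:

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical interface whose method takes a generic collection.
interface InventoryService {
    void restock(List<String> skus);
}

public class GenericTypeSketch {

    // Resolve the element type of a generic first parameter, as a schema
    // extractor must do to know which gene to build for each element.
    public static String elementTypeOfFirstParam(Class<?> iface, String method)
            throws Exception {
        Method m = null;
        for (Method cand : iface.getDeclaredMethods()) {
            if (cand.getName().equals(method)) { m = cand; break; }
        }
        if (m == null) throw new NoSuchMethodException(method);
        Type t = m.getGenericParameterTypes()[0];
        ParameterizedType pt = (ParameterizedType) t;
        return pt.getActualTypeArguments()[0].getTypeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            elementTypeOfFirstParam(InventoryService.class, "restock"));
        // prints java.lang.String
    }
}
```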
With the
RPC Schema Parser, our approach is thus capable of formulating each
RPCInterface as a
RPCInterfaceSchemaDto, as shown in Figure
6.
4.2 Genotype Representation
Given an extracted RPCSchemaDto schema, we need to define how to represent the genotype of the evolving test cases. In our context, a test can be reformulated as an individual composed of a sequence of RPC function calls. Each function call is formulated as a RPCCallAction, which comprises the method name, the input parameters (if any), optional authentication info, and a response (if declared).
For each input parameter, we define a gene with a specific type to represent the parameter. A gene is an instance of a specific type, with constraints on how it can be mutated (i.e., modified by the search operators) during the search. For example, a numerical parameter could be internally represented as an integer, initialized with a random value, where the search operators could add or subtract a delta from such a value during its evolution. Textual parameters could be represented with a string, where the search operators can either modify its characters, or add or delete some of them (thus changing the length of the string). For more complex types, genes can be hierarchically combined in a tree structure. For example, an object is represented with a gene that has one child gene for each field of the object (and so on recursively, if any of these child fields is an object itself).
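A minimal sketch of such a gene tree is shown below. The class names and mutation operators are deliberately simplified inventions for illustration, not EvoMaster's actual gene classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal gene hierarchy sketch: an integer gene mutates by a small
// delta, and an object gene recursively mutates one of its children.
abstract class Gene {
    abstract void mutate(Random rnd);
    abstract String phenotype();
}

class IntGene extends Gene {
    int value;
    IntGene(int value) { this.value = value; }
    void mutate(Random rnd) { value += rnd.nextBoolean() ? 1 : -1; }
    String phenotype() { return Integer.toString(value); }
}

class ObjectGene extends Gene {
    final List<Gene> fields = new ArrayList<>();
    void mutate(Random rnd) {
        // Mutate one randomly chosen field of the object.
        fields.get(rnd.nextInt(fields.size())).mutate(rnd);
    }
    String phenotype() {
        StringBuilder sb = new StringBuilder("{");
        for (Gene g : fields) sb.append(g.phenotype()).append(';');
        return sb.append('}').toString();
    }
}

public class GeneTreeSketch {
    public static void main(String[] args) {
        // An "object" parameter with two integer fields.
        ObjectGene dto = new ObjectGene();
        dto.fields.add(new IntGene(5));
        dto.fields.add(new IntGene(7));
        dto.mutate(new Random(42));
        System.out.println(dto.phenotype());
    }
}
```

Nesting an ObjectGene inside another ObjectGene gives exactly the recursive tree structure described above.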
There are many types of possible parameters to handle. To fully support RPC-based APIs, we re-use (and extend, where needed) the gene system already present in our EvoMaster fuzzer. Regarding the input parameters, we could re-use existing Gene objects already defined in EvoMaster for supporting REST API testing, such as, for example:
—
Straightforward mappings: ArrayGene for Array, Set, and List; BooleanGene for Boolean and boolean; DoubleGene for Double and double; LongGene for Long and long; FloatGene for Float and float; EnumGene for Enum; DateGene, DateTimeGene, and TimeGene for DateTime;
—
MapGene for Map. Note that the original version of MapGene only supported keys of string type. However, other key types, such as enum and integer, are quite common in RPC-based APIs. Therefore, we further extended MapGene to enable keys to be specified with IntegerGene, StringGene, LongGene, and EnumGene.
—
IntegerGene for Integer, int, Short, short, Byte, and byte (various types here are distinguished by min value and max value, e.g., max value is configured as 127 for Byte by default if it is not constrained);
—
StringGene for Character, char, String, and ByteBuffer (various types are distinguished by min and max length, e.g., max length is configured as 1 for char by default if it is not constrained);
—
RegexGene for a pattern specified in a String parameter;
—
ObjectGene for representing customized class objects;
—
CycleObjectGene for a field in the customized class object that leads to a cycle;
—
OptionalGene for handling any parameter whose isNullable property is true.
In addition, we also propose new genes, such as BigDecimalGene and BigIntegerGene for BigDecimal and BigInteger, respectively. In the original implementation of genes in EvoMaster, constraints were not fully supported for all types. Therefore, to fully support testing the RPC APIs in our case study, we extended the genes by enabling all constraints we defined in the RPC schema, such as handling precision and scale for numeric genes, and min and max size constraints for ArrayGene, MapGene, and StringGene. This means that, when these genes are either sampled at random or modified throughout the search via mutation operators, all (linear) constraints are kept satisfied (e.g., a mutation operator would not try to increase a numeric value if it is already at the maximum defined in its gene constraints).
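As a sketch of how such constraints can be enforced during sampling and mutation, consider a simplified BigDecimal gene with precision and scale bounds. This is illustrative only; the actual BigDecimalGene implementation differs:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Random;

// Illustrative sketch: a BigDecimal gene that keeps its value within the
// declared precision (total digits) and scale (decimal digits) constraints.
public class BigDecimalGeneSketch {
    BigDecimal value;
    final int precision, scale;

    BigDecimalGeneSketch(BigDecimal start, int precision, int scale) {
        this.precision = precision;
        this.scale = scale;
        this.value = clamp(start);
    }

    // Round to the declared scale, then clamp to the max magnitude that fits
    // in 'precision' digits (e.g., precision=4, scale=2 -> bound is 99.99).
    private BigDecimal clamp(BigDecimal v) {
        v = v.setScale(scale, RoundingMode.HALF_UP);
        BigDecimal bound = BigDecimal.TEN.pow(precision - scale)
                .subtract(BigDecimal.ONE.movePointLeft(scale));
        if (v.abs().compareTo(bound) > 0) {
            v = v.signum() < 0 ? bound.negate() : bound;
        }
        return v;
    }

    // Mutation adds or subtracts the smallest representable delta at this
    // scale; re-clamping ensures the value never leaves its bounds.
    void mutate(Random r) {
        BigDecimal delta = BigDecimal.ONE.movePointLeft(scale);
        value = clamp(r.nextBoolean() ? value.add(delta) : value.subtract(delta));
    }
}
```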
To test an RPC-based API, the input parameters can be either automatically generated or manually configured by the user (e.g., unlike headers in HTTP requests, in RPC function calls authentication info may be specified as part of the input DTOs). The former is handled by the search techniques in our approach. Authentication as part of the input parameters falls under the latter option, i.e., manually defined inputs. To allow combining both automatic and manual solutions, we decided to extend the test reformulation with a new gene, i.e., SeededGene, for handling manual inputs in a more generic way. A SeededGene, representing a gene that has a set of candidates, is constructed with the following: (1) gene is the original genotype of the parameter, which can be mutated by the search; (2) seeded is an EnumGene with the same type as gene, representing the enumerated candidates; and (3) employSeeded is a Boolean indicating whether the original gene or the seeded gene is used for the phenotype of the SeededGene. Besides handling authentication info, this kind of gene also allows further seeding with existing data (if any), which is useful, in particular, when solving industrial problems.
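The SeededGene concept can be sketched as follows, as a simplified illustration of the three components listed above (not the actual implementation):

```java
import java.util.List;

// Sketch of SeededGene: the phenotype comes either from the evolvable gene
// (mutated by the search) or from one of the user-provided seeded candidates.
public class SeededGeneSketch<T> {
    T gene;                 // (1) original genotype, modified by the search
    final List<T> seeded;   // (2) enumerated candidates (e.g., valid auth info)
    boolean employSeeded;   // (3) which side currently provides the phenotype
    int seededIndex = 0;

    SeededGeneSketch(T gene, List<T> seeded) {
        this.gene = gene;
        this.seeded = seeded;
    }

    T phenotype() {
        return employSeeded ? seeded.get(seededIndex) : gene;
    }
}
```

The search can toggle employSeeded to switch between exploring random values and exploiting known-valid seeded ones.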
To be able to efficiently fuzz real-world APIs, currently
EvoMaster has more than 80 different types of genes in its search-based fuzzer engine [
28]. A full description of each of them is not viable here. For low-level technical details, the interested reader can check out our implementation [
28], in particular, the code under the
org.evomaster.core.search package.
4.3 Fitness Function
To evaluate the fitness of a test case, we need to be able to make calls toward the API, with the right inputs, in the right format. The fitness itself will be based on two different kinds of metrics: white-box heuristics based on the execution in the source code (which requires the API to be instrumented with probes), and black-box heuristics based on the responses returned from each RPC call.
To make calls on the API, we use the
client library provided by the API itself (recall Section
4.1.2). Most RPC frameworks (e.g., Thrift and gRPC) provide ways to automatically generate client libraries (recall Section
2.1). However, there would be several technical issues in dynamically loading such a library inside the core process of
EvoMaster. Our solution is to let the user specify (and link) such client libraries in the
driver classes that need to be written to run the white-box mode of
EvoMaster (recall Section
2.3). This means that, when a test case needs to be evaluated, the
core process sends a representation (in JSON format) of such a test case to the
driver, and then the driver executes the actual RPC call and collects its response. Plus, the driver also collects any white-box heuristics from the instrumented API. Then, all this information is sent back to
core, where the fitness value for the test is computed.
This architecture to support fuzzing of RPC APIs introduces some latency, as the EvoMaster core does not communicate directly with the API. However, it has major benefits, as it is much easier to set up and implement (e.g., there is no need to parse any .thrift or .proto file), and it enables supporting all the different kinds of RPC frameworks with little effort.
4.3.1 RPC Execution Result Analysis.
By using the client to invoke an RPC function call, the result received at the client side can be either a return value as defined or an exception thrown by the API. To enable the result analysis, we propose five main concepts, denoted as classes with gray background in Figure
6, i.e.,
ActionResponseDto,
RPCExceptionInfoDto,
RPCExceptionCategory,
RPCExceptionType, and
CustomizedCallResultCode.
ActionResponseDto is a DTO that captures all info returned from an RPC function call, i.e., either a thrown exception (see 0..1 exceptionInfoDto) or a returned value as specified (see 0..1 response).
Regarding exceptions, handling their info for RPC functions is crucial for testing purposes, e.g., to be able to use automated oracles to identify faults in the SUT. To analyze an exception, in our proposed schema, we define
RPCExceptionInfoDto, which captures
exceptionName,
exceptionMessage,
type, and
exceptionDto, which is an optional DTO representing possible additional info for customized exceptions (e.g., the exceptions declared with the keyword
throws in Java). In addition, when invoking RPC function calls with clients that could be proxy clients, an exception caught at the client side might be wrapped, such as
UndeclaredThrowableException in Java. To get the exact exception info, we further extract and analyze the actual exception (e.g., with
cause of
UndeclaredThrowableException) as
RPCExceptionInfoDto. We also perform further exception analysis on
UndeclaredThrowableException, as was needed for our industrial case study, and the property
isCasueOfUndeclaredThrowableException represents whether such a wrapped exception is thrown from the SUT. Note that the actual exception analysis could be extended in the future when needed.
Besides
exceptionName and
exceptionMessage, to better identify exceptions in the context of RPC-based APIs, based on domain knowledge, we classify exceptions into four categories as
RPCExceptionCategory:
APPLICATION (e.g., internal server errors),
TRANSPORT (e.g., connection timeouts),
USER (e.g., sending invalid data), and
UNCLASSIFIED. Different RPC frameworks can define their own exceptions for handling various situations for RPC (e.g., type of
TApplicationException [
12] defined in
TException for Thrift, status [
11] defined in
StatusException and
StatusRuntimeException for gRPC). To cover such knowledge captured in various RPC frameworks, we define
RPCExceptionType, and each of the types should belong to a category in
RPCExceptionCategory. The
RPCExceptionType now provides full support for analyzing exceptions in the Thrift framework, covering all 24 exception types from
TApplicationException (refer to
APPLICATION category),
TProtocolException (refer to
USER category), and
TTransportException (refer to
TRANSPORT category). In addition, we define two generic exception types, i.e.,
CUSTOMIZED_EXCEPTION representing a declared exception (e.g.,
throws keyword in Java), and
UNEXPECTED_EXCEPTION representing an exception that is not declared in the function and does not belong to any other identified types (e.g.,
RuntimeException in Java). The generic exception types link to the
UNCLASSIFIED category that covers the cases whereby the exception type is unspecified or its identification is not supported yet for linking it to a specific RPC exception (like Thrift).
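The category/type mapping described above can be sketched as follows. The category names and APP_INTERNAL_ERROR, CUSTOMIZED_EXCEPTION, and UNEXPECTED_EXCEPTION come from the text, while the other type names and the string-matching logic are our simplified illustration of the actual analysis:

```java
// Sketch of mapping a caught exception to RPCExceptionCategory/RPCExceptionType.
public class ExceptionClassifier {

    enum RPCExceptionCategory { APPLICATION, TRANSPORT, USER, UNCLASSIFIED }

    enum RPCExceptionType {
        APP_INTERNAL_ERROR(RPCExceptionCategory.APPLICATION),
        PROTOCOL_ERROR(RPCExceptionCategory.USER),        // hypothetical name
        TRANSPORT_ERROR(RPCExceptionCategory.TRANSPORT),  // hypothetical name
        CUSTOMIZED_EXCEPTION(RPCExceptionCategory.UNCLASSIFIED),
        UNEXPECTED_EXCEPTION(RPCExceptionCategory.UNCLASSIFIED);

        final RPCExceptionCategory category;
        RPCExceptionType(RPCExceptionCategory c) { this.category = c; }
    }

    // Classify by the exception class name and whether it was declared
    // with 'throws' on the RPC function (simplified string matching).
    static RPCExceptionType classify(String exceptionClassName, boolean declared) {
        if (exceptionClassName.startsWith("org.apache.thrift.TApplicationException"))
            return RPCExceptionType.APP_INTERNAL_ERROR; // refined via its 'type' field in practice
        if (exceptionClassName.startsWith("org.apache.thrift.protocol.TProtocolException"))
            return RPCExceptionType.PROTOCOL_ERROR;
        if (exceptionClassName.startsWith("org.apache.thrift.transport.TTransportException"))
            return RPCExceptionType.TRANSPORT_ERROR;
        return declared ? RPCExceptionType.CUSTOMIZED_EXCEPTION
                        : RPCExceptionType.UNEXPECTED_EXCEPTION;
    }
}
```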
With ActionResponseDto, considering how an RPC call is handled by the SUT and whether any exception is thrown, we classify the execution into seven kinds of results that contribute to defining search heuristics for optimizing the generated tests:
—
(ER1) internal error: an exception that represents an internal error is thrown, e.g., TApplicationException with INTERNAL_ERROR type in Thrift.
—
(ER2) user error: an exception is thrown that can be traced to failed input validation, based on Thrift’s protocol errors.
—
(ER3) transport error: an exception that represents transport errors is thrown.
—
(ER4) other exception: another exception (e.g., other types of TApplicationException besides internal error) is thrown.
—
(ER5) declared exception: an exception declared in the function is thrown.
—
(ER6) unexpected exception: any other exception that is not declared in the function is thrown.
—
(ER7) handled: a value is returned as declared, without any exception thrown. If users specify their own result categorization, this label is further refined into one of success, service error, and other error.
With HTTP, a result for a request could be identified based on
status code in its response, e.g., 2xx indicates a success, 4xx indicates a client error, and 5xx indicates a server error. Such a standard is useful in developing automated testing approaches, e.g., rewarding requests with 500 status code (for finding potential faults in the SUT) and 2xx status code (for covering a successful request). However, in the context of RPC no such standard exists, and the result (e.g., success or failure) of a call cannot be directly determined based on the return value if no exception is thrown. Therefore, we propose
CustomizedCallResultCode, which defines three categories (i.e.,
SUCCESS,
SERVICE_ERROR, and
OTHER_ERROR) to better identify the return value of an RPC function call. Identifying the return value can vary from SUT to SUT, and from company to company. So, we expose an interface to allow a customization of the identification (see Section
4.1.2).
Thus, with our RPC result analysis specification as shown in Figure 6, each result of an RPC function call is constructed as an instance of ActionResponseDto. If an exception is thrown, an RPCExceptionInfoDto can be instantiated to describe the exception in detail, such as its class, message, type, and category. If a value is returned as defined, the value can be represented as a JSON object (when possible) and as an instance of ParamDto, and the result can be further identified with CustomizedCallResultCode.
4.3.2 RPC Test Client.
This component mainly enables the invocation of an RPC function call with an RPCActionDto, the analysis of the response or exception after the invocation, and the output of an ActionResponseDto. With the RPCActionDto, we know which interface the action belongs to and which parameters need to be constructed; the invocation is then made with the provided RPC client instance. Result analysis is performed based on the concepts discussed in Section 4.3.1. For instance, we currently support extracting name and message info from all exceptions that inherit from java.lang.Exception. The explicit type of an exception can be identified if it belongs to the Thrift framework: if org.apache.thrift.TException can be found in the client library and the class of the thrown exception inherits from TException, we extract its super classes to recognize the exception category (e.g., APPLICATION) under RPCExceptionCategory, and use its type property to identify a type (e.g., APP_INTERNAL_ERROR) under RPCExceptionType. If the exception is not from the Thrift framework, its explicit class is extracted, and it is labeled as either CUSTOMIZED_EXCEPTION or UNEXPECTED_EXCEPTION, based on whether the exception is part of the throws clause declared in the RPC function. Note that the result analysis needs to be extended if one wants to support other RPC frameworks, such as gRPC. However, exceptions in the RPC domain have already been formulated in our schema; the additional work would involve only technical details, e.g., adding additional types if they are not covered yet, and extracting the specific info to identify them.
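The unwrapping of proxy exceptions mentioned in Section 4.3.1 can be sketched as follows (a minimal illustration; the actual analysis in the driver is more involved):

```java
import java.lang.reflect.UndeclaredThrowableException;

// Sketch of unwrapping proxy-wrapped exceptions before analysis: proxy-based
// clients may wrap checked exceptions not declared on the proxy interface,
// so the actual cause must be extracted before classification.
public class ExceptionUnwrapper {

    static Throwable unwrap(Throwable t) {
        if (t instanceof UndeclaredThrowableException && t.getCause() != null) {
            return t.getCause();
        }
        return t;
    }
}
```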
More technical details on this implementation (e.g., how the parameters could be constructed for each data type, how to automatically recognize input parameters with customized info, and how to extract data and type from a Java object) can be found in our open source repository.
4.3.3 Test Execution.
With our RPC handling support in the
driver, we enable tests to be executed during the search. Then, with the JVM instrumentation provided by
EvoMaster, various SBST heuristics (e.g., code coverage, branch distance, and SQL queries heuristics) can be returned after the test is executed (see
JVM Instrumentation \(\rightarrow\) Fitness Function in Figure
5), in addition to the RPC function call execution results (i.e.,
ActionResponseDto). Regarding the authentication handling, dynamic tokens acquired via a login endpoint can be regarded as an additional action that needs to be invoked before the other RPC functions can be called. This has been enabled automatically in our implementation.
For white-box heuristics, we rely on the current state-of-the-art in white-box fuzzing of Web APIs given by
EvoMaster [
50,
73]. This includes adaptation of traditional SBST heuristics like the
branch distance [
20], as well as advanced
testability transformations [
26] and
SQL handling [
24].
In the context of testing RPC-based APIs, besides using SBST heuristics at code coverage level, we propose additional novel testing targets (with their heuristics) on the responses of the RPC calls for guiding the test case generation, as shown in Table
1. Note that, with MIO, each testing target has a fitness value between 0.0 and 1.0, where a higher value is better. A value of 1.0 means that the target is covered, and any value greater than 0.0 but less than 1.0 indicates that the target is reached but not covered.
For each RPC function, we create two testing targets
Handled and
Error, representing that the call is handled or in error, respectively, by the SUT. Based on the execution results we reformulated in Section
4.3.1, we set the fitness values of the
Handled and
Error testing targets as #1–#5 in Table
1, after the call is executed. For instance, if the execution result is identified as
handled, fitness values are set as 1.0 for
Handled and 0.5 for
Error (0.5 here represents the target is
reached but not
covered, which is heuristically better than not calling the method at all). If any unexpected or declared exception is thrown, the fitness values are set as 0.5 for
Handled and 1.0 for
Error. Since the exception type for the unexpected/declared exceptions is unclear, the execution would be further rewarded with a testing target for potential fault finding. If the exception type could be further identified, the fitness values of
Handled and
Error would be handled as #1–#3. Note that, for these three types of categorized exceptions, only
internal error is rewarded for potential fault finding. Considering that the
protocol error typically refers to user errors, it is deemed less important than the other exceptions; it is thus set with lower fitness values (i.e., 0.1) for both
Handled and
Error. As
transport error (ER3) is usually due to issues in the testing environment (e.g., timeouts), we do not reward such exception with any fitness values.
In addition, if the handled results can be further categorized by the user in terms of their business logic, we propose two additional testing targets, Success and Fail, representing whether the request succeeds or fails to be performed on the SUT. Heuristics for handling these two targets regarding execution results are defined in #6–#8. The strategy to decide the fitness values is similar to that for Handled and Error (e.g., server error is rewarded with a potential fault finding target and other error is treated as less important), aiming at covering both Success and Fail of the RPC function actions in terms of business logic. Moreover, to maximize response coverage, we also propose another four testing targets, considering whether any null or non-null value is ever returned (i.e., #9 and #10), and whether any empty or non-empty value is ever returned for collection datatypes (i.e., #11 and #12). Note that, although some of these fitness values do not provide much gradient for the search (e.g., only two values such as 0.5 and 1), they are still useful: test cases for reached but not covered targets (e.g., 0.5) are kept in the archive of MIO, and will still be sampled and mutated throughout the search.
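The Handled/Error fitness assignment can be sketched as follows, using the values stated in the narrative above. The mapping for other exception (ER4) is our assumption, and Table 1 remains the authoritative definition:

```java
// Sketch of the Handled/Error fitness assignment per execution result.
public class TargetFitness {

    enum Result {
        INTERNAL_ERROR, USER_ERROR, TRANSPORT_ERROR, OTHER_EXCEPTION,
        DECLARED_EXCEPTION, UNEXPECTED_EXCEPTION, HANDLED
    }

    // returns {fitness of Handled target, fitness of Error target}
    static double[] fitness(Result r) {
        switch (r) {
            case HANDLED:
                return new double[]{1.0, 0.5}; // Handled covered, Error reached
            case DECLARED_EXCEPTION:
            case UNEXPECTED_EXCEPTION:
            case INTERNAL_ERROR:
            case OTHER_EXCEPTION: // assumption: treated like the other error cases
                return new double[]{0.5, 1.0};
            case USER_ERROR:
                return new double[]{0.1, 0.1}; // less important, low reward
            case TRANSPORT_ERROR:
            default:
                return new double[]{0.0, 0.0}; // environment issue, no reward
        }
    }
}
```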
4.4 Search Operators
Our test reformulation enables the use of various search algorithms for RPC-based API fuzzing. In this work, we use MIO because it is the default in
EvoMaster, as it achieved the overall best results in an empirical study conducted by comparing it with various other algorithms [
15] on the fuzzing of RESTful APIs (recall Section
2.3). However, other search algorithms might be better on the problem of fuzzing RPC APIs; without further empirical analyses, this cannot be assessed for sure. Due to the high cost of running this type of comparison experiments, a comparison of different search algorithms for fuzzing RPC-based APIs is out of the scope of this article.
MIO is an evolutionary algorithm inspired by the (1+1) EA that uses two search operators, for sampling and mutation, respectively. We employ the same strategies as EvoMaster for RESTful API testing. The sampling produces a valid test by selecting a sequence of one or more available actions at random. Values of genes in these tests are initialized at random, within their constraints, if any (e.g., an ArrayGene will have n randomly generated elements based on its min and max length). Authentication info, if any, is enabled with a given probability, i.e., 95%, which is the default used in EvoMaster. In addition, at the beginning of the sampling, we also prepare a set of ad hoc tests that cover all available RPC function calls and all authentication combinations, i.e., each test has an action configured with and without authentication. In other words, the structure of the first \(k\) tests is not sampled at random, where \(k=a\times n\), with \(n\) being the number of functions in the RPC API and \(a\) being the number of different authentication settings.
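The ad hoc seeding of the first \(k=a\times n\) tests can be sketched as follows (illustrative; actions are represented here as simple function/auth pairs):

```java
import java.util.*;

// Sketch of the ad hoc seeding described above: before random sampling,
// one test per (RPC function, authentication setting) combination is
// prepared, giving k = a * n seeded tests in total.
public class AdHocSampling {

    static List<String[]> adHocTests(List<String> functions, List<String> authSettings) {
        List<String[]> tests = new ArrayList<>();
        for (String auth : authSettings) {
            for (String fn : functions) {
                tests.add(new String[]{fn, auth});
            }
        }
        return tests;
    }
}
```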
Regarding the mutation operator, actions in a test can be added or removed for manipulating the structure of the test, given a certain probability. To mutate the values of
Genes inside the tests, we employ the default value mutation in
EvoMaster, which has been integrated with
taint analysis [
26] and
adaptive hypermutation [
71]. How each gene is mutated depends on its type and constraints (if any), as previously discussed in Section
4.2.
Given a typical evolutionary algorithm with an individual representation having \(n\) bits, on average each bit would be mutated with probability \(1/n\). However, the genes defined in
EvoMaster can have massive differences in terms of their genetic information. For example, a Boolean gene would represent only two possible values (for
true and
false), whereas an object gene for a complex DTO could have hundreds of internal fields. The search engine of
EvoMaster can deal with genes of different
weight, and mutate the ones with more weight more often. Furthermore,
adaptive hypermutation [
71] enables having a higher mutation rate, and automatically detects which genes have less (or no) impact on fitness, and automatically mutates them less often.
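Weight-based gene selection can be sketched as simple roulette-wheel selection over gene weights. This is a simplified illustration of the idea, not EvoMaster's actual operator:

```java
import java.util.Random;

// Sketch of weight-based gene selection: genes carrying more genetic
// information get a higher weight, and are thus picked for mutation more
// often (roulette-wheel selection).
public class WeightedGeneSelection {

    static int select(double[] weights, Random r) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        double p = r.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            p -= weights[i];
            if (p <= 0) {
                return i;
            }
        }
        return weights.length - 1; // guard against floating-point rounding
    }
}
```

For example, a Boolean gene (weight 1) would be selected far less often than a complex DTO gene with hundreds of fields (high weight).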
If the SUT interacts with a SQL database, genes to represent
INSERTION operations will be automatically added to the tests, in the same way as done in
EvoMaster for RESTful APIs [
24].
4.5 Test Suite Output
In the same context of API testing, we can re-use parts of the EvoMaster test writer to generate the SUT test scaffolding. For example, we use the same initClass for setting up the necessary testing environment (e.g., starting the SUT), tearDown for performing a cleanup after all tests are executed (e.g., shutting down the SUT), and initTest for resetting the state of the SUT, making test executions independent from each other. To enable a more efficient test execution that fits industrial-scale API testing, we extended initTest with our smart database clean procedure, which considers the union of all accessed tables, and their linked tables, over all generated tests.
Regarding handling of action execution and assertion generation, with
EvoMaster, tests are generated with
RestAssured to make HTTP calls toward the tested REST API. This is not applicable in the context of RPC testing. Thus, to support RPC-based API testing, we developed a Test Writer that handles the instantiation of input parameters, RPC function call invocation (based on the RPC client library), and assertions on response objects with JUnit. An example of generated tests can be found at this link.
In our industrial case study, we found that some responses contain info such as timestamps and random tokens, which could change over time. To avoid tests failing due to such flakiness, we defined some general keywords (e.g., date, token, time) to highlight those cases. If any keyword appears in either the datatype, the field name, or a value of string type, the assertion is commented out to avoid the test becoming flaky. We comment assertions out instead of removing them completely since it is still interesting, for the users, to see what the response originally was.
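The keyword-based heuristic can be sketched as follows (keywords from the text; method names are illustrative):

```java
import java.util.List;

// Sketch of the keyword-based flakiness heuristic: assertions on values whose
// field name, datatype, or string content hints at time or tokens are
// commented out rather than removed.
public class FlakyAssertionFilter {

    static final List<String> KEYWORDS = List.of("date", "token", "time");

    static boolean looksFlaky(String fieldName, String typeName, String value) {
        for (String k : KEYWORDS) {
            if (fieldName.toLowerCase().contains(k)
                    || typeName.toLowerCase().contains(k)
                    || (value != null && value.toLowerCase().contains(k))) {
                return true;
            }
        }
        return false;
    }

    // Comment out instead of removing, so the original response stays visible.
    static String render(String assertion, String fieldName, String typeName, String value) {
        return looksFlaky(fieldName, typeName, value) ? "// " + assertion : assertion;
    }
}
```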
In addition, there might exist quite large responses in some API endpoints, especially when dealing with collections of data. For example, in one SUT in our case study, a response contained 470 elements, and each element further contains data with list type, and 7,579 assertions were generated for this response. As such a large number of assertions would reduce the readability of the tests, we then developed a strategy to randomly select only
\(n\) (e.g.,
\(n=2\) ) elements from the returned collections to generate assertions on in the tests. More details on the writer can be found in our open source repository.
Generating this kind of tests has two main advantages. First, as the generated tests are self-contained (they can start and stop the API directly without manual intervention), they can be used for regression testing. Second, they help in debugging any found fault, as each generated test can be run independently, because they take care of initializing and resetting the state of the API (e.g., SQL databases). This feature was critical when analyzing the faults found during our empirical study.
6 Lessons Learned
Automated testing requires a reset of the SUT; however, it is challenging to reset the state of a real industrial API. To enable the generated tests to be used for regression testing, and to properly evaluate the fitness of each test case in isolation, every test needs to be executed with the same state of the SUT (i.e., test case executions should be independent from each other). Thus, a state reset of the SUT must be performed before each test is executed on it, e.g., clearing all data in the database or resetting databases to a specific state. With open source case studies, this is trivial, e.g., cleaning the data in the database. For instance, EvoMaster provides a utility DbCleaner for facilitating the cleaning of data for various types of SQL databases, e.g., Postgres and MySQL. Such a database clean works fine for small-scale applications. However, in large-scale industrial settings, cleaning all data in the database is quite expensive, even when the database is empty. For instance, in one of the industrial APIs used in this article, it takes 5.3 seconds to clean an empty database, and more if data exists. Thus, with 1 hour as the search budget, a fuzzer could execute at most around 680 RPC function calls. This would significantly limit the fuzzer in terms of cost-effectiveness. To better enable our approach in industrial settings, by taking advantage of the existing SQL handling in EvoMaster, we developed an automated smart clean of the database, considering only which tables are actually modified during the search. With the smart clean, after a test is executed, only data in the accessed tables and their linked tables (e.g., via foreign keys) is removed. In addition, we also allow SQL commands/scripts to initialize data in the database (e.g., for username/password authentication info). If a table that has initial data is cleaned, a post action is performed to add the initial data to the table again. With such a smart database clean, we could effectively reduce the time spent by more than 90%, e.g., from 5.3 seconds to 285 milliseconds. This is because there can be tens/hundreds of tables in an industrial API, but only few of them are actually accessed during the execution of a single test. However, how to reset the state of databases with a large amount of existing data still needs to be addressed. Besides the database, the states of directly connected external services also need to be reset. Currently, fuzzing with our approach is performed on the industrial test environment where all services are up and running. In such an environment, the states of external services might vary over time (e.g., leading to failed tests, as discussed in Section 5.5). Mocking techniques could be a potential solution to address this, e.g., setting up specific states of the external services before test execution. However, mocking RPC-based services in microservices is also challenging, e.g., due to network communications and environment setup in industrial settings. It could be considered as important future work.
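The core of the smart clean can be sketched as computing the transitive closure of accessed tables over foreign-key links, then deleting only from those tables. This is a simplified illustration; the actual implementation in EvoMaster must also handle SQL dialect details and the re-insertion of initial data:

```java
import java.util.*;

// Sketch of the "smart clean" idea: delete data only from the tables accessed
// by the last test, plus tables linked to them via foreign keys.
// Names and data structures here are illustrative.
public class SmartDbClean {

    // fkLinks: table -> tables linked to it via foreign keys
    static Set<String> tablesToClean(Set<String> accessed, Map<String, Set<String>> fkLinks) {
        Deque<String> work = new ArrayDeque<>(accessed);
        Set<String> result = new HashSet<>();
        while (!work.isEmpty()) {
            String t = work.pop();
            if (result.add(t)) {
                // follow foreign-key links transitively
                work.addAll(fkLinks.getOrDefault(t, Set.of()));
            }
        }
        return result;
    }

    // Emit one DELETE per table to clean (re-insertion of any initial data,
    // as described above, would happen afterwards).
    static List<String> cleanStatements(Set<String> tables) {
        List<String> sql = new ArrayList<>();
        for (String t : tables) {
            sql.add("DELETE FROM " + t);
        }
        return sql;
    }
}
```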
Real industrial APIs have more complex inputs and apply stricter constraints on input validation, considering various aspects. By checking code coverage and fault detection, we found that most of the covered code and detected faults relate to the implementation of input validation. One reason could be the complexity of the inputs, with cyclic objects and collections in the DTOs. For instance, we found a DTO whose initialization takes more than 2,000 lines of code, and generating a valid input for such a huge DTO is not trivial. In enterprise applications, there often exist several constraints on the inputs when processing their business logic. This can lead to major challenges for automated testing approaches when generating such inputs. The input validation is performed at two levels, i.e., in the schema and in the business logic. The schema level performs simple checks (e.g., null, format, and range) as well as checks on constraints relating multiple fields of the inputs. Although we support the handling of all these constraints defined with javax annotations, this is clearly not enough in industrial settings: not all constraints are fully specified in the interface definitions (e.g., with javax.validation.constraints), as the validation can also be implemented as a utility or with libraries (e.g., com.google.common.base.Preconditions) directly in the code of the business logic. To address this, further white-box handling is required to provide a more effective gradient to cover such code.
Regarding the validation in terms of business logic, it could perform a check with database and external services.
Data preparation in database and mocking external services would be vital in the testing of industrial RPC-based APIs, not only for input validation. For databases, currently our approach employs the
SQL handling in
EvoMaster [
24] for facilitating data preparation in the database. However, as identified in this study, there might exist some limitations in handling industrial settings cost-effectively, e.g., currently
EvoMaster lacks support for composite primary keys. This limits the achievable code coverage. For instance, we found a query action with no input parameters that always fails with a thrown exception; in this case, nothing can be done by manipulating the input parameters. With a manual check on the source code, we found that the query requires data to be present in the database, but such data fails to be generated due to some unsupported SQL features. In addition, with only SQL query heuristics, it might not be cost-effective to build meaningful links between RPC function calls and data inserted into the database. Smart strategies would be required here to handle industrial RPC-based APIs, such as the enhanced SQL handling strategies for REST APIs [
72]. For external services, if we could mock such external services, then the problem might be solved by directly manipulating their responses. Automating such manipulation as parts of the search would be another important challenge.
Another possibility to improve code coverage would be to develop advanced search operators for the RPC domain. For instance, we found that, in the generated tests, function calls in a test may not be related to each other for testing a meaningful scenario. In order to better generate tests with related function calls, we could have strategies to sample function calls by considering dependency among functions (e.g., [
76]) in the context of RPC testing, e.g., based on which SQL tables they do access.
An industrial RPC-based API is often part of a large-scale microservice architecture, closely interacting with multiple other APIs. Such interaction results in a huge search space. To test a single API, or an API in a small-scale microservice system, testing targets (such as lines of code) can be feasible to reach with an empty database (with or without a small amount of data initialized by an SQL script) by manipulating input parameters and the data in the database (e.g., with INSERT). However, testing an industrial API in a microservice system is not like this. As in the example shown in Figure 1, the states of other services and databases often have a strong impact on the processing of the business logic, which in turn affects the achievable code coverage. Therefore, all such possible states should be considered as part of the search. In this article, we provide descriptive statistics for 54 industrial APIs with their #LoCs (1,489,959 in total). All of the APIs are part of one microservice architecture, and there exist hundreds of other APIs that were not used in these experiments. To cope with such a huge complexity of the state, an empty database (as we employed) might limit performance. In addition, as discussed with our industrial partner, they think that it is important to involve their real historical data (collected in production) in the automated testing. Likely, it would improve the chances of covering more of their business scenarios in the generated tests. Furthermore, such tests would be more valuable for them, e.g., they would consider that all faults identified by these tests have a higher priority to be addressed. However, such data is complex and possibly huge, and how to effectively and efficiently utilize it in the search is another research challenge that we will address in the future.
Enabling fuzzers on CI would promote their adoption in industrial settings. Our approach is now integrated into one of the industrial development pipelines (the same used for the experiments we ran in this article), as a trial to check its applicability in the daily testing activities of our industrial partner. Since all services are developed with the same framework, by studying one of the
EvoMaster driver configurations for our approach, our industrial partner has implemented an automated solution to automatically generate such drivers for their services to be tested (e.g., identify all available interfaces and instantiate corresponding clients). For instance, the drivers of the 50 industrial APIs in Table
6 were automatically generated with this automated solution. Regarding the application context, as discussed with our industrial partner, our approach is planned to be employed on the services for generating white-box system tests when the implementation for a requirement of the services is considered as done, as a kind of extra check before putting these new features into production. In addition, the generated tests would be kept for further usage in (1) regression testing of the services and (2) industrial test environment validation as scheduled tasks (e.g., to see whether all services on the pipeline are up and running correctly before QA engineers start manual test sessions).
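The driver-generation step described above (identifying available interfaces and instantiating the corresponding clients) can be sketched with plain Java reflection. Here, OrderService is a hypothetical stand-in for a Thrift-generated client interface, not part of our partner's codebase:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class RpcInterfaceScanner {

    // Hypothetical stand-in for a Thrift-generated client interface;
    // not taken from the industrial codebase.
    public interface OrderService {
        String getOrder(long id);
        boolean cancelOrder(long id, String reason);
    }

    // Enumerate the callable RPC functions of an interface as
    // "name(paramType,...)" signatures, sorted for determinism.
    public static List<String> listFunctions(Class<?> iface) {
        return Arrays.stream(iface.getDeclaredMethods())
                .map(m -> m.getName() + "("
                        + Arrays.stream(m.getParameterTypes())
                                .map(Class::getSimpleName)
                                .collect(Collectors.joining(","))
                        + ")")
                .sorted()
                .collect(Collectors.toList());
    }
}
```

An automated driver generator can scan the classpath for such service interfaces and emit one driver configuration per service, which is what made generating the 50 drivers feasible.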
Flakiness and readability need to be considered when generating tests for industrial APIs. As we found in the industrial APIs, responses can contain information such as timestamps and random tokens, which change over time. To avoid tests failing due to such flakiness, we defined strategies based on general keywords (e.g., date, token, time) observed in the industrial APIs to comment out assertions on such values. How to systematically identify possible sources of flakiness in industrial APIs (e.g., timestamps, results of SQL queries) and properly handle them during test generation is another important problem that researchers should address. When reviewing the generated tests with our industrial partner, we found that test readability requires improvement. This is mainly due to very large blocks of code for input instantiation and the large number of tests in the test suites. As identified in the review, our industrial partner found the tests that lead to thrown exceptions to be the most interesting. Therefore, to improve test readability, we now provide a simple strategy to split such tests into different files (the implementation is straightforward, but it is quite useful for our partner). Further improvements could be achieved by better organizing the code for large input instantiations, and by sorting/splitting tests based on various criteria, e.g., fault classification [55].
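The keyword-based strategy described above can be sketched as follows; the keyword list and method names are illustrative, not the exact implementation in our tool:

```java
import java.util.List;
import java.util.Locale;

public class FlakyAssertionFilter {

    // Keywords observed to correlate with time-varying response fields.
    private static final List<String> KEYWORDS = List.of("date", "token", "time");

    // If the asserted field name matches a flakiness keyword, emit the
    // assertion commented out instead of active.
    public static String render(String fieldName, String assertionLine) {
        String lower = fieldName.toLowerCase(Locale.ROOT);
        for (String k : KEYWORDS) {
            if (lower.contains(k)) {
                return "// " + assertionLine
                        + " // skipped: likely flaky (" + k + ")";
            }
        }
        return assertionLine;
    }
}
```

For example, an assertion on a field named createTime would be emitted commented out, while an assertion on a field named status would be kept active.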
7 Threats to Validity
Conclusion validity. Our study is in the context of SBST, and our experiments were conducted following common guidelines in the literature for assessing randomized techniques [22]. For instance, to account for the stochastic nature of the employed search algorithms, we collected results for all settings with at least 30 repetitions. The results were interpreted with statistical analyses, such as Mann-Whitney-Wilcoxon U-tests (\(p\)-values) and Vargha-Delaney effect sizes (\(\hat{A}_{12}\)) for pairwise comparisons. Regarding fault detection capability, a number of real faults were identified, and those were reviewed together with our industrial partner. Regarding the choice of search budget, since the time cost of executing RPC calls might vary depending on the operating environment (e.g., hardware and OS), we employed a fixed number of RPC calls as the search budget (i.e., 100,000), in order to make our experiments replicable. Studying different search budgets (such as 1 million RPC calls, 1 hour, 24 hours, or 48 hours) might provide more insights and more concrete evidence for drawing conclusions on the choice of search budget (e.g., a larger budget might result in better performance on CS3 and CS4, as discussed in Section 5.4.2). However, it is expensive to conduct such empirical experiments with industrial APIs, as these APIs are typically large-scale and complex. For instance, with one search budget setting (i.e., 100,000 RPC calls), the computational cost of two settings with 30 repetitions was already 129.12 days for the four APIs. Therefore, we consider experiments with various search budgets as possible future work.
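For reference, the Vargha-Delaney \(\hat{A}_{12}\) effect size used in our analysis has a simple standard definition: the probability that a run of one technique yields a higher value than a run of the other, with ties counted as one half. A small sketch (a generic implementation of the standard formula, not our exact analysis script):

```java
public class VarghaDelaney {

    // \hat{A}_{12}: probability that a value sampled from a is larger
    // than one sampled from b, with ties counted as 0.5.
    // 0.5 means no difference; values near 1.0 favor a, near 0.0 favor b.
    public static double a12(double[] a, double[] b) {
        double wins = 0.0;
        for (double x : a) {
            for (double y : b) {
                if (x > y) wins += 1.0;
                else if (x == y) wins += 0.5;
            }
        }
        return wins / (a.length * (double) b.length);
    }
}
```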
Construct validity. To avoid bias in the results among different settings and techniques, all results to be compared were obtained in the same physical environment; e.g., experiments on the artificial case studies were deployed on a local machine, and experiments on the industrial case studies were deployed on the pipeline of our industrial partner.
Internal validity. Our implementation was tested with various unit tests and end-to-end tests, but we cannot guarantee the absence of faults in it. However, our tool and the artificial case studies are open source. This enables further verification of our implementation, and the replication of our experiments on the artificial case studies, by other researchers. Note that, due to the confidential information in the employed industrial case studies, detailed results of these industrial APIs cannot be made publicly available.
External validity. In this study, our approach was assessed with artificial case studies using Thrift, and 54 industrial case studies (from one company) using their own RPC framework, which was originally built on top of Thrift. There might be a threat to the generalizability of our results to other RPC frameworks or other companies. Experiments on real-world industrial APIs show the usefulness and scalability of our novel techniques in practice. However, these results cannot be replicated by other researchers, as such industrial APIs are not publicly available. Collecting and preparing a corpus of non-trivial open source RPC-based APIs for experimentation (e.g., like EMB [4] for RESTful APIs) will be important for future research work.
8 Conclusion
RPC is widely applied in industry for developing large-scale distributed systems, such as microservices. However, automated testing of such systems is very challenging. To the best of our knowledge, there does not exist any tool or solution in the research literature that could enable automated testing of such systems. Therefore, having such a solution with tool support could bring significant benefits to industrial practice.
In this article, we propose the first approach for automated white-box fuzzing of RPC-based APIs, using search-based techniques. To access the RPC-based APIs, the approach extracts the available RPC functions, with RPCInterfaces, from the source code. This enables its adoption for most RPC frameworks in the context of white-box testing. To enable search techniques (e.g., MIO) in the RPC domain, we reformulate the problem and propose additional handling and heuristics specialized for RPC.
The approach is implemented as an open source tool built on top of our
EvoMaster [
3] fuzzer. A detailed empirical study of our novel approach was conducted with two artificial and four industrial APIs, plus a preliminary (e.g., no fault analysis) study on a further 50 APIs. In total, more than a million lines of business code (excluding third-party libraries) were used in this study. When third-party libraries are considered as well (e.g., for carrying out
taint analysis [
26]), several millions of lines of code were analyzed and executed in these experiments.
Results demonstrate the successful application of our novel approach in industrial settings. The tool extension presented in this article is already in daily use in the Continuous Integration systems of Meituan, a large e-commerce enterprise with hundreds of millions of customers. In addition, to evaluate the effectiveness of our approach in the context of white-box search-based testing, we compared it with a gray-box technique. The results show that our approach achieves significant improvements in code coverage. To further evaluate fault detection capability, we carried out an in-depth manual review, together with one employee of our industrial partner, of the tests generated by our novel approach. A total of 41 real faults were identified, which have now been fixed. Another 8,377 detected faults are currently under investigation.
Considering how widely used RPC frameworks such as Apache Thrift, Apache Dubbo, gRPC, and SOFARPC have been in industry over the last decade, it is surprising that such an important software engineering topic has been practically ignored by the research community so far. One possible explanation is the lack of easy access to case studies for researchers, as these kinds of systems are used to build enterprise applications. Such systems are therefore seldom available in open source repositories, or online on the internet (i.e., publicly accessible web services are usually developed as REST APIs). To be able to empirically evaluate our novel techniques, industry collaborations (e.g., with Meituan) were a strong requirement.
Although our tool extension is already of use to practitioners in industry, more needs to be done to achieve better results. Future work will focus on improving white-box heuristics to increase the achieved code coverage, and on handling and analyzing the interactions with external web services.
Our tool extension of
EvoMaster is freely available online on GitHub [
3] and Zenodo (e.g.,
EvoMaster version 1.5.0 [
28]), and the replication package for this study can be found at the following link.
1