We implemented
CompCheck with a combination of Java programs and Python scripts. We used the ASM [
7] Java bytecode instrumentation framework to collect and analyze execution traces, and the XStream [
89] library for saving object states. We stored the knowledge base and object traces in the JSON and XML formats, respectively. We built control flow graphs of clients and implemented caller slicing using Soot [
39]. We implemented the test generation component as an extension of the EvoSuite [
33] search-based test generator. When generating an argument for the caller method of the target API, if the context matches some knowledge and
CompCheck has control over that argument, then
CompCheck loads a stored object instead of searching from scratch. Additionally, we customized EvoSuite to not use any code entity from libraries that are removed during the upgrade. This prevents the target call site from being masked by shallow errors (e.g.,
NoClassDefFoundError), which are less interesting as they can be easily located and fixed. When executing test generation, we set the target class/method for EvoSuite as the class/method containing the call site, using default values for other parameters.
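To make the object-reuse mechanism concrete, the following is a minimal sketch of how recorded object states can be saved and restored with XStream; the class name and storage layout are our own assumptions, and only the use of XStream's toXML/fromXML API reflects the description above.

```java
import com.thoughtworks.xstream.XStream;

// Minimal sketch (class name and layout assumed): argument objects recorded during
// knowledge mining are serialized to XML, and later restored so that the customized
// EvoSuite can reuse them instead of constructing arguments from scratch.
public class ObjectStateStore {
    private final XStream xstream = new XStream();

    public ObjectStateStore() {
        // Recent XStream versions restrict deserialization by default;
        // relax the whitelist so arbitrary recorded types can be restored.
        xstream.allowTypesByWildcard(new String[] {"**"});
    }

    // Serialize a recorded argument object for storage in the knowledge base.
    public String save(Object argumentState) {
        return xstream.toXML(argumentState);
    }

    // Restore a stored object to inject as an argument during test generation.
    public Object load(String xml) {
        return xstream.fromXML(xml);
    }
}
```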
6.1 Impact of Matching Strategies
To study the impact of different matching strategies, we first sampled a set of incompatible APIs from the knowledge base, then manually labeled a set of call sites for each API. Finally, we ran CompCheck’s context matching on these call sites with different matching strategies, comparing the precision and recall of each strategy.
Table
1 shows the backward incompatible APIs used in our experiments. From left to right, the columns show the API IDs, API names and argument types, library names, old version numbers, new version numbers, and the number of known call sites of the APIs in the knowledge base, respectively. Some names are shortened to save space, and an expanded table is available online [
18]. We used all 24 incompatible APIs as the knowledge for the matching experiment, requiring that each knowledge entry have at least two known failing call sites.
We extracted call sites of APIs from the projects we collected in Section
4.1.1—that is, the 1,225 top-starred Maven projects on GitHub. For each API, we went through the client projects and extracted its call sites from the projects that use the API. We kept adding call sites until most APIs in Table
1 had no fewer than five call sites. In total, we collected 202 call sites from 37 client projects, as shown in Table
2. During the collection, we found that many call sites did not expose incompatible behaviors in the module-level regression testing: they were either not covered by the developers’ test cases, not checked by the existing test assertions, or their execution did not trigger errors with the developer-provided inputs. This further motivates our work. When incompatibility issues cannot be exposed by the developer’s existing tests, client developers may utilize
CompCheck to enhance their test suites and check the compatibility of their libraries. For each client project, we show its URL on GitHub, LOC, the number of stars, and the version (SHA) analyzed. We used all 202 call sites as the targets for the matching experiment.
We manually labeled each call site through inspection and test creation. If we could manually create an incompatibility-revealing test for a call site, we labeled it as revealable (+); otherwise, if we failed to create a test to expose its incompatibility, we labeled it as unrevealable (–). As a result, of the 202 call sites, we labeled 104 as revealable and 98 as unrevealable. A call site being revealable means that its incompatibility can be exposed by a manually created test. Therefore, an unrevealable call site is “compatible” in theory, but there could be some noise due to mislabeling. We discuss the limitations of the labeling in Section
6.4. Intuitively, a call site is unrevealable if, given its context, the API’s incompatibility cannot be exposed. A typical example of an unrevealable call site is shown in Figure
2(e). The client calls
kryo.setRegistrationRequired(false) (line 4) to explicitly disable registration checking, so the API call site (line 6) can never trigger the incompatibility (“IllegalArgumentException: Class is not registered”), no matter what arguments are provided to the API.
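Paraphrased as a standalone snippet (the serialized value and the specific Kryo call below are illustrative stand-ins, not the client's actual code), the pattern looks roughly as follows:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayOutputStream;

// Illustrative stand-in for the pattern in Figure 2(e); the serialized object and
// the exact Kryo call are assumptions, not the client's actual code.
public class UnrevealableCallSiteSketch {
    void serialize() {
        Kryo kryo = new Kryo();
        kryo.setRegistrationRequired(false);   // registration checking explicitly disabled
        try (Output output = new Output(new ByteArrayOutputStream())) {
            // Because registration is not required, this call site can never throw
            // "IllegalArgumentException: Class is not registered", regardless of its arguments.
            kryo.writeObject(output, new java.util.Date());
        }
    }
}
```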
We experimented with the following matching strategies:
\(S^*\), disabling switches S1 and S2, enforcing an exact match on the argument context and primitive values;
\(S^*_{\mathrm{poly}}\), enabling S1 and disabling S2;
\(S^*_{\mathrm{prim}}\), disabling S1 and enabling S2;
\(S^*_{\mathrm{poly,prim}}\), enabling both S1 and S2; and
\(S^*_{\mathrm{\lnot coevo}}\), same as
\(S^*\) but using Equation (
1) to compute the confidence score—that is, the calculation is solely based on the number of matched arguments, not utilizing co-evolution information. Note that all the strategies use Equation (
2) to compute the confidence score except
\(S^*_{\mathrm{\lnot coevo}}\). The strategies
\(S^*\),
\(S^*_{\mathrm{poly}}\),
\(S^*_{\mathrm{prim}}\), and
\(S^*_{\mathrm{poly,prim}}\) apply different combinations of switches.
\(S^*_{\mathrm{\lnot coevo}}\) is designed to study whether the co-evolution information can help in matching (by comparing the curves of
\(S^*\) and
\(S^*_{\mathrm{\lnot coevo}}\)).
Table
3 shows the precision, recall, and F
\(_1\) scores of context matching with different matching strategies. A higher F
\(_1\) score indicates a better balance between precision and recall. The precision, recall, and F
\(_1\) scores are computed according to Equations (a), (b), and (c), respectively. If a call site is manually identified as unrevealable but
CompCheck matches it, we consider it as a false positive of matching; if a call site is manually identified as revealable but
CompCheck does not match it, then we consider it as a false negative of matching. From top to bottom in Table
3, each row shows the scores at a given confidence threshold value (from 0.0 to 1.0, in steps of 0.1). The maximal F
\(_1\) score in each row is colored in orange, whereas the global maximal F
\(_1\) score is colored in blue.
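For reference, Equations (a), (b), and (c) are assumed here to be the standard definitions of precision, recall, and F\(_1\), with TP denoting matched revealable call sites, FP matched unrevealable call sites, and FN unmatched revealable call sites:
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]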
The result indicates that the best strategy is \(S^*_{\mathrm{poly,prim}}\) with the confidence threshold set to 0.7 (F\(_1\) score = 0.861). This confirms our intuition that tolerating reasonable differences in context matching can improve the recall without much loss of precision. We used it as our default matching strategy for the experiments for RQ3 through RQ5. Another observation is that \(S^*\) generally performs better than \(S^*_{\mathrm{\lnot coevo}}\), indicating that library co-evolution information can improve the effectiveness of context matching. For \(S^*\) and \(S^*_{\mathrm{\lnot coevo}}\), as the threshold value increases from 0.0 to 1.0, recall keeps decreasing while precision keeps increasing. This is because as the threshold grows, fewer call sites can be matched, but each call site is matched with higher confidence. The other strategies all tolerate differences in the contexts, so they may have false positives even when the confidence threshold is 1.0 and can thus have precision scores lower than 1.00 in the bottom row.
6.2 Effectiveness of Incompatibility Discovery
To answer RQ3 and RQ4, we ran
CompCheck’s incompatibility discovery on 202 call sites collected from the clients in Table
2. For RQ3, we evaluated
CompCheck’s effectiveness based on how many call sites of incompatible APIs it could reveal successfully. For RQ4, to show
CompCheck’s effectiveness compared with existing techniques, we performed an end-to-end comparison on
CompCheck and two other techniques: (1) Sensor [
84], a state-of-the-art technique that detects dependency (library) conflicts by test generation guided by client code analysis, and (2) CIA+SBST, a technique that detects library incompatibility by combining change impact analysis with search-based test generation. In these experiments, we configured
CompCheck to use the default strategy
\(S^*_{\mathrm{poly,prim}}\) for knowledge matching. We give details of each experiment setup later in this section. The experiments were performed on a four-core Intel Core i7-8650 CPU @ 1.90-GHz machine with 16 GB of RAM, running Ubuntu 16.04, with Java 1.8.0_151 and Python 3.7.3.
The end-to-end comparison experiment (RQ4) was conducted on
CompCheck and two other techniques: Sensor and CIA+SBST. Our rationale for selecting these techniques for comparison is as follows. First, Sensor is the most recent state-of-the-art technique for detecting library incompatibilities (named
dependency conflicts in the work of Wang et al. [
84]) in a Java project. From the technical perspective, both
CompCheck and Sensor use program analysis to guide test generation, where the main difference is that when generating tests,
CompCheck utilizes runtime input values recorded in the knowledge base, which previously caused incompatibility in other clients, whereas Sensor relies on the parameter values extracted from the client code under test. The implementation of Sensor is also publicly available. Second, the goal of
differential regression test generation [
16,
25] is quite similar to that of
CompCheck. In fact, it can be used to reveal changes by generating a set of targeted tests. Yet, to the best of our knowledge, the only available tool implementation, EvoSuiteR [
25], is not client-oriented: it is able to generate tests for changed (library) classes but not for target clients relying on them. Therefore, for a fair comparison, we implemented an end-to-end compatibility checking technique on top of EvoSuite, based on the idea of EvoSuiteR, and made it client-oriented with the help of change impact analysis [
74]. We name this combination
CIA+
SBST and use it as another baseline to compare with
CompCheck.
Note that many other existing compatibility checking tools are not comparable with
CompCheck by their nature.
CompCheck is client-oriented and client-specific—that is, it is designed for developers of client applications and only focuses on incompatibility issues that can be manifested at a given client context. Smart alerting [
83] and Dependabot [
19] are client-oriented but not client-specific. They warn client developers whenever there is a known severe bug in the new library version without providing concrete test cases, even if the buggy code is not used by the client. DeBBI [
15] is library-oriented and aims to help library developers test their code more effectively with the support of additional test suites. GemChecker [
17], AURA [
88], and HiMa [
57] aim to detect deprecated APIs in new library versions or discover API migration rules rather than revealing behavioral incompatibility.
The experiment was performed on 202 call sites collected from the clients in Table
2. For
CompCheck’s call site matching, we used its default strategy
\(S^*_{\mathrm{poly,prim}}\). For Sensor, we first modified the
pom.xml files of each client project to include both the old and the new versions of the target library, then forced Maven to load both library files, and manually ran Maven’s debugging mode in each client to make sure both
.jar files were present on the class path. For single-module Maven projects, we ran Sensor in the root directory; for multi-module Maven projects, we ran Sensor in both the root and the sub-module directories containing the call site of the target API and reported the combined results. For cases in which Sensor detected conflicting libraries but did not attempt to generate tests, we ran the standard EvoSuite (the test generator used by Sensor) on top of Sensor’s detection result. If the standard EvoSuite generated any incompatibility-revealing tests, we counted it as a success of Sensor as well.
For CIA+SBST, we ran change impact analysis on both the library and the client at the Java bytecode level. Given an old version and a new version of the library, CIA+SBST uses Chianti’s impact analysis algorithm to identify the impacted code entities of the client, obtaining a list of public methods affected by the changes of the library. Next, the list is passed to the search-based test generator (i.e., EvoSuite), which in turn uses these methods as test generation targets and finally outputs a set of generated tests.
To compare the tools’ effectiveness, the tests generated by all three tools were executed on the old and new library versions separately, and we compared the numbers of incompatibility-revealing tests they generated, that is, tests that passed with the old version but failed with the new version. The more incompatibility issues a tool can reveal with its generated incompatibility-revealing tests, the more effective it is. We set a time budget of 30 minutes per target call site for all the tools. To avoid fluctuations caused by randomness, we ran each tool 10 times and reported the average.
Figure
9(a) shows the results of the experiment. By knowledge matching,
CompCheck matched 128 call sites as targets, and it successfully generated incompatibility-revealing tests for 76 of them. In contrast, Sensor was able to detect incompatibility on 44 call sites in total. CIA+SBST treated all 202 call sites as targets because they are all affected by the library changes, but it managed to generate incompatibility-revealing tests for only 39 of them. Recall that for incompatibility discovery, an incompatibility issue is considered successfully discovered by a tool if the tool can generate an incompatibility-revealing test at the call site of the incompatible API. Therefore, in this experiment,
CompCheck successfully discovered 76 incompatibility issues, whereas Sensor and CIA+SBST discovered 44 and 39 issues, respectively. The result indicates that
CompCheck is more effective than both Sensor and CIA+SBST.
Figure
10 shows an example client method where
CompCheck outperforms Sensor in generating incompatibility-revealing tests. The code snippet is from the XStream [
89] project. In this example, the incompatible API is
DateTimeFormatter.parseDateTime from the Joda-Time [
45] library (line 5), which accepts two input variables:
formatter and
str. The incompatibility issue is that when Joda-Time was upgraded to the new version, the parsing logic for certain date values changed, which can cause exceptions in its clients [
20]. To trigger the incompatible behavior, the str variable must take specific values (the string representations of certain dates), so randomly generated strings can hardly trigger it. In this case,
CompCheck was able to generate incompatibility-revealing tests because it directly reused the string value mined from the test execution of Amazon’s AWS Java SDK [
8] in its knowledge mining phase—
292278994-08-17T07:12:55.807Z, which was also confirmed by the AWS developers in the release notes [
9]. However, Sensor was not able to generate incompatibility-revealing tests, as its
class instance pool [
84] only contains the objects and primitives from the project under test (XStream in this case), and no string value that triggers this incompatibility issue appears in XStream’s code base. As a result, Sensor failed to find useful seeds for its test generation and thus could not reveal this incompatibility issue. This example demonstrates the benefit of
CompCheck’s knowledge-based approach. With the help of past knowledge mined from other projects, it can discover new incompatibility issues in unseen clients.
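For illustration, a generated test of roughly the following shape would reveal this issue; the formatter construction is an assumption (the client's actual formatter in Figure 10 is not reproduced here), whereas the date string is the recorded value discussed above.

```java
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.format.ISODateTimeFormat;
import org.junit.Test;

// Hedged sketch of an incompatibility-revealing test; the formatter choice is an
// assumption, the string literal is the value recorded in the knowledge base.
public class ParseDateTimeRegressionTest {
    @Test
    public void parsesRecordedTimestamp() {
        DateTimeFormatter formatter = ISODateTimeFormat.dateTimeParser();
        // Per the behavior described above, this parses successfully with the old
        // Joda-Time version but throws an exception with the new one.
        formatter.parseDateTime("292278994-08-17T07:12:55.807Z");
    }
}
```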
Out of the 76 incompatibility issues revealed by CompCheck, 44 were manifested as direct incompatibility, 13 as transitive incompatibility, and 19 as co-evolution incompatibility. This indicates that CompCheck is able to reveal incompatibility issues of all the manifestation patterns. In total, CompCheck generated 297 test methods, of which 103 are incompatibility-revealing and 194 are not. CompCheck utilizes EvoSuite as its underlying test generator. In general, EvoSuite generates one test class per call site, where each test class can contain multiple test methods; the number of test methods in a test class varies, and each test method corresponds to one client calling context of the target API.
There are 52 call sites where
CompCheck matched but was not able to generate incompatibility-revealing tests. We inspected them and found that a common reason is that their arguments are too complex to generate within the time budget. For example, Figure
11 shows a client method in the project Heritrix [
5]. The incompatible API is
Base64.decodeBase64 (line 12). The client (caller) method
PersistProcessor.populatePersistEnvFromLog accepts as input a
BufferedReader object. This
BufferedReader object is supposed to be initialized from a text file in which each line contains fields separated by a space. The method then extracts certain contents from the file, processes them, and passes them to the target API. It is almost impossible to construct such a specific
BufferedReader by searching. Furthermore,
CompCheck’s caller slicing cannot be applied to the
BufferedReader argument here, as the target API call site has a data dependency on it. For these call sites, none of the experimented tools was able to generate incompatibility-revealing tests.
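For reference, the sketch below (with hypothetical line contents) illustrates the kind of structured reader the caller expects: space-separated fields per line, one of which is Base64 data that eventually reaches the target API.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Base64;

// Hypothetical illustration of the structured input expected by
// PersistProcessor.populatePersistEnvFromLog: each line carries space-separated
// fields, one of which is Base64-encoded data. A random search is very unlikely
// to synthesize a BufferedReader with this shape within the time budget.
public class PersistLogInputSketch {
    static BufferedReader sampleLog() {
        String encoded = Base64.getEncoder().encodeToString("recorded-state".getBytes());
        return new BufferedReader(new StringReader("http://example.org/page " + encoded + "\n"));
    }
}
```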
6.3 Effectiveness Contributions of Technical Components
For RQ5, to measure the improvement brought by object reusing, we compared the effectiveness of
CompCheck with its variant that has this feature disabled; to measure the improvement brought by the optimizations (i.e., type conversion table and caller slicing), we compared the effectiveness of
CompCheck with its variants that each disable one optimization. As in Section
6.2, in these experiments, we used the default strategy
\(S^*_{\mathrm{poly,prim}}\) for
CompCheck’s knowledge matching. We give details of each experiment setup later in this section. The experiments were performed in the same environment as in Section
6.2.
To evaluate the benefit of object reusing in generating incompatibility-revealing tests, we compared
CompCheck with its variant that only disables object reusing (named
CompCheck\(^{{\it --}}\)). From an implementation perspective,
CompCheck\(^{{\it --}}\) is equivalent to running EvoSuite on the knowledge-matched target call sites. The experiment setup is similar to the previous one. As shown in Figure
9(b), out of the 128 matched call sites,
CompCheck\(^{{\it --}}\) managed to generate incompatibility-revealing tests for 52, whereas
CompCheck succeeded for 76. This indicates that object reusing significantly improves the effectiveness of test generation (24 more issues revealed). As an example, the incompatibility issue of k2 can be exposed only when the API takes as input an object with final fields annotated with the
@Option annotation (an annotation class in Args4j). For plain search-based test generation (used by
CompCheck\(^{{\it --}}\)), such an object is extremely hard to create. However, with object reusing,
CompCheck can reuse a stored object from the knowledge base, cutting the search space of the object and achieving significant savings. Note that here we compare
CompCheck with
CompCheck\(^{{\it --}}\) on 128 call sites because improvement is only possible when the knowledge is matched. For other call sites,
CompCheck behaves exactly the same as
CompCheck\(^{{\it --}}\), making the comparison meaningless.
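To make the example concrete, the following sketch (class and field names are hypothetical) shows the kind of argument object k2 requires; a search-based generator would have to discover both the annotation requirement and a sensible field value, whereas CompCheck can simply deserialize a recorded instance from the knowledge base.

```java
import org.kohsuke.args4j.Option;

// Hypothetical example of an argument object that is hard to synthesize by search:
// its final field carries an Args4j @Option annotation, as described in the text.
public class CliOptions {
    @Option(name = "-out", usage = "output directory")
    private final String outputDir;

    public CliOptions(String outputDir) {
        this.outputDir = outputDir;
    }

    public String getOutputDir() {
        return outputDir;
    }
}
```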
To measure the benefit of
CompCheck’s optimizations in incompatibility discovery, we compared
CompCheck with its two variants,
CompCheck\(^{{\it --table}}\) and
CompCheck\(^{{\it --slice}}\), each disabling one optimization—type conversion table and caller slicing, respectively (both optimizations are enabled by default in
CompCheck). As shown in Figure
9(b), both variants perform less effectively than
CompCheck, which confirms that both optimizations improve the effectiveness of incompatibility discovery. The type conversion table and caller slicing helped solve 5 and 20 cases, respectively.
6.4 Threats to Validity
Our evaluation is subject to the following threats to validity.
External. Projects that we used for the experiments may not be representative of all software projects. To mitigate this threat, we chose popular open source projects from GitHub that use Maven—a widely used project management tool for Java—as their build system. Our current implementation of CompCheck supports only Java, but our methodology and workflow can apply to any programming language. However, for weakly typed or untyped languages, such as Python and JavaScript, the matching strategy would need to be redesigned, as type information cannot be used to control the matching sensitivity.
The knowledge base used may not be representative. We limited our experiments to a knowledge base built from the latest versions of the libraries at the time of the study. For the library pairs, the old versions are usually more than a year old and the new versions are the most recent releases (median age of 5.5 months) at the time of experimenting. The age of the clients does not matter for knowledge mining, since we upgrade their libraries automatically. We focused on high-starred projects on GitHub, as they generally have high test coverage and representative usages of libraries. By considering the latest versions of the libraries, we simulate the scenario in which knowledge is mined right after a library is released. In a realistic setting, such as deploying CompCheck in a CI environment, one would monitor the releases of the libraries of interest and perform knowledge mining after every new release, resulting in a much richer knowledge base over time.
Internal. When manually labeling the call sites of incompatible APIs for the matching strategy experiment, the call sites labeled as revealable are guaranteed to be revealable, as we could create tests for them; but for the call sites labeled as unrevealable, we cannot guarantee that they are unrevealable under all possible inputs, because we could not enumerate all input combinations. Thus, mislabeling some revealable call sites as unrevealable is a possible threat. As a result, when calculating the precision and recall values, our computation may be an under-approximation of the precision and an over-approximation of the recall. In our experiment, on all the call sites that were manually labeled as unrevealable, none of the experimented tools was able to generate an incompatibility-revealing test. Therefore, we did not observe mislabeled call sites in practice.
Test flakiness may affect the soundness of
CompCheck. In knowledge mining, if a client test fails intermittently,
CompCheck may treat it as an incompatibility failure by mistake. To mitigate this threat, in the module-level regression testing phase, for each client, the initial clean test run without upgrading any library was executed three times, and
CompCheck accepts the project only if the test suite passes in all three runs. In this way, we can alleviate the harm of “shallow” flaky tests that are easy to expose. For incompatibility discovery,
CompCheck uses EvoSuite as the test generator, which avoids generating flaky tests by design [
26]. Additionally, we manually inspected all the incompatibility-revealing tests generated by
CompCheck and confirmed all the failures were indeed caused by incompatibility instead of flakiness.
The CompCheck implementation or the scripts we wrote to run the experiments may contain bugs. To mitigate this threat, we reviewed our code thoroughly and wrote unit tests.