
Evaluating Automated Unit Testing in Sulu

2008 International Conference on Software Testing, Verification, and Validation

Roy Patrick Tan and Stephen Edwards
Department of Computer Science, 660 McBryde Hall (0106), Blacksburg, Virginia
{rtan, edwards}@cs.vt.edu

Abstract

Sulu is a programming language designed with automated unit testing specifically in mind. One aim of Sulu is to demonstrate how automated software testing can be more integrated into current software development processes. Sulu's runtime and tools support automated testing from end to end, automating the generation, execution, and evaluation of test suites using both code coverage and mutation analysis. To show the effectiveness of this integrated approach, we performed an experiment to evaluate a family of test suites generated by a test case generation algorithm that exhaustively enumerates every sequence of method calls within a certain bound. The results show high code coverage, including 90% statement coverage, and high mutation coverage for the most comprehensive test suite generated.

1. Introduction

The essence of software testing is the comparison of the actual execution of a piece of software against that piece of software's expected behavior. As such, any attempt at automating the whole process of unit testing involves the mechanical generation of test cases that will exercise the software unit, the execution of these test cases, and an automated mechanism for determining whether the software behaved as expected. However, to be of any practical use, the software development professional must also be able to measure the thoroughness of the automatically generated test suite.

To realize the vision of an integrated platform for automated testing, the authors have designed a programming language called Sulu, implemented an interpreter for it, and constructed automated testing tools that can generate tests automatically, run these tests, and evaluate them via code coverage and mutation analysis metrics. The goal in the development of Sulu is to facilitate the integration of automated unit testing tools for programs written in this language. The Sulu programming language provides a novel outlook: it was developed with a constant eye toward its effect on unit testing. In this paper we discuss some of the main features of the Sulu language and the tools developed for automating the process of unit testing in Sulu. We then describe an experiment to assess the practical effectiveness of our approach by running automatically generated test suites against a reference set of small software components. We evaluate the resulting data, and then conclude with some discussion of the results of the experiment, as well as some related work.

2. The Sulu Language and Tools

Sulu is a programming language influenced mainly by the Resolve [21] language and discipline and by the Java Modeling Language (JML) [16], particularly the work by Cheon and his colleagues [8] on the JMLUnit method for automated unit testing. Sulu is named after an island in the Philippines; in pre-colonial times, Sulu was a center for barter trade in Southeast Asia, and the authors thought the name apt, since swapping [14, 24] is one of the language's major features.

While many of the testing techniques in this paper do not require Sulu, we designed the language to bring together disparate elements of a truly automated testing system. The Sulu system includes an interpreter for the language, a pluggable mechanism for automatic test case generation, a test execution tool that uses design-by-contract specifications as a test oracle, a code coverage profiler, and an extensible mutation testing tool.

Figure 1 is a diagram of the overall architecture of the Sulu tools for automated testing. The only input required by these tools is the component under test, which is written in the Sulu programming language. A Sulu software component is composed of a design-by-contract specification of its behavioral requirements (called a concept) and its implementation.

Figure 1. The Sulu automated testing system
2.1. Specifying Software in Sulu

Figure 2 is an example of a specification for a typical Stack component. The specification for the stack here is fairly standard: we model a stack component as a mathematical sequence of items, with the top of the stack as the first element of the sequence. The initially section on line 3 gives the initial state of the model: an empty sequence. After every method signature, one can place design-by-contract preconditions in a requires clause and postconditions in an ensures clause. The contract for the pop method on line 9, for example, requires that the sequence is not empty when the method is called, and ensures that the sequence in the pre-state (the old sequence) is the same as the sequence in the post-state with the popped element added to the beginning. The old(...) construct denotes that the expression between the parentheses is evaluated in the state before the method is called; note that this is not a function call. All procedures in Sulu are attached to objects as methods. Although not shown in the figure, Sulu also supports class invariants with an invariant section.

A property of these specifications is that they can be executed. When we model a stack as a sequence of items, we can actually construct a programmatic sequence of items so that we can check, for example, that the pop method decreases the length of the sequence by one and produces the right object.

 1  concept Stack( Item ) {
 2
 3    initially this.getSequence().length() == 0;
 4
 5    method push( element: Item )
 6      ensures this.getSequence() ==
 7        old( this.getSequence().insertFirst( element.getMathObject() ) );
 8
 9    method pop( element: Item )
10      requires this.getSequence().length() > 0
11      ensures old( this.getSequence() ) ==
12        this.getSequence().insertFirst( element.getMathObject() );
13
14    method length(): Int
15      ensures length.equals( this.getSequence().length() );
16
17    model method getSequence():
18      concept Sequence( concept MathObject() );
19  }

Figure 2. A Stack specification

A Sulu concept is similar to a Java interface: it defines the method signatures that must be implemented, and there can be many differing implementations of that concept. An implementation in Sulu is called a realization.
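Because the model is executable, a postcondition check can serve directly as a test oracle. The following Python sketch (illustrative only; it is not Sulu and not part of the Sulu tool set, and the class and method names are stand-ins) mirrors the Figure 2 contracts for push and pop on a toy stack, raising a violation when a postcondition does not hold and treating a violated precondition as an invalid test case, in the sense described in the next subsection.

    # A minimal Python sketch (not Sulu) of contract checking as a test oracle.
    # The Stack below is a stand-in implementation; get_sequence() plays the role
    # of the model method getSequence() in the Figure 2 concept.

    class PreconditionViolation(Exception):
        """Raised when the caller breaks the contract: the test case is invalid."""

    class PostconditionViolation(Exception):
        """Raised when the implementation breaks the contract: the test case fails."""

    class Stack:
        def __init__(self):
            self._items = []              # model: a sequence of items, top of stack first

        def get_sequence(self):
            return list(self._items)      # snapshot of the abstract model value

        def push(self, element):
            old_seq = self.get_sequence()                 # pre-state
            self._items.insert(0, element)                # the implementation under test
            # ensures this.getSequence() == old(this.getSequence().insertFirst(element))
            if self.get_sequence() != [element] + old_seq:
                raise PostconditionViolation("push")

        def pop(self):
            # requires this.getSequence().length() > 0
            if not self.get_sequence():
                raise PreconditionViolation("pop")
            old_seq = self.get_sequence()                 # pre-state
            element = self._items.pop(0)                  # the implementation under test
            # ensures old(this.getSequence()) == this.getSequence().insertFirst(element)
            if old_seq != [element] + self.get_sequence():
                raise PostconditionViolation("pop")
            return element

With checks of this form in place, a test runner only needs to execute a sequence of calls and map the outcome to passed, failed (a postcondition violation), or invalid (a precondition violation); no hand-written expected results are required.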
2.2. Tools for Generation and Execution

Automatic test case generators take the specification (concept) and produce one or more test suites, which can be run against the implementation (realization) by the test execution tool. In Sulu, test case generators are implemented as plug-ins, so developers can construct their own test case generators. For this research, we built a test case generator that can generate a family of test suites by exhaustive enumeration of method calls. Both automatically generated test suites and manually written ones are in the same format, and thus the test case execution tool can run manually and automatically generated test suites with equal ease.

As a test case is being executed, we have to determine whether the software under test has behaved correctly or not (i.e., the oracle problem [13]). We use the design-by-contract postconditions as the test oracle. If a postcondition fails during a test case, we flag that test case as failed, i.e., a bug was found. If no postcondition fails, we say that the test case passed: the software under test behaved correctly. In addition to classifying a test case as passed or failed, Sulu also classifies some test cases as invalid. A test case is invalid when it exercises the component under test in a manner that is prohibited by its specification, i.e., it violates a precondition.

2.3. Evaluation of Test Suites

After test cases are generated and executed, we want some metric of the effectiveness of the test suite. Sulu uses two different kinds of coverage criteria to measure this: code coverage and mutation coverage. The code coverage profiler measures statement, decision, and condition/decision [15] coverage.

Mutation analysis involves seeding bugs into otherwise acceptable code. Typically, a program is transformed via a mutation operator that produces a set of programs, each of which is identical to the original aside from a single change. A mutation operator may change a plus to a minus, for example; if a program contained three addition operators, the mutant generator would generate three programs, each converting one addition to a subtraction. A mutation analysis tool runs each test suite against every mutant; if the test suite detects the bug (a test case fails), the mutant is said to be killed. Every mutation operator may be capable of producing a large number of mutants, and thus a judicious selection of mutation operators must be used to make mutation analysis feasible.

The Sulu mutation analysis framework is extensible; it allows programmers to implement their own mutation operators by extending an iterator-like abstract class that lets the mutation analysis tool access every generated mutant. For the purposes of this research, however, we implemented close variants of a set of mutation operators reported by Andrews and his colleagues [2] that have some evidence of corresponding to real bugs.
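As a rough illustration of how execution and evaluation fit together, the sketch below (Python; the helper names are hypothetical, not the Sulu API) runs a test suite against each mutant and reports the kill ratio; execution against a mutant stops at the first failing test, as described above.

    # Illustrative sketch of the mutation-analysis loop (not the Sulu implementation).
    # Each mutant differs from the original component by a single seeded change;
    # a test returns True if it passed and False if it failed.

    def kill_ratio(mutants, test_suite):
        killed, total = 0, 0
        for mutant in mutants:
            total += 1
            for test in test_suite:
                if not test(mutant):       # a test failed: the seeded bug was detected
                    killed += 1
                    break                  # stop running the suite against this mutant
            # if every test passed, the mutant survived
        return killed / total if total else 0.0

    # Toy example with hypothetical mutants and tests
    mutants = [1, 2, 3]
    suite = [lambda m: m < 3, lambda m: m != 2]
    print(kill_ratio(mutants, suite))      # 0.666..., i.e. two of three mutants killed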
3. Evaluating the Effectiveness of Exhaustive Enumeration of Method Sequences

Once the software tools described in the previous section were constructed, we were able to perform an evaluation of our automated unit testing strategy. Our evaluation involves these steps:

1. Implement a test generation algorithm
2. Select a set of reference software components
3. Generate test suites for every software unit
4. Execute test suites and gather code coverage information
5. Select (and implement, if necessary) a set of mutation operators
6. Run every relevant test suite against each mutant and gather mutation coverage

3.1. Test Case Generation

The Sulu interpreter allows for different implementations of test case generators. The implementation we used for this experiment is a test case generator that, given a number n, generates a test suite that exhaustively enumerates every sequence of n method calls. In addition to the sequence of method calls, a test case needs to provide inputs to methods that have parameters. Since every Sulu variable has a default value, we can automatically provide one value for every parameter of most Sulu method calls. It is often the case, however, that different parameter values may result in different behavior, and thus uncover more bugs. Our test case generator can therefore generate test cases that enumerate all n-sequences of method calls for k possible different parameter values for every formal parameter in a method call. That is, if there are p formal parameters in a sequence of method calls, we generate k^p test cases for that sequence. Because of this combinatorial explosion, we can only generate test cases where n and k are low. We also encounter the problem of generating parameter values other than the default value for every parameter. Currently, our test case generator generates stubs where the testing professional can fill in additional parameter values for every parameter type encountered in the software component.

We generated six different test suites from our test case generator: test suites corresponding to each test case being a single method call, a pair of method calls, and three method calls; and, for each of those, a test suite with one parameter value and a test suite with two parameter values for each formal parameter. For the rest of this paper, we name each test suite by the words "Singles", "Pairs", and "Triples" to denote the length of the method sequence, followed by a number to denote the number of values passed for each parameter. Thus "Triples1" denotes a test suite where every test case is a sequence of 3 method calls and one value is passed for every parameter. Table 1 shows the number of test cases generated for each test suite. Because we are generating test suites from method signatures, test suites generated for different realizations of the same concept are essentially identical.

When generating test suites where two possible parameter values are entered for every parameter, the programmer must fill in two example values per parameter type. For arbitrary objects, we use integers as the actual type, and the parameter values are either 0 or 1. Some methods of the components we tested also took selftype parameters [5, 6]; that is, parameter values that have the same type as the object on which the method is called. Since these are all collection objects, for the selftype values we use either the default value (an empty collection) or the default value with 0 inserted into it.

Table 1. Number of tests generated for each software component

Concept      Singles1   Pairs1   Triples1   Singles2   Pairs2   Triples2
BinaryTree       3          9        27        17        289      4913
List             5         25       125        11        121      1331
Map              9         81       729        17        239      4913
Stack            3          9        27         5         25       125
Sorter           5         25       125         7         49       343
Vector          18        324      5832        41       1681     68921
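The following Python sketch (illustrative, not the Sulu generator; the method descriptions are hypothetical) shows the underlying enumeration: all sequences of n method calls, crossed with all k^p combinations of candidate argument values. For the Stack concept of Figure 2 (push, pop, and length, with one candidate value per parameter) it yields 27 triples, matching the Triples1 column for Stack in Table 1.

    # Sketch of bounded exhaustive test-case enumeration (not the Sulu generator).
    # methods: list of (name, arity) pairs; values: the k candidate argument values.
    # Every test case is a sequence of n calls, with all argument combinations enumerated.

    from itertools import product

    def enumerate_test_cases(methods, n, values):
        for signature in product(methods, repeat=n):            # every n-sequence of methods
            arity_total = sum(arity for _, arity in signature)   # p formal parameters in the sequence
            for args in product(values, repeat=arity_total):     # k**p argument combinations
                test_case, used = [], 0
                for name, arity in signature:
                    test_case.append((name, args[used:used + arity]))
                    used += arity
                yield test_case

    # Hypothetical description of the Stack concept from Figure 2:
    # push and pop each take one parameter, length takes none.
    stack_methods = [("push", 1), ("pop", 1), ("length", 0)]

    triples1 = list(enumerate_test_cases(stack_methods, n=3, values=[0]))
    print(len(triples1))   # 27, matching the Triples1 entry for Stack in Table 1

    singles2 = list(enumerate_test_cases(stack_methods, n=1, values=[0, 1]))
    print(len(singles2))   # 5, matching the Singles2 entry for Stack in Table 1

With two candidate values per parameter, the same enumeration yields 5, 25, and 125 test cases for singles, pairs, and triples of Stack method calls, matching the Singles2, Pairs2, and Triples2 entries for Stack in Table 1.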
3.2. Component Selection

To be able to evaluate test suites generated by this strategy, we need a set of reference software components to use as inputs to the test case generation. Table 2 lists the ten Sulu components (concepts and realizations) used in this evaluation. These software components are collection abstract data types, meant to represent some of the most common data structures and algorithms used by programmers. They can broadly be separated into two sets: components derived from the Resolve family of components, and software units derived from java.util classes.

The binary tree, list, stack, and sorting machines come from the Resolve heritage, while the vector and hashtable components are based on Java. The Sorter components recast the sorting algorithm into a sorting machine that makes it more component-based, following loosely the work by Weide et al. [23]. Typical usage of these components would be calling the insert method for every item in the list to be sorted, then calling sort, then finally calling remove repeatedly, where remove removes the smallest element in the list. There are four different sorting algorithms implemented, representing different strategies of where the work of sorting is distributed. BubbleSort keeps the objects in arbitrary order when inserted and does all the work on the call to sort. ListBased keeps the list in sorted order on insert (essentially an insertion sort). The MinFinding realization keeps the list in arbitrary order and locates the smallest item on every call to remove. Finally, HeapBased keeps the objects in a heap structure, so the work is distributed between the insert and remove methods.

The remaining three components, the Hashtable-based map and the two Vector components, are based on the java.util library. The Vector components implement most of the Java List interface, while the Hashtable implements most of the Java Map interface.

Table 2. Sulu reference components used for evaluation

Concept      Realization
BinaryTree   Standard
List         TwoStacks
Map          Hashtable
Stack        LinkedList
Sorter       BubbleSort
Sorter       HeapBased
Sorter       ListBased
Sorter       MinFinding
Vector       ArrayBased
Vector       LinkedList

3.3. Evaluation Criteria

Sulu provides tools to evaluate the effectiveness of the automated testing tools using both code coverage and mutation analysis. Three different code coverage criteria are measured: statement coverage, decision coverage, and condition/decision coverage [15]. Statement coverage is simply a count of the number of statements executed versus the total number of statements in the component under test. Decision coverage counts, for the boolean expression controlling each if statement and while loop, whether it evaluates to true at least once and to false at least once. That is, every while loop or if statement can be given a maximum of 2 points:

0 if the conditional is never evaluated;
1 if it evaluates to either true only or false only;
2 if it evaluates to true at one point and false at another point in time.

Condition/decision coverage extends the idea of condition coverage by considering complex boolean expressions that are composed of several boolean inputs. The condition/decision criterion states that not only does the decision have to be true at some point and false at another, every input to the decision should also be true at least once and false at least once. Thus, for example, if the conditional is A && B || C, our condition/decision coverage counts 8 possible points: a maximum of two points for every input variable, and another two points for the decision itself.

Table 3. Code coverage information for Sulu components

Concept      Realization   Methods   Statements   Decision count   C/D count
BinaryTree   Standard          3          19              2              4
List         TwoStacks         5          15              6             12
Map          Hashtable         9         133             44            116
Sorter       BubbleSort        3          29             10             20
Sorter       HeapBased         5          63             22             46
Sorter       ListBased         5          18              6             14
Sorter       MinFinding        5          13              4              8
Stack        LinkedList        5          14              0              0
Vector       ArrayBased       18         139             60            138
Vector       LinkedList       18         157             50            120
TOTAL                         76         600            204            478
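To make the point counting concrete, the short Python sketch below (illustrative; not the Sulu profiler, and the function name is hypothetical) scores condition/decision coverage for the example decision A && B || C from the observed truth values of its inputs: each of A, B, C, and the decision itself can earn one point for evaluating to true and one for evaluating to false, for the maximum of 8 points mentioned above.

    # Sketch of condition/decision scoring for the example decision (A and B) or C.
    # observations: list of (A, B, C) boolean triples seen during test execution.

    def cd_points(observations):
        seen = {name: set() for name in ("A", "B", "C", "decision")}
        for a, b, c in observations:
            seen["A"].add(a)
            seen["B"].add(b)
            seen["C"].add(c)
            seen["decision"].add((a and b) or c)
        return sum(len(values) for values in seen.values())   # maximum of 8 points

    print(cd_points([(True, True, False)]))    # 4 of 8 points
    print(cd_points([(True, True, False),
                     (False, False, False),
                     (False, False, True)]))   # 8 of 8 points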
In addition to code coverage, Sulu also provides a mutation analysis tool. The tool allows programmers to add mutation operators at will, but we implemented the following:

1. Change an integer constant to one of 0, 1, -1, or its negation
2. Change an arithmetic operator to another arithmetic operator
3. Change a comparison operator to another comparison operator
4. Change a boolean operator to another boolean operator
5. Force the conditional of an if statement or while loop to evaluate to either true or false
6. Delete a statement

The first five are variants of Offutt's set of sufficient mutation operators [20]; the last one is taken from Andrews [2], where the "delete statement" operator is added to the set. Table 4 shows the number of mutants generated per component for every mutation operator. Every generated test suite is run against every mutant. When all the tests pass, the mutant is said to have survived, meaning the test suite is not strong enough to catch the mutant. If one of the tests fails, testing of that mutant is terminated and the mutant is deemed killed. We then gather the number of mutants killed versus the total number of mutants generated.

Table 4. Number of mutants generated for each component and mutation operator

Concept      Realization   Del stmt   Chg arith   Chg comp   Chg bool   Chg const   Force t/f
Sorter       MinFinding        13          4          10          0          2           4
Stack        LinkedList        13          0           0          0          6           0
List         TwoStacks         15          4          15          0          0           6
Sorter       ListBased         18          4          10          3          2           6
BinaryTree   Standard          19          8           5          0          8           2
Sorter       BubbleSort        28          8          20          1         13          10
Sorter       HeapBased         63          8          60          1         17          22
Map          Hashtable        133         60          85         23        103          44
Vector       ArrayBased       139         88         140         13        100          60
Vector       LinkedList       156         52         115         15         57          50
TOTAL                         597        236         460         56        308         204
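As an illustration of what a single mutation operator does, the sketch below implements a "change arithmetic operator" mutator over Python source using the standard ast module; the Sulu framework applies analogous operators to Sulu code through its iterator-like abstract class, so this is a stand-in, not the actual implementation. Each arithmetic operation in the input yields one mutant.

    # Illustrative "change arithmetic operator" mutation operator for Python source
    # (requires Python 3.9+ for ast.unparse). One mutant is produced per operation.

    import ast

    SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add, ast.Mult: ast.Div, ast.Div: ast.Mult}

    def arithmetic_mutants(source):
        original = ast.parse(source)
        count = sum(1 for node in ast.walk(original)
                    if isinstance(node, ast.BinOp) and type(node.op) in SWAPS)
        for index in range(count):
            mutated = ast.parse(source)       # fresh copy of the tree for each mutant
            binops = [node for node in ast.walk(mutated)
                      if isinstance(node, ast.BinOp) and type(node.op) in SWAPS]
            binops[index].op = SWAPS[type(binops[index].op)]()   # seed a single change
            yield ast.unparse(mutated)

    for mutant in arithmetic_mutants("def area(w, h):\n    return w * h + w"):
        print(mutant)   # two mutants: one with '-' in place of '+', one with '/' in place of '*'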
3.4. Results

Figure 3 shows a parallel coordinate graph of code and mutation coverage for each test suite, aggregated over all the components. The axes labeled statement, decision, and cond./dec. depict the percentage of code covered; the other axes show the percentage of mutants killed. The results show what we expected: all pairs of method calls is superior to all single method calls, and all triples covers more than all pairs. Also, a test suite that takes two parameter values does better than the corresponding test suite that takes only one parameter value. Quite notable is the fact that all triples with two parameters (Triples2) reaches 90% statement coverage and close to 85% decision and condition/decision coverage. Our most comprehensive test suite also achieves high mutation coverage, with kill ratios above or nearly 75% for four of the six mutation operators.

Figure 3. Aggregate code and mutation coverage (parallel coordinate plot; axes: Statement, Decision, Cond./Dec., DelStmt, ForceTF, ChgArith, ChgComp, ChgInt, ChgBool; one line per test suite: Singles1, Singles2, Pairs1, Pairs2, Triples1, Triples2)

In addition, we performed statistical analyses to answer the question of how likely it is that these coverage differences were due to chance. To determine at a 95% confidence level that the differences in coverage are not due to random chance, we performed a one-way analysis of variance. At α = 0.05, the analysis of variance found that significant differences exist between the test suites on all our coverage measures. We also performed Tukey's test as a post-hoc analysis on every pair of test suites, for every coverage measure; this allows us to determine which specific pairs of test suites are significantly different. Table 5 shows the results of every pairwise Tukey's test.

Each sub-table represents Tukey's test performed on all pairs of test suites for a certain measure of thoroughness. The numbers to the right of each sub-table are the cumulative percent of coverage. Each column of letters represents a cluster of means for which we cannot reject the null hypothesis; that is, for each column, the test suites associated with the same letter cannot be said to be significantly different from each other. Pairs that do not share membership in any cluster are significantly different at α = 0.05.

Table 5. Tukey's test results for each comparison metric; numbers on the right of each sub-table are percent covered; test suites not connected by the same letter are significantly different at α = 0.05

statement                        decision                         condition/decision
Triples2  A        90.3          Triples2  A        85.8          Triples2  A        85.1
Pairs2    A        85.0          Pairs2    A        77.5          Pairs2    A B      78.0
Triples1  A        82.7          Triples1  A        69.6          Triples1  A B      68.8
Pairs1    A        77.8          Pairs1    B        62.7          Pairs1    B C      62.9
Singles2  B        59.7          Singles2  B        42.2          Singles2  C D      43.5
Singles1  B        54.7          Singles1  B        32.5          Singles1  D        33.7

delete statement                 force t/f                        chg. arithmetic op
Triples2  A        78.6          Triples2  A        75.5          Triples2  A        74.6
Pairs2    A        69.2          Pairs2    A        65.7          Pairs2    A B      66.9
Triples1  B        46.6          Triples1  B        43.1          Triples1  B C      48.7
Pairs1    B        45.6          Pairs1    B        39.7          Pairs1    C D      42.4
Singles2  B C      31.2          Singles2  B C      25.5          Singles2  D E      28.4
Singles1  C        24.3          Singles1  C        19.1          Singles1  E        15.7

chg. comparison op               chg. integer constant            chg. boolean op
Triples2  A        66.3          Triples2  A        65.3          Triples2  A        78.6
Pairs2    A        55.4          Pairs2    A B      53.6          Pairs2    A B      73.2
Triples1  B        35.2          Triples1  B        45.8          Triples1  A B C    60.7
Pairs1    B C      31.7          Pairs1    B C      39.0          Pairs1    A B C    58.9
Singles2  C D      18.3          Singles2  C D      25.6          Singles2  B C      46.4
Singles1  D        14.6          Singles1  D        19.2          Singles1  C        39.3

These results do not show a significant difference between the Triples2 and Pairs2 test suites. However, we know that, given the same parameters, Triples2 subsumes Pairs2; that is, Triples2 contains within it all the test cases of Pairs2, and as such any additional coverage attained by Triples2 is beyond that covered by Pairs2. Figure 4 shows the subsumption relationships between the test suites: for every path in the graph, each succeeding node contains all the test cases of every test suite in the preceding nodes. We are therefore particularly interested in the specific pairs of test suites where there is no subsumption relationship, and thus where their relative strengths are unknown. By examining the lattice structure shown in Figure 4, we determine that three pairs do not have this subsumption relationship: (Pairs1, Singles2), (Singles2, Triples1), and (Pairs2, Triples1).

Figure 4. Subset relationships between test suites

At α = 0.05, Tukey's test shows that Triples1 is significantly better than Singles2 in all code coverage measures and in two mutation coverage measures: "change arithmetic operators" and "change comparison operators". Pairs1 is significantly better than Singles2 only on the statement coverage measure. Table 6 summarizes the results of comparing Singles2 against Triples1 and against Pairs1.

Table 6. Percent covered: Singles2 versus Pairs1 and Triples1; starred values indicate a significant difference against Singles2 at α = 0.05

Coverage metric          Singles2   Pairs1   Triples1
statement                  59.7      *77.8    *82.7
decision                   42.2       62.7    *69.6
condition/decision         43.5       63.0    *68.8
delete statement           31.1       45.6     46.6
force T/F                  25.4       39.7     43.1
change arithmetic op       28.4       42.4    *48.7
change comparison op       18.2       31.7    *35.2
change int constant        25.6       38.9     45.8
change boolean op          46.4       58.9     60.7

For Triples1 and Pairs2, there is no significant difference between these test suites on the code coverage metrics, but Pairs2 is significantly better than Triples1 on three of the mutation metrics: delete statement, change comparison operators, and force conditionals to true or false. This is indicative of the balance that must be struck between the length of the method call sequences and the number of parameter values passed into the methods. We summarize the comparison of Pairs2 against Triples1 in Table 7.

Table 7. Percent covered: Triples1 versus Pairs2; starred values indicate measures where Pairs2 is significantly better at α = 0.05

Coverage metric          Triples1   Pairs2
statement                  82.7      85.0
decision                   69.6      77.5
condition/decision         68.8      78.0
delete statement           46.6     *69.2
force T/F                  43.1     *65.7
change arithmetic op       48.7      66.9
change comparison op       35.2     *55.4
change int constant        45.8      53.6
change boolean op          60.7      71.4
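For readers who want to reproduce this style of analysis, the sketch below (Python, using scipy and statsmodels; the coverage numbers shown are hypothetical placeholders, not our data) performs the same two steps: a one-way ANOVA across test suites followed by Tukey's HSD post-hoc test at α = 0.05.

    # Illustrative sketch of the statistical procedure (not the authors' scripts):
    # one-way ANOVA across test suites, then Tukey's HSD post-hoc comparisons.
    # `coverage` maps each suite to per-component statement-coverage scores
    # (hypothetical numbers for illustration only).

    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    coverage = {
        "Singles1": [55, 52, 58, 50],
        "Pairs1":   [78, 75, 80, 77],
        "Triples2": [91, 89, 92, 90],
    }

    # One-way ANOVA: are any suite means different at alpha = 0.05?
    f_stat, p_value = f_oneway(*coverage.values())
    print(f"ANOVA p-value: {p_value:.4f}")

    # Tukey's HSD: which specific pairs of suites differ significantly?
    scores = [score for group in coverage.values() for score in group]
    labels = [name for name, group in coverage.items() for _ in group]
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))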
3.5. Threats to Validity

No experiment is perfect; it is the responsibility of the researchers to identify the threats to the validity of their work. These threats relate mainly to the size, number, and composition of the main inputs to the study.

Internal validity is threatened by the small sample size. With just 10 components, we could only conclude that the all-triples-with-2-parameter-values test suite is better than every other test suite in 3 of the 9 metrics, and we found no significant differences among the test suites for the "change boolean operator" mutant kill ratio. A larger experiment may be used as further confirmation of our conclusions.

External validity is affected by the type and size of the software components we used. The set of reference components used here are collection classes, which suggests that the automated test case generation strategy works well for software components similar to those in our reference set, but the set may not be representative of other kinds of software components (e.g., components that deal with I/O). The reference components themselves are small, with relatively few lines of code, so our conclusions on the effectiveness of the test suites may not be applicable to much larger software units. A larger experiment with more varied kinds of components may tell us whether our conclusions still hold for a larger population of software.

Construct validity may be threatened by the coverage metrics we use; that is, our coverage measures may not be related to real bugs. However, there is evidence from work by Andrews and his colleagues [1, 2] that our mutation operators correlate with real bugs. And while often criticized for not being a strong measure of the testing effort, our code coverage metrics are nevertheless widely used in industry.

A further threat to external validity is the practicality of applying these techniques to real-world software development practices. One of the major difficulties of our technique is time: the number of test cases we generate increases exponentially as we increase the length of the sequence and the number of parameter values. In addition, mutation testing is also time intensive. In the next section, we explore some ways of managing the cost of running and measuring the effectiveness of tests.
4. Related Work

Various automated test case generation strategies have been proposed and implemented over the years [4, 7, 8, 11, 17]. Our approach to generating unit tests extends earlier work by one of the authors [12]. The test generation strategy in this paper can be considered a form of bounded exhaustive testing [18, 22]. However, Marinov and Khurshid, as well as Sullivan et al., generate complex structures directly, while our generation strategy is more fundamentally black-box, where complex structures may only arise as a consequence of a sequence of method calls.

Our runtime checking of specifications is most accurately described as design by contract, following Meyer [19]. Our use of design-by-contract specifications as a test oracle employs the technique used in Cheon and Leavens's [8] work on JMLUnit. Other mechanisms for detecting differences between specification and implementation have been proposed, including executing algebraic specifications [3].

Enumerating all sequences of three method calls, while showing good coverage, is also very expensive. It would benefit the tester if we were somehow able to prune the number of test cases but still obtain the same coverage. One way to do this is to remove redundant test cases. Invoking a method call more than once on the same object state is redundant; Xie and colleagues [25] devised a framework for determining equivalent object states, which may be applied as future work to eliminate redundant test cases. Another mechanism for pruning test cases is to pre-identify invalid sequences of method calls. Cheon and Perumandla [9] devised an approach to specify allowed sequences of method calls, variations of which may be useful as a future addition to our specification language.

Measuring how thoroughly test cases exercise the software under test is a venerable problem in software testing. Zhu et al. [26] provide a solid overview of various test adequacy criteria, including code coverage and mutation coverage. Andrews and his colleagues [2] report some correspondence between high mutant kill ratios and the ability of test cases to find real bugs. We adopted the set of mutation operators reported in the work by Andrews et al. (in turn derived from the work by Offutt et al. [20]) in this research. Alternatively, the effectiveness of test cases may be evaluated by running test cases against software components with bugs that appear "in the wild". For example, d'Amorim and his colleagues [10] use student-created software, and Leitner and his colleagues [17] ran their automated testing mechanism on production-level software (and uncovered heretofore unknown bugs). Evaluating test case generators in this manner can be a very convincing way of measuring the effectiveness of a test case generation strategy in general. The problem remains, however, of measuring the thoroughness of a particular test suite for a particular software component. That is, a programmer may well ask whether a test case generation strategy, while perhaps measured to be "good" for a reference set of components, is actually effective for the particular software he is testing. Leitner et al. [17] also emphasized, in their work on AutoTest, the integration between human-written tests and automatically generated ones. Our research emphasizes the role of the evaluation of test cases in guiding the direction of manual tests.

5. Conclusion

The Sulu language and tools provide an integrated view of unit testing. It is a view where the various stages of testing (generation, execution, and evaluation) are integrated into a single platform. It is also a view where the testing process is integrated with the programming language, where testing issues are tackled from the design of the language up to the execution of the software written in that language.
The results of the experiment show that our approach to automated unit testing, even with the fairly basic strategy of enumerating all sequences of method calls of a certain short length, can produce high coverage scores for many test adequacy criteria. The automatically generated test suites provide a baseline set of tests that augments manual testing. Automated unit testing frees the test developer to concentrate on the more complex bugs not covered by the automated tests. This is also part of the integrated vision of Sulu, where human-written and mechanically generated tests complement each other to provide better quality software.

References

[1] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In ICSE '05: Proceedings of the 27th International Conference on Software Engineering, pages 402-411, 2005.
[2] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin. Using mutation analysis for assessing and comparing testing coverage criteria. IEEE Transactions on Software Engineering, 32(8):608-624, 2006.
[3] S. Antoy and D. Hamlet. Automatically checking an implementation against its formal specification. IEEE Transactions on Software Engineering, 26(1):55-69, 2000.
[4] C. Boyapati, S. Khurshid, and D. Marinov. Korat: Automated testing based on Java predicates. In ISSTA '02: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 123-133, New York, NY, USA, 2002. ACM Press.
[5] K. Bruce, L. Cardelli, G. Castagna, G. T. Leavens, and B. Pierce. On binary methods. Theory and Practice of Object Systems, 1(3):221-242, 1995.
[6] K. B. Bruce, A. Schuett, R. van Gent, and A. Fiech. PolyTOIL: A type-safe polymorphic object-oriented language. ACM Transactions on Programming Languages and Systems, 25(2):225-290, 2003.
[7] Y. Cheon. Automated random testing to detect specification-code inconsistencies. Technical Report 07-07, The University of Texas at El Paso, February 2007.
[8] Y. Cheon and G. T. Leavens. A simple and practical approach to unit testing: The JML and JUnit way. In ECOOP '02: Proceedings of the 16th European Conference on Object-Oriented Programming, pages 231-255, London, UK, 2002. Springer-Verlag.
[9] Y. Cheon and A. Perumandla. Specifying and checking method call sequences of Java programs. Software Quality Journal, 15(1):7-25, 2007.
[10] M. d'Amorim, C. Pacheco, T. Xie, D. Marinov, and M. D. Ernst. An empirical comparison of automated generation and classification techniques for object-oriented unit testing. In ASE '06: Proceedings of the 21st IEEE International Conference on Automated Software Engineering, pages 59-68, Washington, DC, USA, 2006. IEEE Computer Society.
[11] R.-K. Doong and P. G. Frankl. The ASTOOT approach to testing object-oriented programs. ACM Transactions on Software Engineering and Methodology, 3(2):101-130, 1994.
[12] S. H. Edwards. Black-box testing using flowgraphs: An experimental assessment of effectiveness and automation potential. Software Testing, Verification and Reliability, 10(4):249-262, 2000.
[13] M.-C. Gaudel. Testing can be formal, too. In TAPSOFT '95: Proceedings of the 6th International Joint Conference CAAP/FASE on Theory and Practice of Software Development, pages 82-96, London, UK, 1995. Springer-Verlag.
[14] D. E. Harms and B. W. Weide. Copying and swapping: Influences on the design of reusable software components. IEEE Transactions on Software Engineering, 17(5):424-435, May 1991.
[15] K. J. Hayhurst, D. S. Veerhusen, J. J. Chilenski, and L. K. Rierson. A practical tutorial on modified condition/decision coverage. Technical report, NASA, 2001.
[16] G. T. Leavens, A. L. Baker, and C. Ruby. JML: A notation for detailed design. In H. Kilov, B. Rumpe, and I. Simmonds, editors, Behavioral Specifications of Businesses and Systems, pages 175-188. Kluwer Academic Publishers, 1999.
[17] A. Leitner, I. Ciupa, B. Meyer, and M. Howard. Reconciling manual and automated testing: The AutoTest experience. In HICSS '07: Proceedings of the 40th Annual Hawaii International Conference on System Sciences, page 261a, Washington, DC, USA, 2007. IEEE Computer Society.
[18] D. Marinov and S. Khurshid. TestEra: A novel framework for automated testing of Java programs. In ASE '01: Proceedings of the 16th IEEE International Conference on Automated Software Engineering, page 22, Washington, DC, USA, 2001. IEEE Computer Society.
[19] B. Meyer. Applying 'design by contract'. Computer, 25(10):40-51, 1992.
[20] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf. An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology, 5(2):99-118, 1996.
[21] M. Sitaraman and B. Weide. Component-based software using RESOLVE. SIGSOFT Software Engineering Notes, 19(4):21-22, 1994.
[22] K. Sullivan, J. Yang, D. Coppit, S. Khurshid, and D. Jackson. Software assurance by bounded exhaustive testing. In ISSTA '04: Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 133-142, New York, NY, USA, 2004. ACM.
[23] B. W. Weide, W. F. Ogden, and M. Sitaraman. Recasting algorithms to encourage reuse. IEEE Software, 11(5):80-88, 1994.
[24] B. W. Weide, S. M. Pike, and R. Hinze. Why swapping? In Proceedings of the RESOLVE Workshop 2002. Virginia Tech, 2002.
[25] T. Xie, D. Notkin, and D. Marinov. Rostra: A framework for detecting redundant object-oriented unit tests. In ASE '04: Proceedings of the 19th IEEE International Conference on Automated Software Engineering, pages 196-205, 2004.
[26] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit test coverage and adequacy. ACM Computing Surveys, 29(4):366-427, 1997.