Evaluating Automated Unit Testing in Sulu
Roy Patrick Tan and Stephen Edwards
Department of Computer Science, Virginia Tech
660 McBryde Hall (0106)
Blacksburg, Virginia
{rtan, edwards}@cs.vt.edu
Abstract
Sulu is a programming language designed with automated unit testing specifically in mind. One aim of Sulu is to demonstrate how automated software testing can be more integrated into current software development processes. Sulu's runtime and tools support automated testing from end to end, automating the generation, execution, and evaluation of test suites using both code coverage and mutation analysis. To show the effectiveness of this integrated approach, we performed an experiment to evaluate a family of test suites generated by a test case generation algorithm that exhaustively enumerates every sequence of method calls within a certain bound. The results show high code coverage, including 90% statement coverage, and high mutation coverage for the most comprehensive test suite generated.
1. Introduction

The essence of software testing is the comparison of the actual execution of a piece of software against that piece of software's expected behavior. As such, any attempt at automating the whole process of unit testing involves the mechanical generation of test cases that will exercise the software unit, the execution of these test cases, and an automated mechanism for determining whether the software behaved as expected. To be of practical use to the software development professional, however, such a system must also be able to measure the thoroughness of the automatically generated test suite.

To realize the vision of an integrated platform for automated testing, the authors have designed a programming language called Sulu, implemented an interpreter for it, and constructed automated testing tools that can generate tests automatically, run these tests, and evaluate them via code coverage and mutation analysis metrics. The goal in the development of Sulu is to facilitate the integration of automated unit testing tools for programs written in this language. The Sulu programming language provides a novel outlook: it was developed with a constant eye towards its effect on unit testing.

In this paper we discuss some of the main features of the Sulu language and the tools developed for automating the process of unit testing in Sulu. We then describe an experiment to assess the practical effectiveness of our approach by running automatically generated test suites against a reference set of small software components. We evaluate the resulting data, and then conclude with some discussion of the results of the experiment, as well as some related work.
2. The Sulu Language and Tools

Sulu is a programming language influenced mainly by the Resolve [21] language and discipline and the Java Modeling Language (JML) [16], particularly work by Cheon and his colleagues [8] on the JMLUnit method for automated unit testing. Sulu is named after an island in the Philippines; in pre-colonial times, Sulu was a center for barter trade in Southeast Asia, and the authors thought it apt, since swapping [14, 24] is one of the language's major features.

While many of the testing techniques in this paper do not require Sulu, we designed the language to bring together disparate elements of a truly automated testing system. The Sulu system includes an interpreter for the language, a pluggable mechanism for automatic test case generation, a test execution tool that uses design-by-contract specifications as a test oracle, a code coverage profiler, and an extensible mutation testing tool.

Figure 1 is a diagram of the overall architecture of the Sulu tools for automated testing. The only input required by these tools is the component under test, which is written in the Sulu programming language. A Sulu software component is composed of a design-by-contract specification of its own behavioral requirements (called a concept) and its implementation.
Figure 1. The Sulu automated testing system
2.1. Specifying Software in Sulu
Figure 2 is an example of a specification for a typical
Stack component. The specification for the stack here is
fairly standard; we model a stack component as a mathematical sequence of items, with the top of the stack as the
first element of the sequence. The initially section on
line 3 provides the initial state of the model: an empty sequence.
After every method signature, one can put design-by-contract preconditions in a requires clause and postconditions in an ensures clause. The contract for the pop method on line 9, for example, requires that the sequence is not empty when the method is called, and ensures that the sequence in the pre-state (the old sequence) is the same as the sequence in the post-state with the popped element added to the beginning.
The old(...) construct is used to denote that the expression encapsulated between the parentheses is evaluated
in the state before the method is called. Note that this is
not a function call. All procedures in Sulu are attached to
objects as methods. Although not shown in the figure, Sulu
also supports class invariants with an invariant section.
A property of these specifications is that they can be executed. When we model a stack as a sequence of items, it is possible for us to actually construct a programmatic sequence of items so that we can check, for example, that the pop method actually decreases the length of the sequence by one and produces the right object.
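As an illustration only (not Sulu code), the following Python sketch shows how an executable model check of this kind might look for the push and pop contracts of Figure 2, using a Python list as the mathematical sequence; the class and method names here are hypothetical.

    # Hypothetical illustration of checking contracts against a mathematical
    # model (a Python list standing in for the Sequence model, top at index 0).
    class ModelStack:
        def __init__(self):
            self.seq = []                          # model: sequence of items

        def push(self, item):
            old_seq = list(self.seq)               # old(...) snapshot
            self.seq.insert(0, item)               # implementation under test
            assert self.seq == [item] + old_seq    # ensures clause

        def pop(self):
            assert len(self.seq) > 0               # requires clause
            old_seq = list(self.seq)               # old(this.getSequence())
            item = self.seq.pop(0)                 # implementation under test
            # ensures: old sequence == popped item prepended to new sequence
            assert old_seq == [item] + self.seq
            return item

    s = ModelStack()
    s.push("a"); s.push("b")
    assert s.pop() == "b"          # pop returns the most recently pushed item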
A Sulu concept is similar to a Java interface: it defines the method signatures that must be implemented, and there can be many differing implementations of that concept. An implementation in Sulu is called a realization.
2.2. Tools for Generation and Execution

Automatic test case generators then take the specification (concept) and produce one or more test suites, which can be run against the implementation (realization) by the test execution tool. In Sulu, test case generators are implemented as plug-ins, so developers can construct their own test case generators. For this research, we have built a test case generator that can generate a family of test suites by exhaustive enumeration of method calls. Both automatically generated test suites and manually written ones are in the same format, and thus the test case execution tool can run either manually or automatically generated test suites with equal ease.

As a test case is being executed, we have to determine whether the software under test has behaved correctly or not (i.e., the oracle problem [13]). We use the design-by-contract postconditions as the test oracle. If a postcondition fails because of a test case, we flag that test case as failed; that is, a bug was found. If the postcondition does not fail, we say that the test case passed: the software under test behaved correctly. In addition to classifying a test case as passed or failed, Sulu also classifies some test cases as invalid. A test case is invalid when it exercises the component under test in a manner that is prohibited by its specification, i.e., it violates a precondition.
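The following Python sketch (an illustration under our own naming, not the Sulu implementation) shows one way such a contract-based oracle can classify a test case as passed, failed, or invalid.

    # Hypothetical contract-checking oracle: preconditions distinguish invalid
    # test cases from genuine failures signalled by postconditions.
    class PreconditionError(Exception): pass
    class PostconditionError(Exception): pass

    def run_test_case(test_case):
        """Return 'passed', 'failed', or 'invalid' for one generated test case."""
        try:
            test_case()                    # executes a sequence of method calls
        except PreconditionError:
            return "invalid"               # the test violated a requires clause
        except PostconditionError:
            return "failed"                # an ensures clause caught a bug
        return "passed"

    def pop(stack):
        if len(stack) == 0:
            raise PreconditionError("pop requires a non-empty stack")
        old = list(stack)
        item = stack.pop(0)
        if old != [item] + stack:
            raise PostconditionError("pop ensures clause violated")
        return item

    print(run_test_case(lambda: pop([])))        # invalid
    print(run_test_case(lambda: pop([1, 2])))    # passed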
2.3. Evaluation of Test Suites

After test cases are generated and executed, we want to have some metric of the effectiveness of the test suite. Sulu uses two different test coverage criteria to measure this: code coverage and mutation coverage. The code coverage profiler measures statement, decision, and condition/decision [15] coverage.

Mutation analysis involves seeding bugs into otherwise acceptable code. Typically, a program is transformed via a mutation operator that produces a set of programs, each of which is identical to the original aside from a single change. A mutation operator may change a plus to a minus, for example; if a program contained three addition operators, the mutant generator would generate three programs, each converting one addition to a subtraction. A mutation analysis tool runs each test suite against every mutant; if the test suite detects the bug (a test case fails), the mutant is said to be killed. Every mutation operator may be capable of producing a large number of mutants, and thus a judicious selection of mutation operators must be used to make mutation analysis feasible.

The Sulu mutation analysis framework is extensible; it allows programmers to implement their own mutation operators by extending an iterator-like abstract class that allows the mutation analysis tool to access every generated mutant. For the purposes of this research, however, we implemented close variants of a set of mutation operators reported by Andrews and his colleagues [2] that have some evidence of corresponding to real bugs.
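As a minimal sketch (in Python, not Sulu, and not the paper's actual framework), a mutation operator can be modeled as an iterator that yields one single-change mutant at a time, for example turning each "+" into a "-":

    # Hypothetical "change arithmetic operator" mutation operator sketch:
    # yields one mutant program per occurrence of '+', each differing from the
    # original by exactly one change.
    def plus_to_minus_mutants(source: str):
        for i, ch in enumerate(source):
            if ch == "+":
                yield source[:i] + "-" + source[i + 1:]

    original = "total = a + b + c"
    for mutant in plus_to_minus_mutants(original):
        print(mutant)
    # total = a - b + c
    # total = a + b - c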
 1  concept Stack( Item ) {
 2
 3    initially this.getSequence().length() == 0;
 4
 5    method push( element: Item )
 6      ensures this.getSequence() ==
 7        old( this.getSequence().insertFirst(element.getMathObject()) );
 8
 9    method pop( element: Item )
10      requires this.getSequence().length() > 0
11      ensures old( this.getSequence() ) ==
12        this.getSequence().insertFirst( element.getMathObject() );
13
14    method length(): Int
15      ensures length.equals( this.getSequence().length() );
16
17    model method getSequence(): concept Sequence( concept MathObject() );
18
19  }

Figure 2. A Stack specification
3. Evaluating the Effectiveness of Exhaustive Enumeration of Method Sequences
Once the software tools described in the previous section were constructed, we were able to perform an evaluation of our automated unit testing strategy. Our evaluation involves these steps:

1. Implement a test generation algorithm
2. Select a set of reference software components
3. Generate test suites for every software unit
4. Execute test suites and gather code coverage information
5. Select (and implement, if necessary) a set of mutation operators
6. Run every relevant test suite against each mutant and gather mutation coverage

3.1. Test Case Generation

The Sulu interpreter allows for different implementations of test case generators. The implementation we used for this experiment is a test case generator that, given a number n, generates a test suite that exhaustively enumerates every sequence of n method calls. In addition to the sequence of method calls, a test case needs to provide inputs to methods that have parameters. Since every Sulu variable has a default value, we can automatically provide one value
for every parameter of most Sulu method calls. It is often the case, however, that different parameter values may result in different behavior, and thus uncover more bugs. Our test case generator can therefore generate test cases that enumerate all n-sequences of method calls with k possible parameter values for every formal parameter in a method call. That is, if there are p formal parameters in a sequence of method calls, we generate k^p test cases for that sequence. Because of this combinatorial explosion, we can only generate test cases where n and k are low.

We also encounter the problem of generating parameter values other than the default value for every parameter. Currently, our test case generator generates stubs where the testing professional can fill in additional parameter values for every parameter type encountered in the software component.
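To make the enumeration concrete, the following Python sketch (an illustration, not the Sulu generator) enumerates all sequences of n method calls over a component's method signatures, with k candidate values per formal parameter, yielding k^p test cases per sequence; the signatures here are taken from the Stack concept of Figure 2.

    # Hypothetical sketch of bounded-exhaustive test case generation:
    # every sequence of n method calls, crossed with k values per parameter.
    from itertools import product

    # Method signatures of the Stack component: name -> number of parameters.
    SIGNATURES = {"push": 1, "pop": 1, "length": 0}
    CANDIDATE_VALUES = [0, 1]                     # up to k = 2 values per parameter

    def generate_test_cases(n, k):
        values = CANDIDATE_VALUES[:k]
        for call_sequence in product(SIGNATURES, repeat=n):    # all n-sequences
            p = sum(SIGNATURES[m] for m in call_sequence)      # total parameters
            for args in product(values, repeat=p):             # k^p argument tuples
                yield call_sequence, args

    # For the 3-method Stack, "Pairs2" corresponds to n=2, k=2.
    print(sum(1 for _ in generate_test_cases(2, 2)))           # prints 25,
    # which matches the Stack / Pairs2 entry in Table 1.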
We generated six different test suites from our test case generator: test suites corresponding to each test case being a single method call, a pair of method calls, and three method calls; and, for each of those, a test suite with one parameter value and a test suite with two parameter values for each formal parameter. For the rest of this paper, we name each test suite by the words "Singles", "Pairs", and "Triples" to denote the length of the method sequence, followed by a number to denote the number of values passed for each parameter. Thus "Triples1" denotes a test suite where every test case is a sequence of 3 method calls, and one value is passed for every parameter.

Table 1 shows the number of test cases generated for each test suite. Because we are generating test suites from method signatures, test suites generated for different realizations of the same concept are essentially identical.
Concept       Singles1   Pairs1   Triples1   Singles2   Pairs2   Triples2
BinaryTree           3        9         27         17      289       4913
List                 5       25        125         11      121       1331
Map                  9       81        729         17      239       4913
Stack                3        9         27          5       25        125
Sorter               5       25        125          7       49        343
Vector              18      324       5832         41     1681      68921

Table 1. Number of tests generated for each software component
When generating test suites where two possible parameter values are entered for every parameter, the programmer must fill in two example values per parameter type. For arbitrary objects, we use integers as the actual type, and the set of parameter values is either 0 or 1. Some methods of the components we tested also took selftype parameters [5, 6]; that is, parameter values that have the same type as the object on which the method is called. Since these are all collection objects, for the selftype values we use either the default value (an empty collection) or the default value with 0 inserted into it.

3.2. Component Selection

To be able to evaluate test suites generated by this strategy, we need a set of reference software components to use as inputs to the test case generation. Table 2 lists the ten Sulu components (concepts and realizations) used in this evaluation.

These software components are collection abstract data types, meant to represent some of the most common data structures and algorithms used by programmers. They can broadly be separated into two sets: components derived from the Resolve family of components, and software units derived from java.util classes.
The binary tree, list, stack, and sorting machines come from the Resolve heritage, while the vector and hashtable components are based on Java.

The Sorter components recast the sorting algorithm into a sorting machine that makes it more component-based, following loosely the work by Weide et al. [23]. Typical usage of these components would be calling the insert method for every item in the list to be sorted, then calling sort, then finally calling remove repeatedly, where remove removes the smallest element in the list.

There are four different sorting algorithms implemented, representing different strategies of where the work of the sorting is distributed. BubbleSort keeps the objects in arbitrary order when inserted, and does all the work on the call to sort. ListBased keeps the list in sorted order on insert (essentially an insertion sort). The MinFinding realization keeps the list in arbitrary order, and locates the smallest item on every call to remove. Finally, HeapBased keeps the objects in a heap structure, so the work is distributed between the insert and the remove methods.

The remaining three components, the Hashtable-based map and the two Vector components, are based on the java.util library. The Vector components implement most of the Java List interface, while the Hashtable implements most of the Java Map interface.
Concept      Realization
BinaryTree   Standard
List         TwoStacks
Map          Hashtable
Stack        LinkedList
Sorter       BubbleSort
Sorter       HeapBased
Sorter       ListBased
Sorter       MinFinding
Vector       ArrayBased
Vector       LinkedList

Table 2. Sulu reference components used for evaluation
3.3. Evaluation Criteria
Sulu provides tools to evaluate the effectiveness of the automated testing tools using both code coverage and mutation analysis. Three different code coverage criteria are measured: statement coverage, decision coverage, and condition/decision coverage [15]. Statement coverage is simply a count of the number of statements executed versus the total number of statements in the component under test. Decision coverage counts whether every boolean decision in an if statement or while loop evaluates to true at least once and to false at least once. That is, every while loop or if statement can be given a maximum of 2 points:
0 for unevaluated conditionals;

1 if evaluated to either true only, or false only;

2 if evaluated to true at one point and false at another point in time.
Condition/decision coverage extends this idea by considering complex boolean expressions that are composed of several boolean inputs. The condition/decision criterion states that not only does the decision have to be true at some point and false at another, but every input to the decision should also be true at least once and false at least once. Thus, for example, if the conditional is A && B || C, our condition/decision coverage counts 8 possible points: a maximum of two points for every input variable, and another two points for the decision itself.

In addition to code coverage, Sulu also provides a mutation analysis tool. The tool allows programmers to add mutation operators at will, but we implemented the following:

1. Change an integer constant to one of 0, 1, −1, and its negation
2. Change an arithmetic operator to another arithmetic operator
3. Change a comparison operator to another comparison operator
4. Change a boolean operator to another boolean operator
5. Force an if statement or while loop to evaluate to either true or false
6. Delete a statement

The first five are variants of Offutt's set of sufficient mutation operators [20]; the last one is taken from Andrews [2], where the "delete statement" operator is added to the set. Table 4 shows the number of mutants generated per component for every mutation operator.

Every generated test suite is run against every mutant. When all the tests pass, the mutant is said to have survived, meaning the test suite is not strong enough to catch the mutant. If one of the tests fails, the testing is terminated, and the mutant is deemed killed. We then gather the number of mutants killed versus the total number of mutants generated.
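A minimal sketch of this kill-ratio computation, in Python and under hypothetical test and mutant interfaces rather than the Sulu tool's actual API, might look like this:

    # Hypothetical mutation-analysis loop: a mutant is killed as soon as any
    # test in the suite fails; otherwise it survives.
    def suite_kills(test_suite, mutant_program):
        for test in test_suite:
            if test(mutant_program) == "failed":
                return True                  # stop at the first failing test
        return False                         # every test passed: mutant survived

    def kill_ratio(test_suite, mutants):
        killed = sum(1 for m in mutants if suite_kills(test_suite, m))
        return killed / len(mutants) if mutants else 0.0

    # Toy example: "programs" are functions; tests compare them with an oracle.
    original = lambda a, b: a + b
    mutants = [lambda a, b: a - b, lambda a, b: a + b + 1, lambda a, b: a + b]
    suite = [lambda prog: "failed" if prog(2, 2) != original(2, 2) else "passed"]
    print(kill_ratio(suite, mutants))        # 0.66... (the third mutant survives)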
3.4. Results

Figure 3 shows a parallel coordinate graph of code and mutation coverage for each test suite, aggregated over all the components. The axes labeled statement, decision, and cond./dec. depict the percentage of code covered; the other axes are in percent mutants killed. The results show what we expected: all pairs of method calls is superior to all single method calls, and all triples cover more than all pairs. Also, a test suite that takes in two parameter values does better than the corresponding test suite that only takes one parameter value. Quite notable is the fact that the Triples2 suite achieves 90% statement coverage, and close to 85% decision and condition/decision coverage. Our most comprehensive test suite also achieves high mutation coverage, with kill ratios above or near 75% for four of the six mutation operators.

In addition, we performed statistical analyses to answer the question of how likely these coverage differences were due to chance. To determine at a 95% confidence level that the differences in coverage are not due to random chance, we performed a one-way analysis of variance. At α = 0.05, the analysis of variance found that significant differences exist between the test suites on all our coverage measures. We also performed Tukey's test as a post-hoc analysis on every pair of test suites, for every coverage measure; this allows us to determine which specific pairs of test suites are significantly different.

Table 5 shows the results of every pairwise Tukey's test. Each sub-table represents Tukey's test performed on all pairs of test suites for a certain measure of thoroughness. The numbers to the right of each sub-table are the cumulative percent of coverage. Each column of letters represents a cluster of means for which we cannot reject the null hypothesis; that is, for each column, the test suites that share a letter cannot be said to be significantly different from each other. Pairs that do not share membership in any cluster are significantly different at α = 0.05.

These results do not show a significant difference between the Triples2 and Pairs2 test suites. However, we know that, given the same parameters, Triples2 subsumes Pairs2. That is, Triples2 contains within it all the test cases of Pairs2, and as such we know that any additional coverage in Triples2 is beyond that covered by Pairs2.

Figure 4 shows the subsumption relationships between the test suites: for every path in the graph, each succeeding node contains all the test cases of every test suite in the preceding nodes. We are therefore particularly interested in the specific pairs of test suites where there is no subsumption relationship, and thus where their relative strengths are unknown. By examining the lattice structure shown in Figure 4, we determine that these three pairs do not have this subsumption relationship: (Pairs1, Singles2), (Singles2, Triples1), and (Pairs2, Triples1).

At α = 0.05, Tukey's test shows that Triples1 is significantly better than Singles2 in all code coverage measures, and in two mutation coverage measures: "change arithmetic operators" and "change comparison operators". Pairs1 is significantly better than Singles2 on the statement coverage measure. Table 6 summarizes the result of comparing Singles2 against Triples1, and Singles2 against Pairs1.

For Triples1 and Pairs2, there is no significant difference between these test suites on the code coverage metrics, but Pairs2 is significantly better than Triples1 in three of the mutation metrics: delete statement, change comparison operators, and force conditionals to true or false. This is indicative of the balance that must be struck between the length of the method call sequences and the number of parameter values that are passed into the methods. We summarize the comparison of Pairs2 against Triples1 in Table 7.
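The subsumption lattice of Figure 4 can be summarized by a simple partial order on (sequence length, parameter values) pairs. The following Python sketch is our own formalization, under the assumption that suite (n1, k1) contains every test case of suite (n2, k2) whenever n1 >= n2 and k1 >= k2; it recovers the three incomparable pairs listed above.

    # Hypothetical formalization of the test-suite subsumption lattice:
    # suite (n1, k1) subsumes (n2, k2) iff n1 >= n2 and k1 >= k2.
    from itertools import combinations

    SUITES = {
        "Singles1": (1, 1), "Singles2": (1, 2),
        "Pairs1":   (2, 1), "Pairs2":   (2, 2),
        "Triples1": (3, 1), "Triples2": (3, 2),
    }

    def subsumes(a, b):
        (n1, k1), (n2, k2) = SUITES[a], SUITES[b]
        return n1 >= n2 and k1 >= k2

    incomparable = [(a, b) for a, b in combinations(SUITES, 2)
                    if not subsumes(a, b) and not subsumes(b, a)]
    print(incomparable)
    # [('Singles2', 'Pairs1'), ('Singles2', 'Triples1'), ('Pairs2', 'Triples1')]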
Concept      Realization   methods   statements   decision count   c/d count
BinaryTree   Standard            3           19                2           4
List         TwoStacks           5           15                6          12
Map          Hashtable           9          133               44         116
Sorter       BubbleSort          3           29               10          20
Sorter       HeapBased           5           63               22          46
Sorter       ListBased           5           18                6          14
Sorter       MinFinding          5           13                4           8
Stack        LinkedList          5           14                0           0
Vector       ArrayBased         18          139               60         138
Vector       LinkedList         18          157               50         120
TOTAL                           76          600              204         478

Table 3. Code coverage information for Sulu components
Concept      Realization   del stmt   chg arith   chg comp   chg bool   chg const   force t/f
Sorter       MinFinding          13           4         10          0           2           4
Stack        LinkedList          13           0          0          0           6           0
List         TwoStacks           15           4         15          0           0           6
Sorter       ListBased           18           4         10          3           2           6
BinaryTree   Standard            19           8          5          0           8           2
Sorter       BubbleSort          28           8         20          1          13          10
Sorter       HeapBased           63           8         60          1          17          22
Map          Hashtable          133          60         85         23         103          44
Vector       ArrayBased         139          88        140         13         100          60
Vector       LinkedList         156          52        115         15          57          50
TOTAL                          597         236        460         56         308         204

Table 4. Number of mutants generated for each component and mutation operator
[Figure 3: parallel coordinate plot. Axes: Statement, Decision, Cond./Dec., DelStmt, ForceTF, ChgArith, ChgComp, ChgInt, ChgBool (percent, 0–100); one line per test suite: Singles1, Singles2, Pairs1, Pairs2, Triples1, Triples2.]

Figure 3. Aggregate code and mutation coverage
statement
  Triples2   A        90.3
  Pairs2     A        85.0
  Triples1   A        82.7
  Pairs1     A        77.8
  Singles2   B        59.7
  Singles1   B        54.7

decision
  Triples2   A        85.8
  Pairs2     A        77.5
  Triples1   A        69.6
  Pairs1     B        62.7
  Singles2   B        42.2
  Singles1   B        32.5

condition/decision
  Triples2   A        85.1
  Pairs2     A B      78.0
  Triples1   A B      68.8
  Pairs1     B C      62.9
  Singles2   C D      43.5
  Singles1   D        33.7

delete statement
  Triples2   A        78.6
  Pairs2     A        69.2
  Triples1   B        46.6
  Pairs1     B        45.6
  Singles2   B C      31.2
  Singles1   C        24.3

force t/f
  Triples2   A        75.5
  Pairs2     A        65.7
  Triples1   B        43.1
  Pairs1     B        39.7
  Singles2   B C      25.5
  Singles1   C        19.1

chg. arithmetic op
  Triples2   A        74.6
  Pairs2     A B      66.9
  Triples1   B C      48.7
  Pairs1     C D      42.4
  Singles2   D E      28.4
  Singles1   E        15.7

chg. comparison op
  Triples2   A        66.3
  Pairs2     A        55.4
  Triples1   B        35.2
  Pairs1     B C      31.7
  Singles2   C D      18.3
  Singles1   D        14.6

chg. integer constant
  Triples2   A        65.3
  Pairs2     A B      53.6
  Triples1   B        45.8
  Pairs1     B C      39.0
  Singles2   C D      25.6
  Singles1   D        19.2

chg. boolean op
  Triples2   A        78.6
  Pairs2     A B      73.2
  Triples1   A B C    60.7
  Pairs1     A B C    58.9
  Singles2   B C      46.4
  Singles1   C        39.3

Table 5. Tukey's test results for each comparison metric; numbers on the right of each sub-table are percent covered; test suites not connected by the same letter are significantly different at α = 0.05
Coverage metric           Singles2   Pairs1   Triples1
statement                     59.7    *77.8      *82.7
decision                      42.2     62.7      *69.6
condition/decision            43.5     63.0      *68.8
delete statement              31.1     45.6       46.6
force T/F                     25.4     39.7       43.1
change arithmetic op          28.4     42.4      *48.7
change comparison op          18.2     31.7      *35.2
change int constant           25.6     38.9       45.8
change boolean op             46.4     58.9       60.7

Table 6. Percent covered: Singles2 versus Pairs1 and Triples1; starred values indicate a significant difference against Singles2 at α = 0.05
Coverage metric           Triples1   Pairs2
statement                     82.7     85.0
decision                      69.6     77.5
condition/decision            68.8     78.0
delete statement              46.6    *69.2
force T/F                     43.1    *65.7
change arithmetic op          48.7     66.9
change comparison op          35.2    *55.4
change int constant           45.8     53.6
change boolean op             60.7     71.4

Table 7. Percent covered: Triples1 versus Pairs2; starred values indicate measures where Pairs2 is significantly better at α = 0.05
[Figure 4: subsumption lattice over the six test suites Singles1, Singles2, Pairs1, Pairs2, Triples1, and Triples2.]

Figure 4. Subset relationships between test suites
3.5. Threats to Validity

No experiment is perfect; it is the responsibility of the researcher to identify the threats to the validity of their research. These threats relate mainly to the size, number, and composition of the main inputs to the study.

Internal validity is threatened by the small sample size. With just 10 components, we could only conclude that the all-triples with two parameter values test suite is better than every other test suite in 3 of the 9 metrics, and we found no significant differences among the test suites for the "change boolean operator" mutant kill ratio. A larger experiment could provide further confirmation of our conclusions.

External validity is affected by the type and size of the software components we used. The set of reference components used here are collection classes, which suggests that the automated test case generation strategy works well for software components similar to those in our reference set, but it may not be representative of other kinds of software components (e.g., components that deal with I/O). The reference components themselves are small, with relatively few lines of code. Our conclusions on the effectiveness of the test suites may not be applicable to much larger software units. A larger experiment with more varied kinds of components may tell us whether our conclusions still hold for a larger population of software.

Construct validity may be threatened by the coverage metrics we use; that is, our coverage measures may not represent a measure that is related to real bugs. However, there is evidence from work by Andrews and his colleagues [1, 2] that our mutation operators are correlated with real bugs. While often criticized for not being a strong measure of test effort, our code coverage metrics are nevertheless widely used in industry.

A further threat to external validity is the practicality of applying these techniques to real-world software development practices. One of the major difficulties of our technique is time: the number of test cases we generate increases exponentially as we increase the length of the sequence and the number of parameter values. In addition, mutation testing is also time intensive. In the next section, we explore some ways of managing the cost of running and measuring the effectiveness of tests.
4. Related Work
Various automated test case generation strategies have been proposed and implemented over the years [4, 7, 8, 11, 17]. Our approach to generating unit tests extends earlier work by one of the authors [12]. The test generation strategy in this paper can be considered a form of bounded exhaustive testing [18, 22]. However, Marinov and Khurshid, as well as Sullivan et al., generate complex structures directly, while our generation strategy is more fundamentally black-box, where complex structures may only arise as a consequence of a sequence of method calls.
Our runtime checking of specifications is most accurately described as design by contract, following Meyer [19]. Our use of design-by-contract specifications as a test oracle employs the technique used in Cheon and Leavens's [8] work on JMLUnit. Other mechanisms for detecting differences between specification and implementation have been proposed, including the execution of algebraic specifications [3].
Enumerating all sequences of three method calls, while showing good coverage, is also very expensive. It would benefit the tester if we were somehow able to prune the number of test cases but still obtain the same coverage. One way to do this is to remove redundant test cases. Invoking a method call more than once on the same object state is redundant; Xie and colleagues [25] devised a framework for determining equivalent object states, which may be applied in future work to eliminate redundant test cases. Another mechanism for pruning test cases is to pre-identify invalid sequences of method calls. Cheon and Perumandla [9] devised an approach to specify allowed sequences of method calls, variations of which may be useful as a future addition to our specification language.
Measuring how thoroughly test cases exercise the software under test is a venerable problem in software testing. Zhu et al. [26] provide a solid overview of various test adequacy criteria, including code coverage and mutation coverage. Andrews and his colleagues [2] report some correspondence between high mutant kill ratios and the ability of test cases to find real bugs. We adopted the set of mutation operators reported in the work by Andrews et al. (in turn derived from the work by Offutt et al. [20]) in this research.
Alternatively, the effectiveness of test cases may be evaluated by running them against software components with bugs that appear "in the wild." For example, d'Amorim and his colleagues [10] use student-created software, and Leitner and his colleagues [17] ran their automated testing mechanism on production-level software (and uncovered heretofore unknown bugs). Evaluating test case generators in this manner can be a very convincing way of measuring the effectiveness of a test case generation strategy in general. The problem remains, however, of measuring the thoroughness of a particular test suite for a particular software component. That is, a programmer may well ask whether a test case generation strategy, while perhaps measured to be "good" for a reference set of components, is actually effective for the particular software he is actually testing.

Leitner et al. [17] also emphasized in their work on AutoTest the integration between human-written tests and automatically generated ones. Our research emphasizes the role of the evaluation of test cases in guiding the direction of manual tests.
5. Conclusion
The Sulu language and tools provide an integrated view of unit testing. It is a view in which the various stages of testing (generation, execution, and evaluation) are integrated into a single platform. It is also a view in which the testing process is integrated with the programming language, where testing issues are tackled from the design of the language up to the execution of the software written in that language.

The results of the experiment show that our approach to automated unit testing, even for the fairly basic strategy of enumerating all sequences of method calls of a certain short length, can produce high coverage scores for many test adequacy criteria. The automatically generated test suites provide a baseline set of tests that augments manual testing. Automated unit testing frees the test developer to concentrate on the more complex bugs not covered by the automated tests. This is also part of the integrated vision of Sulu, where human-written and mechanically generated tests complement each other to provide better quality software.
References
[1] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an
appropriate tool for testing experiments? In ICSE ’05: Proceedings of the 27th International Conference on Software
Engineering, pages 402–411, 2005.
[2] J. H. Andrews, L. C. Briand, Y. Labiche, and A. S. Namin.
Using mutation analysis for assessing and comparing testing
coverage criteria. IEEE Transactions on Software Engineering, 32(8):608–624, 2006.
[3] S. Antoy and D. Hamlet. Automatically checking an implementation against its formal specification. IEEE Transactions on Software Engineering, 26(1):55–69, 2000.
[4] C. Boyapati, S. Khurshid, and D. Marinov. Korat: Automated testing based on Java predicates. In ISSTA '02: Proceedings of the 2002 ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 123–133, New York, NY, USA, 2002. ACM Press.
[5] K. Bruce, L. Cardelli, G. Castagna, G. T. Leavens, and B. Pierce. On binary methods. Theory and Practice of Object Systems, 1(3):221–242, 1995.
[6] K. B. Bruce, A. Schuett, R. van Gent, and A. Fiech. PolyTOIL: A type-safe polymorphic object-oriented language. ACM Transactions on Programming Languages and Systems, 25(2):225–290, 2003.
[7] Y. Cheon. Automated random testing to detect specification-code inconsistencies. Technical Report 07-07, The University of Texas at El Paso, February 2007.
[8] Y. Cheon and G. T. Leavens. A simple and practical approach to unit testing: The JML and JUnit way. In ECOOP '02: Proceedings of the 16th European Conference on Object-Oriented Programming, pages 231–255, London, UK, 2002. Springer-Verlag.
[9] Y. Cheon and A. Perumandla. Specifying and checking method call sequences of Java programs. Software Quality Journal, 15(1):7–25, 2007.
[10] M. d'Amorim, C. Pacheco, T. Xie, D. Marinov, and M. D. Ernst. An empirical comparison of automated generation and classification techniques for object-oriented unit testing. In ASE '06: Proceedings of the 21st IEEE International Conference on Automated Software Engineering, pages 59–68, Washington, DC, USA, 2006. IEEE Computer Society.
[11] R.-K. Doong and P. G. Frankl. The ASTOOT approach to testing object-oriented programs. ACM Transactions on Software Engineering Methodology, 3(2):101–130, 1994.
[12] S. H. Edwards. Black-box testing using flowgraphs: An experimental assessment of effectiveness and automation potential. Software Testing, Verification and Reliability, 10(4):249–262, 2000.
[13] M.-C. Gaudel. Testing can be formal, too. In TAPSOFT '95: Proceedings of the 6th International Joint Conference CAAP/FASE on Theory and Practice of Software Development, pages 82–96, London, UK, 1995. Springer-Verlag.
[14] D. E. Harms and B. W. Weide. Copying and swapping: Influences on the design of reusable software components. IEEE Transactions on Software Engineering, 17(5):424–435, May 1991.
[15] K. J. Hayhurst, D. S. Veerhusen, J. J. Chilenski, and L. K. Rierson. A practical tutorial on modified condition/decision coverage. Technical report, NASA, 2001.
[16] G. T. Leavens, A. L. Baker, and C. Ruby. JML: A notation for detailed design. In H. Kilov, B. Rumpe, and I. Simmonds, editors, Behavioral Specifications of Businesses and Systems, pages 175–188. Kluwer Academic Publishers, 1999.
[17] A. Leitner, I. Ciupa, B. Meyer, and M. Howard. Reconciling manual and automated testing: The AutoTest experience. In HICSS '07: Proceedings of the 40th Annual Hawaii International Conference on System Sciences, page 261a, Washington, DC, USA, 2007. IEEE Computer Society.
[18] D. Marinov and S. Khurshid. TestEra: A novel framework for automated testing of Java programs. In ASE '01: Proceedings of the 16th IEEE International Conference on Automated Software Engineering, page 22, Washington, DC, USA, 2001. IEEE Computer Society.
[19] B. Meyer. Applying ’design by contract’. Computer,
25(10):40–51, 1992.
[20] A. J. Offutt, A. Lee, G. Rothermel, R. H. Untch, and C. Zapf.
An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering Methodology, 5(2):99–118, 1996.
[21] M. Sitaraman and B. Weide. Component-based software
using RESOLVE. SIGSOFT Software Engineering Notes,
19(4):21–22, 1994.
[22] K. Sullivan, J. Yang, D. Coppit, S. Khurshid, and D. Jackson. Software assurance by bounded exhaustive testing. In
ISSTA ’04: Proceedings of the 2004 ACM SIGSOFT international symposium on Software testing and analysis, pages
133–142, New York, NY, USA, 2004. ACM.
[23] B. W. Weide, W. F. Ogden, and M. Sitaraman. Recasting
algorithms to encourage reuse. IEEE Software, 11(5):80–
88, 1994.
[24] B. W. Weide, S. M. Pike, and R. Hinze. Why swapping?
In Proceedings of the RESOLVE Workshop 2002. Virginia
Tech, 2002.
[25] T. Xie, D. Notkin, and D. Marinov. Rostra: A framework for detecting redundant object-oriented unit tests. In Proceedings of the 19th IEEE International Conference on Automated Software Engineering (ASE 2004), pages 196–205, 2004.
[26] H. Zhu, P. A. V. Hall, and J. H. R. May. Software unit
test coverage and adequacy. ACM Computing Surveys,
29(4):366–427, 1997.