The second column of Figure 10 lists the generators used
to test each refactoring. By tailoring inputs, we were able
to reuse generators for several refactorings. For example,
DualClassFieldReference was useful for both field- and class-
targeted refactorings and found bugs in both.
The third column lists the total number of programs gen-
erated. The number is very sensitive to the way in which
the tester initializes the generator. For example, a fully-
exhaustive DualClassFieldReferenceGenerator used for En-
capsulateField produces 14,850 programs. In contrast, a
version limited to producing inheritance relationships for
PushDownField yields just 4,635.
Execution time, shown in the fourth column, is calculated
as the time needed to generate and run all tests for a par-
ticular refactoring. Performing the refactoring takes up the
vast majority of the execution time. It takes DualClass-
FieldReference just 13 seconds to generate 14,850 programs
when refactoring execution is omitted. This fact is further
corroborated by the close correspondence between compi-
lable inputs—column 5—and execution time. Our suite
tests compilable inputs at a consistent rate of about 100
per minute.
We ran our tests on a dual-processor 1.8 GHz Dell D820
laptop with 1 GB of RAM. It took just under seven
hours to run our entire suite in Eclipse. NetBeans took about
four times longer. While this time is not conducive to on-
the-fly development testing, it is certainly feasible for
an overnight or weekend build process.
MethodReference and MethodParamReference are par-
ticularly interesting for several reasons. First, they cre-
ate their method and parameter references by reusing the
ParenthesizingExpression, NestedExpression, and
ExpressionInStatement generators described in Section 4.2.1.
This demonstrates how one can adapt existing generators for
different tests. Second, it took one of us about two workdays
to write both generators as well as infrastructure needed to
run the four refactorings that they test. Together, they gen-
erate many more tests (20,760, of which 19,852 compile) than
even the most talented tester could produce by hand in the
same amount of time. Finally, by filtering in the manner
described in Section 4.2.2, these generators produced only
compilable inputs for three refactorings. In the other cases,
we relied on the compiler to reject invalid inputs so that we
did not accidentally filter out valid tests.
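The filtering idea can be approximated by a wrapper generator that discards candidate programs failing a cheap validity check before any refactoring or compilation is attempted. The sketch below is purely illustrative; the class names and predicate are ours, not ASTGen's actual API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Illustrative filtering wrapper: yields only generated programs that
// pass a validity predicate, so combinations known in advance to be
// invalid never reach the refactoring engine or the compiler.
class FilteredGenerator implements Iterable<String> {
    private final Iterable<String> base;
    private final Predicate<String> keep;

    FilteredGenerator(Iterable<String> base, Predicate<String> keep) {
        this.base = base;
        this.keep = keep;
    }

    public Iterator<String> iterator() {
        List<String> kept = new ArrayList<>();
        for (String program : base) {
            if (keep.test(program)) {
                kept.add(program);
            }
        }
        return kept.iterator();
    }
}

public class FilterDemo {
    public static void main(String[] args) {
        // Two candidates; an abstract method with a body cannot compile.
        List<String> candidates = List.of(
            "abstract class A { abstract void m(); }",
            "abstract class A { abstract void m() {} }");
        Iterable<String> filtered = new FilteredGenerator(
            candidates, p -> !p.contains("abstract void m() {}"));
        for (String program : filtered) {
            System.out.println(program);
        }
    }
}
```

The trade-off noted above applies here as well: an overly aggressive predicate can silently drop valid tests, which is why the compiler was used as the arbiter in the remaining cases.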
6.3 Oracle Evaluation
Columns 6 through 11 of Figure 10 show the oracles
exercised by each refactoring. The DoesNotCompile oracle
yielded the most bugs in Eclipse while the Differential oracle
found the most in NetBeans. The Custom-Structural oracle
found one bug for EncapsulateField. We have omitted a col-
umn for the DoesNotCrash oracle since neither refactoring
engine crashed.
In all cases, the WarningStatus oracle found only expected
precondition failures, not bugs. Note, for example, that ev-
ery compilable input generated by MethodParamReference
for the RemoveParameter refactoring failed the WarningSta-
tus oracle, just as one would expect. Had any passed, it
would have indicated a bug since one cannot remove a pa-
rameter if it is referenced in the method body. This demon-
strates the usefulness of generating inputs that are known
to fail a particular oracle.
6.4 Bugs Found
The final two columns of Figure 10 show the previously
unreported, unique bugs that ASTGen found: 9 in Eclipse
and 10 in NetBeans. Since our approach is exhaustive, the
oracles report many failures for each unique bug. This means,
for example, that we only reported one unique bug for the 40
variations caught by the Differential oracle for RenameField.
A summary of each bug and links to the Eclipse and Net-
Beans bug reports can be found at [1].
7. RELATED WORK
There is a large body of work in the area of test-input
generation. The most related to ours are grammar-based
and bounded-exhaustive testing approaches.
Grammar-based testing [14,15,19,23,25] requires the user
to describe test inputs with a grammar, and the tools then
generate a set of strings that belong to the grammar. In
1972, Purdom [23] pioneered algorithms for selecting
a minimal set of strings that achieve certain coverage criteria
for a grammar, e.g., strings that cover all terminals, all non-
terminals, or all productions. More recently, Maurer [19],
Sirer and Bershad [25], and Malloy and Power [15] devel-
oped tools for grammar-based generation that were used to
find faults in several applications. We can effectively view
grammar-based approaches as using first-order functional
programs to specify the generation. The tools interpret
these programs to generate random strings that belong to
the input grammar. Other work by Claessen and Hughes [4]
describes a random, but not grammar-based, testing frame-
work for Haskell programs.
In contrast, the approach of Lämmel and Schulte [14]
and our approach systematically generate input data. Even
more importantly, our approach allows developers to use
the full expressive power of a familiar programming language
such as Java to write imperative generators that produce
test inputs. With our approach, the programmers can freely
compose more advanced generators by reusing more basic
generators. Achieving such reusability with grammars is
fairly hard; for example, it is unclear how one could combine
in a grammar-based approach the first two generators from
Section 2 to obtain the third.
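The kind of imperative composition described above can be sketched in ordinary Java. The `Generator` interface and all class names below are illustrative assumptions, not ASTGen's real API; the point is only that one generator can consume another's output wholesale:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical minimal generator abstraction: anything iterable
// over generated program texts.
interface Generator<T> extends Iterable<T> {}

// Basic generator: enumerates a fixed set of field declarations.
class FieldDeclGenerator implements Generator<String> {
    public Iterator<String> iterator() {
        return List.of("int f;", "static int f;").iterator();
    }
}

// Composed generator: embeds each field declaration produced by the
// basic generator into each class shape, multiplying the two spaces.
class ClassWithFieldGenerator implements Generator<String> {
    private final Generator<String> fields;

    ClassWithFieldGenerator(Generator<String> fields) {
        this.fields = fields;
    }

    public Iterator<String> iterator() {
        List<String> programs = new ArrayList<>();
        for (String f : fields) {
            programs.add("class A { " + f + " }");
            programs.add("class A { " + f + " } class B extends A {}");
        }
        return programs.iterator();
    }
}

public class Compose {
    public static void main(String[] args) {
        // 2 field declarations x 2 class shapes = 4 generated programs.
        for (String p : new ClassWithFieldGenerator(new FieldDeclGenerator())) {
            System.out.println(p);
        }
    }
}
```

Because composition is just ordinary object construction, the same `FieldDeclGenerator` can be reused unchanged inside any number of compound generators, which is exactly the reuse that is hard to express when the input space is fixed in a grammar.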
Bounded-exhaustive testing [3, 12, 13, 17, 18, 26] is an ap-
proach for testing the programs exhaustively on all inputs
within the given bound. We have previously developed two
approaches, TestEra [18] and Korat [3], that can be in princi-
ple used for bounded-exhaustive generation of complex test
inputs such as Java programs. These two approaches are
declarative: they require the user to specify the constraints
that describe what the test inputs look like (as well as the
bound on the size of test inputs), and the tools then au-
tomatically search the (bounded) input space to generate
all inputs that satisfy the constraints. TestEra requires the
user to specify the constraints in a declarative language,
while Korat requires the users to specify the constraints in
an imperative language, but in both previous approaches,
the user just specifies the constraints. In contrast, the ap-
proach presented in this paper is imperative: the program-
mer specifies how the test generation should proceed. The
imperative approach makes the generation faster since no
search is necessary. Also, the imperative approach gives the
programmer more control over the generation, for exam-
ple over the order of generation. Finally, the two previous