block (including method, constructor and initializer). In addition, a
virtual root node is included in each model, corresponding to the
system-version as a whole.
Each extracted entity is described in terms of its type, name,
visibility (public, protected, private, or default if declared without
any access modifier) and non-access modifiers (static, final, syn-
chronized, volatile, transient, native, abstract, strictfp), location in
the source file, whether it belongs in the system source code or in a
library, and the version of the system in which it belongs. The
name of array types is in the form of “basetypequalified-
name.dimension”. The name of packages, named classes, inter-
faces and fields is their declared identifier. The name of methods
and constructors is in the form of “identifier(paramtype_list)”.
JDEvAn’s fact extractor assigns names to anonymous classes and
initializers as follows: for anonymous classes, “new super-
type_identifier”; for class initializers,
“{class_identifier.$initializer_number}”; for field initializers,
“{field_identifier=…}”. Finally, a fully-qualified prefix is added in
front of the names of library entities.
The edge set, E
D
, contains tuples of the form (relation, v
1
, v
2
),
where v
1
and v
2
are nodes and relation is a UML dependency be-
tween them. The supported dependency types are containment,
declaration, inheritance, interface implementation, field read/write
and method call (including this, super, and constructor call), class
creation, field data type, method return type, parameter type, and
exceptions declared and thrown by a method. The virtual root node
(corresponding to a system-version) is assumed to contain all the
package nodes declared in that system-version.
JDEvAn implements a fact extractor based on the Eclipse Java
DOM/AST model [28]. In particular, it implements an abstract
syntax tree (AST) visitor to visit the appropriate source code enti-
ties and relations as the Java program is being parsed. The ex-
tracted ground model facts are stored in a PostgreSQL relational
database (entity and relation table). Based on these ground facts, a
variety of derived facts are inferred, such as top-level type or
member type, methods declared in a class or an interface, inherit-
able and inherited fields/methods, and so on. In most cases, derived
facts are defined as database views; however, to improve
JDEvAn’s performance, some frequently used derived relations,
such as method implementations and overrides, class/interface
usage, etc., are computed and stored into the database tables at the
end of fact-extraction process.
To address the fact that PostgreSQL lacks recursive computa-
tion capabilities, which is essential to computing the transitive
closure of various relations among entities, Simon’s transitive clo-
sure algorithm [20] has been implemented as a database server-side
extension to compute, at the end of the fact-extraction process, the
transitive closure of the containment and inheritance hierarchy,
field read/write, method call, and class/interface usage relations,
which are populated in the corresponding transclosure table.
Furthermore, the number of times that a field is read/written, a
method is called, a class is created, and a class/interface is used is
recorded. These numbers are subsequently used to compute struc-
tural similarity (see algorithms 2 and 3).
4 Design Comparison with UMLDiff
Given two versions, A and B, of a software system and their corre-
sponding models - instances of the meta-model discussed in sec-
tion 3 - UMLDiff recovers the design-level structural changes that
occurred as the system evolved from A to B.
UMLDiff assumes a “principled” usage of the versioning sys-
tem: if a substantial number or a complex set of changes are made
to a particular version before it is stored back in the version-
management system, the accuracy of UMLDiff will most likely
suffer (see section 6 for a detailed discussion on this subject).
UMLDiff is a domain-specific structural-differencing algorithm,
aware of the UML semantics. As per the adopted meta-model, the
software system is modeled as a directed graph. Table 1 lists the
relations and the types of entities that they relate, which induce a
containment-spanning tree on the directed graph.
Table 1. The children of design entities
Entity type Type of children of entity
Virtual root Packages, array types (assumed to be contained)
Package Top-level classes and interfaces it contains
Class Fields, methods, constructors, initializers, inner
classes and interfaces it declares
Interface Fields, methods, inner classes and interfaces it
declares
Field Initializer it contains
Block Local classes, anonymous classes it contains
Furthermore, based on UML semantics, the containment and
declaration relationship defines a logical partial order over entity
types: virtual root > package > (class, interface) > (field, block).
Conceptually, UMLDiff traverses the containment-spanning trees
of two compared models, moving from one logical level to the next
in both trees at the same time. It starts at the virtual root (system-
version) level, progressing down to packages, classes and inter-
faces, and finally, fields and blocks. At each logical level, it identi-
fies corresponding entities of the same type as representing a single
conceptual entity in two versions of the system. UMLDiff recog-
nizes that an entity e
1
in version A and an entity e
2
in version B are
the “same”, i.e., they correspond to the same conceptual entity,
when (a) they have the same or similar name (name-similarity
heuristic), or (b) they have similar relations to other entities, al-
ready established to be the “same” (structure-similarity heuristic).
Name similarity is a “safe” indicator that e
1
and e
2
are the same
entity: it is indeed a rare phenomenon that an entity is removed and
a new entity with the same name but different behaviour is added
to the system. UMLDiff recognizes same-name entities first and
uses them as “landmarks” to subsequently recognize renamed and
moved entities. When an entity is renamed or moved, as is fre-
quently the case with refactorings aimed at improving the extensi-
bility and maintainability of the system, its relationships to other
entities, such as the members it contains, the fields it reads/writes,
the methods it calls or is called by, etc., tend to remain the same for
the most part. Therefore, by comparing the relationships of two
same-type entities renamings or moves can be inferred: if they
share “enough” relationships to known-to-be-same entities they are
the “same”, even though their names (renamed) and/or their parent
entities in the containment-spanning tree (moved) are different.
Whenever two entities are identified as renamings or moves, this
knowledge is added to the current landmarks’ set and is used later
on to further match as yet-unmatched entities. This process contin-
ues until it reaches the logical-leaf level of the spanning trees and
all possible corresponding pairs of entities have been identified.
4.1 Similarity metrics
Let us now discuss in detail the two heuristics (name-similarity and
structure-similarity) for recognizing the conceptually same entities
in the two compared system versions. These two heuristics, based
on the semantics of the object-oriented design domain, enable
UMLDiff to recognize that two entities are the “same” even after
they have been renamed and/or moved. The key to determining
56