Multi-Criteria Code Refactoring Using Search-Based Software Engineering
One of the most widely used techniques to improve the quality of existing software systems is refactoring—
the process of improving the design of existing code by changing its internal structure without altering its
external behavior. While it is important to suggest refactorings that improve the quality and structure of the
system, many other criteria are also important to consider, such as reducing the number of code changes,
preserving the semantics of the software design and not only its behavior, and maintaining consistency
with the previously applied refactorings. In this article, we propose a multi-objective search-based approach
for automating the recommendation of refactorings. The process aims at finding the optimal sequence of
refactorings that (i) improves the quality by minimizing the number of design defects, (ii) minimizes the code
changes required to fix those defects, (iii) preserves design semantics, and (iv) maximizes the consistency
with previously applied code changes. We evaluated the efficiency of our approach using a benchmark of six open-
source systems, 11 different types of refactorings (move method, move field, pull up method, pull up field, push
down method, push down field, inline class, move class, extract class, extract method, and extract interface)
and six commonly occurring design defect types (blob, spaghetti code, functional decomposition, data class,
shotgun surgery, and feature envy) through an empirical study conducted with experts. In addition, we
performed an industrial validation of our technique, with 10 software engineers, on a large project provided
by our industrial partner. We found that the proposed refactorings succeed in preserving the design coherence
of the code, with an acceptable level of code change score while reusing knowledge from recorded refactorings
applied in the past to similar contexts.
CCS Concepts: • Software and its engineering → Search-based software engineering
Additional Key Words and Phrases: Search-based software engineering, refactoring, software maintenance,
multi-objective optimization, software evolution
ACM Reference Format:
Ali Ouni, Marouane Kessentini, Houari Sahraoui, Katsuro Inoue, and Kalynmoy Deb. 2016. Multi-criteria
code refactoring using search-based software engineering: An industrial case study. ACM Trans. Softw. Eng.
Methodol. 25, 3, Article 23 (May 2016), 53 pages.
DOI: http://dx.doi.org/10.1145/2932631
This work was supported by the Ford-University of Michigan alliance Program, Japan Society for the Pro-
motion of Science, Grant-in-Aid for Scientific Research (S) (Grant No. 25220003), and by Osaka University
Program for Promoting International Joint Research.
Authors’ addresses: A. Ouni, Osaka University; email: ali@ist.osaka-u.ac.jp; M. Kessentini, University
of Michigan-Dearborn; email: marouane@umich.edu; H. Sahraoui, University of Montreal; email:
sahraouh@iro.umontreal.ca; K. Inoue, Osaka University; email: inoue@ist.osaka-u.ac.jp; K. Deb, Michigan
State University; email: kdeb@egr.msu.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2016 ACM 1049-331X/2016/05-ART23 $15.00
DOI: http://dx.doi.org/10.1145/2932631
ACM Transactions on Software Engineering and Methodology, Vol. 25, No. 3, Article 23, Publication date: May 2016.
1. INTRODUCTION
Large-scale software systems exhibit high complexity and become difficult to maintain.
In fact, it has been reported that the software cost attributable to maintenance and
evolution activities is more than 80% of total software costs [Erlikh 2000]. To facilitate
maintenance tasks, one of the most widely used techniques is refactoring, which im-
proves design structure while preserving external behavior [Mens and Tourwé 2004;
Opdyke 1992].
Even though most of the existing refactoring recommendation approaches are pow-
erful enough to suggest refactoring solutions to be applied, several issues still need to
be addressed. One of the most important issues is the semantic coherence of the refac-
tored program, which is not considered by most of the existing approaches [Ouni et al.
2012a; Du Bois et al. 2004; Moha et al. 2008; Mens and Tourwé 2004]. Consequently, the
refactored program could be syntactically correct, implement the correct behavior, but
be semantically incoherent. For example, a refactoring solution might move a method
calculateSalary() from class Employee to class Car. This refactoring could improve
the program structure by reducing the complexity and coupling of class Employee and
satisfy the pre- and post-conditions to preserve program behavior. However, having
a method calculateSalary() in class Car does not make any sense from the domain
semantics standpoint and is likely to lead to comprehension problems in the future.
Another issue is related to the number of code changes required to apply refactorings,
something that is not considered in existing refactoring approaches whose only aim
is to improve code quality independently of the cost of code changes. Consequently,
applying a particular refactoring may require a radical change in the system or even
its re-implementation from scratch. Thus, it is important to minimize code changes to
help developers in understanding the design after applying the proposed refactorings.
In addition, the use of development history can be an efficient aid when proposing refac-
torings. Code fragments that have previously been modified in the same time period
are likely to be semantically related (e.g., refer to the same feature). Furthermore, code
fragments that have been extensively refactored in the past have a high probability
of being refactored again in the future. Moreover, the code to refactor can be similar
to some refactoring patterns that are to be found in the development history; thus,
developers can easily adapt and reuse them.
One of the limitations of the existing works in software refactoring [Du Bois et al.
2004; Qayum and Heckel 2009; Fokaefs et al. 2011; Harman and Tratt 2007; Moha
et al. 2008; Seng et al. 2006] is that the definition of semantic coherence is closely
related to behavior preservation. Preserving the behavior does not mean that the
design semantics of the refactored program is also preserved. Another issue is that
the existing techniques are limited to a small number of refactorings, and thus they
cannot be generalized and adapted to an exhaustive list of refactorings. Indeed,
semantic coherence is still hard to ensure, since existing approaches do not provide a
pragmatic technique or an empirical study to establish whether the semantic coherence of
the refactored program is preserved.
In this article, we propose a multi-objective search-based approach to address the
above-mentioned limitations. The process aims at finding the sequence of refactorings
that (1) improves design quality, (2) preserves the design coherence and consistency
of the refactored program, (3) minimizes code changes, and (4) maximizes the consis-
tency with development change history. We evaluated our approach on six open-source
systems using an existing benchmark [Ouni et al. 2012a; Moha et al. 2010, 2008].
We report the results of the efficiency and effectiveness of our approach compared
to existing approaches [Harman and Tratt 2007; Kessentini et al. 2011]. In addition,
we provide an industrial validation of our approach on a large-scale project in which
the results were manually evaluated by 10 active software engineers. The study also
evaluated the relevance and usefulness of our refactoring technique in an industrial
setting.
The remainder of this article is structured as follows. Section 2 provides the necessary
background and challenges related to refactoring and code smells. Section 3 defines
refactoring recommendation as a multi-objective optimization problem, while Section 4
introduces our search-based approach to this problem using the non-dominated sorting
genetic algorithm (NSGA-II) [Deb et al. 2002]. Section 5 describes the method used
in our empirical studies and presents the obtained results, while Section 6 provides
further discussions. Section 7 presents an industrial case study along with a discussion
of the obtained results. Section 8 discusses the threats to validity and the limitations of
the proposed approach, while Section 9 describes the related work. Finally, Section 10
concludes and presents directions for future work.
—Blob: It is found in designs where much of the functionality of a system (or part of
it) is centralized in one large class, while the other related classes primarily expose
data and provide little functionality.
—Spaghetti Code: This involves a code fragment with a complex and tangled control
structure. This code smell is characteristic of procedural thinking in object-oriented
programming. Spaghetti code is revealed by classes declaring long methods with no
parameters and utilising global variables. Names of classes and methods may suggest
procedural programming. Spaghetti code does not exploit, and indeed prevents the
use of, object-oriented mechanisms such as inheritance and polymorphism.
—Functional Decomposition: This design defect consists of a main class in which inheritance
and polymorphism are hardly used, associated with small classes that declare
many private fields and implement only a few methods. This is frequently
found in code produced by inexperienced object-oriented developers.
—Data Class: It is a class that contains only data and performs no processing on
these data. It is typically composed of highly cohesive fields and accessors. However,
depending on the programming context, some Data Classes might suit their purpose perfectly
and are, therefore, not design defects.
—Shotgun Surgery: This occurs when a method has a large number of external methods
calling it, and these methods are spread over a significant number of classes. As a
result, the impact of a change in this method will be large and widespread.
—Feature Envy: It is found when a method heavily uses attributes and data from one or
more external classes, directly or via accessor operations. Furthermore, in accessing
external data, the method uses data intensively from at least one external source.
We choose these design defect types in our experiments because they are the most
important and common ones in object-oriented industrial projects based on recent
empirical studies [Ouni et al. 2012a; Moha et al. 2008; Ouni et al. 2013]. Moreover,
it is widely believed that design defects have a negative impact on software quality
that often leads to bugs and failures [Li and Shatnawi 2007; D’Ambros et al. 2010;
Deligiannis et al. 2003; Mäntylä et al. 2003]. Consequently, design defects should be
identified and corrected by the development team as early as possible for maintain-
ability and evolution considerations. For example, after detecting a blob defect, many
refactoring operations can be used to reduce the number of functionalities in a specific
class, such as move method and extract class.
In the next subsection, we discuss the different challenges related to fixing design
defects using refactoring.
fields of classes characterize the structure and behavior of the implemented domain
elements. Consequently, a program could be syntactically correct, implement the appro-
priate behavior, but violate the domain semantics if the reification of domain elements
is incorrect. During the initial design/implementation, programs usually capture well
the domain semantics when object-oriented principles are applied. However, when
these programs are (semi-)automatically modified/refactored during maintenance, the
adequacy with regards to domain semantics could be compromised. Indeed, semantic
coherence is an important issue to consider when applying refactorings.
Most of the existing approaches suggest refactorings mainly with the perspective of
only improving some design/quality metrics. As explained, this may not be sufficient.
We need to preserve the rationale behind why and how code elements are grouped and
connected when applying refactoring operations to improve code quality.
Code changes: When applying refactorings, various code changes are performed.
The amount of code changes corresponds to the number of code elements (e.g., classes,
methods, fields, relationships, and field references) modified through adding, deleting,
or moving operations. Minimizing code changes when suggesting refactorings is im-
portant to reduce the effort and help developers understand the modified/improved
design. In fact, most developers want to keep as much of the original design structure
as possible when fixing design defects [Fowler 1999]. Hence, improving software
quality and reducing code changes are conflicting goals. In some cases, correcting some design
defects requires radically changing a large portion of the system or is sometimes
equivalent to re-implementing a large part of the system. Indeed, a refactoring so-
lution that fixes all defects is not necessarily the optimal one due to the high code
adaptation/modification that may be required.
Consistency with development/maintenance history: The majority of the ex-
isting work does not consider the history of changes applied in the past when proposing
new refactoring solutions. However, the history of code changes can be helpful in in-
creasing the confidence of new refactoring recommendations. To better guide the search
process, recorded code changes applied in the past can be considered when proposing
new refactorings in similar contexts. This knowledge can be combined with structural
and textual information to improve the automation of refactoring suggestions.
2.3. Motivating Example
To illustrate some of these issues, Figure 1 shows a concrete example extracted from
JFreeChart1 v1.0.9, a well-known Java open-source charting library. We consider a
design fragment containing four classes, XYLineAndShapeRenderer, XYDotRenderer,
SegmentedTimeline, and XYSplineRenderer. Using design defect detection rules pro-
posed in our previous work [Kessentini et al. 2011], the class XYLineAndShapeRenderer
is detected as a design defect: blob (i.e., a large class that monopolizes the behavior of
a large part of the system).
We consider the scenario of a refactoring solution that consists of moving the
method drawItem() from class XYLineAndShapeRenderer to class SegmentedTimeline.
This refactoring can improve the design quality by reducing the number of functionali-
ties in this blob class. However, from the design semantics standpoint, this refactoring
is incoherent since SegmentedTimeline functionalities are related to presenting a se-
ries of values to be used for a curve axis (mainly for the Date-related axis) and not
for the task of drawing objects/items. Based on textual and structural information,
using respectively a semantic lexicon [Amaro et al. 2006] and cohesion/coupling [Ouni
et al. 2012b], many other target classes are possible, including XYDotRenderer and
XYSplineRenderer. These two classes have approximately the same structure that can
1 http://www.jfree.org/jfreechart/.
be formalized using quality metrics (e.g., number of methods and number of attributes),
and their textual similarity is close to XYLineAndShapeRenderer using a vocabulary-
based measure. Thus, moving elements among these three classes is likely to be
semantically coherent and meaningful. On the other hand, from previous versions
of JFreeChart, we recorded that there are some methods, such as drawPrimaryLine-
AsPath(), initialise(), and equals(), that have been moved from class XYLineAnd-
ShapeRenderer to class XYSplineRenderer. As a consequence, moving methods and/or
attributes from class XYLineAndShapeRenderer to class XYSplineRenderer has higher
correctness probability than moving methods or attributes to class XYDotRenderer or
SegmentedTimeline.
Based on these observations, we believe that it is important to consider additional
objectives rather than use only structural metrics to ensure quality improvement.
However, in most of the existing work, design semantics, amount of code changes, and
development history are not considered. Improving code structure, minimizing design
incoherencies, reducing code changes, and maintaining consistency with development
change history are conflicting goals. In some cases, improving the program structure
could provide a design that does not make sense semantically or could radically change
the initial design. For these reasons, an effective refactoring strategy needs to find a
compromise between all of these objectives. These observations are the motivation for
the work described in this article.
supports 11 refactoring operations, including move method, move field, pull up field,
pull up method, push down field, push down method, inline class, extract method,
extract class, move class, and extract interface (cf. Table II) [Fowler 1999], but not all
refactorings in the literature.2 We selected these refactorings because they are the most
frequently used and they are implemented in most modern Integrated
Development Environments (IDEs), such as Eclipse and NetBeans. In the following, we
describe the formal formulation of the four objectives to optimize.
Table I shows how the code change score is calculated for each refactoring operation.
As described in the table, to estimate the number of required code changes for a high-
level refactoring, our method considers the number of low-level refactoring operations
(atomic changes) needed to actually implement such a refactoring based on the Soot
tool. For instance, to move a method m from a class c1 to a class c2, the required number
of changes is calculated as follows: 1 add method with a weight wi = 1, 1 delete method
with a weight wi = 1, n redirect method call with a weight wi = 2, and n redirect field access with
a weight wi = 2, as described in Table I. Using appropriate static code analysis, Soot allows
us to easily calculate the value n by capturing the number of field references/accesses
from a method, the number of calls that should be redirected based on the call graph, the
number of return types and parameters of a method, as well as the control flow graph
of a method, and so on.
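To make the calculation concrete, the Move Method example above can be sketched as follows. This is a simplified illustration, not the authors' tool: the counts n (redirected calls and field accesses) are supplied directly rather than computed by Soot's static analysis.

```python
# Sketch: the code-change score of a high-level refactoring as the weighted
# sum of its low-level operations. The decomposition and weights for Move
# Method follow the example in the text: 1 add method (wi = 1), 1 delete
# method (wi = 1), n redirect method call (wi = 2), n redirect field
# access (wi = 2).

def change_score(low_level_ops):
    """low_level_ops: list of (count, weight) pairs for one high-level refactoring."""
    return sum(count * weight for count, weight in low_level_ops)

def move_method_score(n_calls, n_field_accesses):
    return change_score([(1, 1),                  # add method
                         (1, 1),                  # delete method
                         (n_calls, 2),            # redirect method call
                         (n_field_accesses, 2)])  # redirect field access

# Moving a method with 3 call sites and 2 external field accesses:
print(move_method_score(3, 2))  # 1 + 1 + 6 + 4 = 12
```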
3.2.3. Similarity with Recorded Code Changes. We defined the following function to calcu-
late the similarity score between a proposed refactoring operation and a recorded code
2 http://refactoring.com/catalog/.
Table I. High- and Low-Level Refactoring Operations and Their Associated Change Scores
(The table maps each of the 11 high-level refactorings to the low-level operations it
requires: create/delete/rename class, add/delete/rename field, add/delete/rename method,
add/remove parameter, and call and field-access redirection, each carrying a weight wi.
A cell holds either a fixed count, such as 1, or n when the number of operations depends
on the code, e.g., the number of call sites to redirect.)
change:
Sim_refactoring_history(RO) = Σ_{j=1}^{n} e_j,   (3)
where n is the number of recorded refactoring operations applied to the system in the
past and e j is a refactoring weight that reflects the similarity between the suggested
refactoring operation (RO) and the recorded refactoring operation j. The weight e j is
computed as follows: If the suggested and the recorded refactorings being compared are
identical, for example, Move Method between the same source and target classes, then
weight e j = 2. If the suggested and the recorded refactorings are similar, then e j = 1.
We consider two refactoring operations as similar if one of them is composed of the other
or if their implementations are similar, using equivalent controlling parameters, that is,
the same code fragments, as described in Table II. For example, some complex refactoring
operations, such as Extract Class, are composed of other refactoring operations such as
Move Method, Move Field, and Create New Class; in this case, e j = 1. Otherwise, e j = 0.
More details about the similarity scores between refactoring operations can be found
in Ouni et al. [2013].
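A minimal sketch of how Equation (3) could be computed, assuming a refactoring is represented as an (operation, source, target) tuple. The COMPOSED_OF table and the class names in the example are hypothetical stand-ins for the full composition/similarity scheme of Table II and Ouni et al. [2013].

```python
# Hypothetical composition table: which high-level operations contain which others.
COMPOSED_OF = {
    "extract class": {"move method", "move field", "create class"},
}

def e_weight(suggested, recorded):
    if suggested == recorded:  # identical: same operation, source, and target
        return 2
    op_s, op_r = suggested[0], recorded[0]
    if op_r in COMPOSED_OF.get(op_s, set()) or op_s in COMPOSED_OF.get(op_r, set()):
        return 1               # similar: one operation is composed of the other
    return 0

def sim_refactoring_history(suggested, history):
    # Equation (3): sum the weight e_j over all recorded refactorings.
    return sum(e_weight(suggested, recorded) for recorded in history)

history = [("move method", "A", "B"), ("extract class", "A", "C")]
print(sim_refactoring_history(("move method", "A", "B"), history))  # 2 + 1 = 3
```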
3.2.4. Semantics. To the best of our knowledge, there is no consensus on how to investigate
whether a refactoring preserves the design semantics of the original program. We
formulate semantic coherence using a meta-model in which we describe the concepts
from a perspective that helps in automating the refactoring recommendation task. The
aim is to provide a terminology that will be used throughout this article. Figure 3 shows
the semantic-based refactoring meta-model. The class Refactoring represents the main
entity in the meta-model. As mentioned earlier, we classify refactoring operations into
two types: low-level ROs (LLR) and high-level ROs (HLR). A LLR is an elementary/
basic program transformation for adding, removing, and renaming program elements
(e.g., Add Method, Remove Field, and Add Relationship). LLRs can be combined to
perform more complex refactoring operations (HLRs) (e.g., Move Method and Extract
Class). An HLR consists of a sequence of two or more LLRs or HLRs; for example, to
Table II. Refactoring Operations and Their Involved Actors and Roles

Move method: class (source class, target class); method (moved method)
Move field: class (source class, target class); field (moved field)
Pull up field: class (source class, target class); field (moved field)
Pull up method: class (source class, target class); method (moved method)
Push down field: class (source class, target class); field (moved field)
Push down method: class (source class, target class); method (moved method)
Inline class: class (source class, target class)
Extract method: class (source class, target class); method (source method, new method); statement (moved statements)
Extract class: class (source class, new class); field (moved fields); method (moved methods)
Move class: package (source package, target package); class (moved class)
Extract interface: class (source classes, new interface); field (moved fields); method (moved methods)
perform Extract Class we need to Create New Empty Class and apply a set of Move
Method and Move Field operations.
To apply a refactoring operation we need to specify which actors, that is, code frag-
ments, are involved in this refactoring and which roles they play when performing the
refactoring operation. As illustrated in Figure 3, an actor can be a package, class, field,
method, parameter, statement, or variable. In Table II, we specify for each refactoring
operation the involved actors and their roles.
3.3. Design Coherence Measures
3.3.1. Vocabulary-based Similarity (VS). This kind of similarity is interesting to consider
when moving methods, fields, or classes. For example, when a method has to be moved
from one class to another, the refactoring would make sense if both actors (source class
and target class) use similar vocabularies [Ouni et al. 2012b]. The vocabulary could be
used as an indicator of the semantic/textual similarity between different actors that
are involved when performing a refactoring operation. We start from the assumption
that the vocabulary of an actor is borrowed from the domain terminology and therefore
can be used to determine which part of the domain semantics an actor encodes. Thus,
two actors are likely to be semantically similar if they use similar vocabularies.
The vocabulary can be extracted from the names of methods, fields, variables, pa-
rameters, types, and so on. Tokenisation is performed using the Camel Case Splitter
[Corazza et al. 2012], which is one of the most used techniques in Software Mainte-
nance tools for the preprocessing of identifiers. A more pertinent vocabulary can also be
extracted from comments, commit information, and documentation. We calculate the
semantic similarity between actors using an information retrieval-based technique,
namely cosine similarity, as shown in Equation (4). Each actor is represented as an
n-dimensional vector, where each dimension corresponds to a vocabulary term. The
cosine of the angle between two vectors is considered an indicator of similarity. Using
cosine similarity, the conceptual similarity between two actors c1 and c2 is determined
as follows:
Sim(c1 , c2 ) = Cos(c1 , c2 ) = (c1 · c2 ) / (‖c1 ‖ × ‖c2 ‖)
             = ( Σ_{i=1}^{n} wi,1 × wi,2 ) / ( √(Σ_{i=1}^{n} wi,1²) × √(Σ_{i=1}^{n} wi,2²) ),   (4)
where c1 = (w1,1 , . . . , wn,1 ) is the term vector corresponding to actor c1 and c2 =
(w1,2 , . . . , wn,2 ) is the term vector corresponding to c2 . The weights wi, j can be computed
using information retrieval-based techniques such as the Term Frequency–Inverse
Document Frequency (TF-IDF) method. We used a method similar to that described in
Hamdi [2011] to determine the vocabulary and represent the actors as term vectors.
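As an illustration, the vocabulary-based similarity can be sketched as follows. This is not the authors' implementation: identifiers are split with a simple camel-case regular expression and weighted by raw term frequency rather than TF-IDF, and the identifier lists in the example are hypothetical.

```python
import math
import re

def camel_split(identifier):
    # Split a camel-case identifier into lowercase terms, e.g. "drawItem" -> ["draw", "item"].
    return [t.lower() for t in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)]

def term_vector(identifiers):
    # Build a term-frequency vector over the vocabulary of an actor.
    vec = {}
    for ident in identifiers:
        for term in camel_split(ident):
            vec[term] = vec.get(term, 0) + 1
    return vec

def cosine(v1, v2):
    # Equation (4): cosine of the angle between two term vectors.
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    norm = math.sqrt(sum(w * w for w in v1.values())) * math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

# Vocabularies of two hypothetical renderer classes:
c1 = term_vector(["drawItem", "drawPrimaryLine", "getItemCount"])
c2 = term_vector(["drawItem", "drawSecondaryLine"])
print(round(cosine(c1, c2), 3))
```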
3.3.2. Dependency-Based Similarity (DS). We approximate domain semantics closeness
between actors starting from their mutual dependencies. The intuition is that actors
that are strongly connected (i.e., having dependency links) are semantically related. As
a consequence, refactoring operations requiring semantic closeness between involved
actors are likely to be successful when these actors are strongly connected. We consider
two types of dependency links, using the Jaccard similarity coefficient [Jaccard 1901] to
compute the similarity:
—Shared Field Access (SFA) that can be calculated by capturing all field references
that occur using static analysis to identify dependencies based on field accesses
(read or modify). We assume that two software elements are semantically related
if they read or modify the same fields. The rate of shared fields (read or modified)
between two actors c1 and c2 is calculated according to Equation (5). In this equation,
fieldRW(ci ) computes the set of fields that may be read or modified by each
method of the actor ci . Note that only direct field access is considered (indirect field
accesses through other methods are not taken into account). By applying a suitable
static program analysis to the whole method body, all field references that occur can
be easily computed,
sharedFieldsRW(c1 , c2 ) = | fieldRW(c1 ) ∩ fieldRW(c2 ) | / | fieldRW(c1 ) ∪ fieldRW(c2 ) |.   (5)
—Shared Method Calls (SMC) that can be captured from call graphs derived from
the whole program using CHA (Class Hierarchy Analysis) [Vallée-Rai et al. 2000].
A call graph is a directed graph that represents the different calls (call in and call
out) among all methods of the entire program. Nodes represent methods, and edges
represent calls between these methods. CHA is a basic call graph analysis that considers
class hierarchy information; for example, for a call c.m(...), it assumes that any m(...)
declared in a subtype (or, in some cases, a supertype) of the declared type of c is
reachable. For a pair of actors, shared calls are captured through this graph by identifying
shared neighbours of nodes related to each actor. We consider both shared call-out
and shared call-in. Equations (6) and (7) are used to measure, respectively, the shared
call-out and the shared call-in between two actors c1 and c2 (two classes, for example).

sharedCallOut(c1 , c2 ) = | callOut(c1 ) ∩ callOut(c2 ) | / | callOut(c1 ) ∪ callOut(c2 ) |,   (6)

sharedCallIn(c1 , c2 ) = | callIn(c1 ) ∩ callIn(c2 ) | / | callIn(c1 ) ∪ callIn(c2 ) |.   (7)
A shared method call is defined as the average of shared call-in and call-out.
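The dependency-based measures above reduce to Jaccard similarity over sets. A minimal sketch, assuming the field-access and call sets (extracted by Soot and the CHA call graph in the actual approach) are already available as plain sets; the example sets are hypothetical.

```python
def jaccard(s1, s2):
    # |intersection| / |union|, as in Equations (5), (6), and (7).
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def shared_method_calls(call_out1, call_out2, call_in1, call_in2):
    # SMC: the average of shared call-out and shared call-in.
    return (jaccard(call_out1, call_out2) + jaccard(call_in1, call_in2)) / 2

# Hypothetical field-access and call sets for two classes:
fields_c1, fields_c2 = {"x", "y", "scale"}, {"x", "scale", "color"}
print(jaccard(fields_c1, fields_c2))  # sharedFieldsRW = 2/4 = 0.5
print(shared_method_calls({"draw", "init"}, {"draw"}, {"update"}, {"update"}))
```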
3.3.3. Implementation-Based Similarity (IS). For some refactorings like Pull Up Method,
methods having similar implementations in all subclasses of a super class should be
moved to the super class [Fowler 1999]. The implementation similarity of the methods
in the subclasses is investigated at two levels: the signature level and the body level.
To compare the signatures of methods, a semantic comparison algorithm is applied.
It considers the methods names, the parameter lists, and return types. Let Sig(mi ) be
the signature of method mi . The signature similarity for two methods m1 and m2 is
computed as follows:
Sig sim(m1 , m2 ) = | Sig(m1 ) ∩ Sig(m2 ) | / | Sig(m1 ) ∪ Sig(m2 ) |.   (8)
To compare method bodies, we use Soot [Vallée-Rai et al. 2000], a Java optimization
framework, that compares the statements in the body, the used local variables, the ex-
ceptions handled, the call-outs, and the field references. Let Body(m) (set of statements,
local variables, exceptions, call-outs, and field references) be the body of method m. The
body similarity for two methods m1 and m2 is computed as follows:
Body sim(m1 , m2 ) = | Body(m1 ) ∩ Body(m2 ) | / | Body(m1 ) ∪ Body(m2 ) |.   (9)
The implementation similarity between two methods is the average of their Sig Sim
and Body Sim values.
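A sketch of Equations (8) and (9): signatures and bodies are treated as sets of elements (name tokens, parameter and return types; statements, locals, call-outs, field references) and compared with Jaccard similarity. In the approach these sets are extracted with Soot; here they are given directly, and the element encodings are hypothetical.

```python
def jaccard(s1, s2):
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def implementation_similarity(sig1, sig2, body1, body2):
    # Average of signature similarity (Eq. 8) and body similarity (Eq. 9).
    return (jaccard(sig1, sig2) + jaccard(body1, body2)) / 2

# Hypothetical signature and body element sets for two methods:
sig_m1 = {"draw", "Graphics2D", "void"}
sig_m2 = {"draw", "Graphics2D", "int", "void"}
body_m1 = {"stmt:paint", "local:g", "call:getPaint"}
body_m2 = {"stmt:paint", "local:g", "call:getStroke"}
print(implementation_similarity(sig_m1, sig_m2, body_m1, body_m2))  # (0.75 + 0.5) / 2
```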
3.3.4. Feature Inheritance Usefulness (FIU). This factor is useful when applying the Push
Down Method and Push Down Field operations. In general, when a method or field is
used by only a few subclasses of a super class, it is better to move it, that is, push it down,
from the super class to the subclasses using it [Fowler 1999]. To do this for a method,
we need to assess the usefulness of the method in the subclasses in which it appears.
We use a call graph and consider polymorphic calls derived using XTA (Separate Type
Analysis) [Tip and Palsberg 2000]. XTA is more precise than CHA by giving a more
local view of what types are available. We are using Soot [Vallée-Rai et al. 2000] as a
standalone tool to implement and test all the program analysis techniques required in
our approach. The inheritance usefulness of a method is given by Equation (10):
FIU(m, c) = 1 − (Σ_{i=1}^{n} call(m, i)) / n,   (10)
where n is the number of subclasses of the superclass c, m is the method to be pushed
down, and call is a function that returns 1 if m is used (called) in the subclass i, and 0
otherwise.
For the refactoring operation Push Down Field, a suitable field reference analysis is
used. The inheritance usefulness of a field is given by Equation (11):
FIU(f, c) = 1 − (Σ_{i=1}^{n} use(f, ci)) / n,   (11)
where n is the number of subclasses of the superclass c, f is the field to be pushed
down, and use is a function that returns 1 if f is used (read or modified) in the subclass
ci, and 0 otherwise.
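Both FIU variants share the same shape: one minus the fraction of subclasses that use the member. A small sketch (the usage flags would come from the XTA call graph or the field-reference analysis, not be hard-coded as here):

```python
def fiu(uses_in_subclass):
    """Feature Inheritance Usefulness (Eqs. (10) and (11)).
    uses_in_subclass: one 0/1 flag per subclass of the superclass c,
    where 1 means the method (or field) is used in that subclass."""
    n = len(uses_in_subclass)
    return 1 - sum(uses_in_subclass) / n

# A method used in only 1 of 4 subclasses is a strong Push Down candidate:
print(fiu([1, 0, 0, 0]))  # 0.75
```

The closer FIU is to 1, the fewer subclasses use the member, and the more pushing it down is justified.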
3.3.5. Cohesion-Based Dependency (CD). We use a cohesion-based dependency measure
for the Extract Class refactoring operation. The cohesion metric is typically one of
the important metrics used to identify and fix design defects [Moha et al. 2010, 2008;
Marinescu 2004; Bavota et al. 2011; Tsantalis and Chatzigeorgiou 2011]. However,
the cohesion-based similarity that we propose for code refactoring, in particular when
applying extract class refactoring, is defined to find a cohesive set of methods and
attributes to be moved to the newly extracted class. A new class can be extracted from
a source class by moving a set of strongly related (cohesive) fields and methods from
the original class to the new class. Extracting this set will improve the cohesion of
the original class and minimize the coupling with the new class. Applying the Extract
Class refactoring operation on a specific class will result in this class being split into
two classes. We need to calculate the semantic similarity between the elements in the
original class to decide how to split the original class into two classes.
We use vocabulary-based similarity and dependency-based similarity to find the
cohesive set of actors (methods and fields). Consider a source class that contains n
methods {m1 , . . . mn} and m fields { f1 , . . . fm}. We calculate the similarity between each
pair of elements (method-field and method-method) in a cohesion matrix as shown in
Table III.
The cohesion matrix is obtained as follows: For the method-method similarity, we
consider both vocabulary- and dependency-based similarity. For the method-field simi-
larity, if the method mi may access (read or write) the field f j , then the similarity value
23:14 A. Ouni et al.
is 1. Otherwise, the similarity value is 0. The column “Average” contains the average
of the similarity values in each row. The suitable set of methods and fields to be moved to
a new class is obtained as follows: we consider the row with the highest average value
and construct a set consisting of the elements in this row whose similarity value is
higher than a threshold of 0.5. We found this threshold value with a trial-and-error
strategy, after executing our similarity measure more than 30 times.
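This selection step can be sketched as follows (an illustrative sketch; the cohesion matrix values and element names are hypothetical, and in practice the matrix is computed from the vocabulary- and dependency-based similarities):

```python
def extract_class_candidates(matrix, labels, threshold=0.5):
    """Pick the cohesive set to move to a new class: take the row with the
    highest average similarity and keep that row's element together with all
    elements whose similarity to it exceeds the threshold (0.5 in our setting)."""
    averages = [sum(row) / len(row) for row in matrix]
    best = max(range(len(matrix)), key=lambda i: averages[i])
    members = {labels[best]}
    members |= {labels[j] for j, s in enumerate(matrix[best]) if s > threshold}
    return members

# Hypothetical 3x3 cohesion matrix over two methods and one field:
matrix = [[1.0, 0.9, 0.1],
          [0.9, 1.0, 0.2],
          [0.1, 0.2, 1.0]]
print(sorted(extract_class_candidates(matrix, ["m1", "m2", "f1"])))  # ['m1', 'm2']
```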
Our decision to use such a lightweight technique is driven by computational complexity,
since heavy and complex techniques might slow down the whole search process. While
cohesion is one of the strongest metrics already used in related work [Fokaefs et al. 2011;
Bavota et al. 2014a, 2011, 2014b; Fokaefs et al. 2012; Bavota et al. 2010] for identifying
Extract Class refactoring opportunities, we plan to combine it with a coupling
metric in order to reduce the coupling between the extracted class and the original one.
4. NSGA-II FOR SOFTWARE REFACTORING
This section is dedicated to describing how we encoded the problem of finding a good
refactoring sequence as an optimization problem using the non-dominated sorting
genetic algorithm NSGA-II [Deb et al. 2002].
4.1. NSGA-II Overview
One of the most powerful multi-objective search techniques is NSGA-II [Deb et al.
2002], which has shown good performance in solving several software engineering
problems [Harman et al. 2012].
A high-level view of NSGA-II is depicted in Algorithm 1. NSGA-II starts by randomly
creating an initial population P0 of individuals encoded using a specific representation
(line 1). Then, a child population Q0 is generated from the population of parents P0
(line 2) using genetic operators (crossover and mutation). Both populations are merged
into a combined population R0 of size 2N (line 5). Fast non-dominated sorting [Deb et al.
2002] is the technique used by NSGA-II to classify individual solutions into different
dominance levels (line 6). The concept of non-dominance consists of comparing
each solution x with every other solution in the population until it is dominated (or
not) by one of them. According to Pareto optimality: “A solution x1 is said to dominate
another solution x2 , if x1 is no worse than x2 in all objectives and x1 is strictly better
than x2 in at least one objective.” Formally, if we consider a set of objectives fi, i ∈ 1..n,
to maximize, a solution x1 dominates x2:
iff ∀i, fi(x2) ≤ fi(x1) and ∃j | fj(x2) < fj(x1).
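For maximized objectives, this dominance test translates directly into code (a minimal sketch, with solutions represented simply as tuples of objective values):

```python
def dominates(x1, x2):
    """Pareto dominance for maximization: x1 dominates x2 iff x1 is no worse
    than x2 on every objective and strictly better on at least one."""
    return all(a >= b for a, b in zip(x1, x2)) and \
           any(a > b for a, b in zip(x1, x2))

print(dominates((0.8, 0.5), (0.6, 0.5)))  # True: no worse anywhere, better on f1
print(dominates((0.8, 0.4), (0.6, 0.5)))  # False: the two solutions are incomparable
```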
The whole merged population, which contains 2N individuals (solutions), is sorted using
the dominance principle into several fronts (line 6). Solutions on the first Pareto front F0
are assigned dominance level 0. Then, after taking these solutions out, fast non-dominated
sorting calculates the Pareto front F1 of the remaining population; solutions
on this second front are assigned dominance level 1, and so on. The dominance level
becomes the basis of selection of individual solutions for the next generation. Fronts are
added successively until the parent population Pt+1 is filled with N solutions (line 8).
When NSGA-II has to cut off a front Fi and select a subset of individual solutions
with the same dominance level, it relies on the crowding distance [Deb et al. 2002]
to make the selection (line 9). This parameter is used to promote diversity within the
population. The crowding distance of a non-dominated solution provides an
estimate of the density of solutions surrounding it in the population. It is computed as
the size of the largest cuboid enclosing the solution without including any other point.
Hence, the crowding distance mechanism ensures the selection of diversified solutions
having the same dominance level. The front Fi to be split is sorted in descending
order (line 13), and the first (N − |Pt+1|) elements of Fi are chosen (line 14). Then a
new population Qt+1 is created using selection, crossover, and mutation (line 15). This
process will be repeated until reaching the last iteration according to stop criteria
(line 4).
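The crowding distance that drives this truncation step can be sketched as follows (a sketch of the standard computation from Deb et al. [2002], with each solution reduced to its tuple of objective values):

```python
def crowding_distance(front):
    """Crowding distance of a front: for each objective, sort the front and
    accumulate each interior solution's normalized neighbor-to-neighbor span;
    boundary solutions get infinite distance so they are always preferred."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        dist[order[0]] = dist[order[-1]] = float("inf")
        span = front[order[-1]][k] - front[order[0]][k] or 1.0  # avoid /0
        for prev, cur, nxt in zip(order, order[1:], order[2:]):
            dist[cur] += (front[nxt][k] - front[prev][k]) / span
    return dist

# On a 3-solution front, only the middle solution gets a finite distance:
print(crowding_distance([(1, 5), (2, 4), (3, 3)]))  # [inf, 2.0, inf]
```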
the population using the dominance principle which classifies individual solutions into
different dominance levels. Then, to construct a new offspring population Qt+1 , NSGA-II
uses a comparison operator based on a calculation of the crowding distance [Deb et al.
2002] to select potential individuals having the same dominance level.
4.2.4. Genetic Operators. To better explore the search space, we define crossover and
mutation operators.
For crossover, we use a single, random, cut-point crossover. It starts by selecting two
parent solutions and splitting each at a random position. Crossover then creates two
child solutions: the first child combines the first part of the first parent with the second
part of the second parent, and the second child combines the first part of the second
parent with the second part of the first parent. This operator must ensure that the
length limits are respected, by randomly eliminating some refactoring operations if
necessary. As illustrated in Figure 5, crossover splits the parent solutions at position
i = 3 within their representative vectors in order to generate new child solutions. Each
child combines some of the refactoring operations of the first parent with some from
the second parent. In any given generation, each solution is a parent in at most one
crossover operation.
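A sketch of this operator on sequence-encoded solutions (the refactoring names are illustrative; real solutions are vectors of parameterized refactoring operations):

```python
import random

def cut_point_crossover(parent1, parent2, max_len):
    """Single random cut-point crossover: swap the tails of the two parents
    at a shared cut point, then enforce the length limit by randomly
    dropping operations from oversized children."""
    i = random.randint(1, min(len(parent1), len(parent2)) - 1)
    child1 = parent1[:i] + parent2[i:]
    child2 = parent2[:i] + parent1[i:]
    for child in (child1, child2):
        while len(child) > max_len:
            child.pop(random.randrange(len(child)))
    return child1, child2

p1 = ["MoveMethod", "MoveField", "ExtractClass", "InlineClass"]
p2 = ["PullUpMethod", "PushDownField", "ExtractMethod"]
c1, c2 = cut_point_crossover(p1, p2, max_len=10)
```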
The mutation operator randomly picks one or more operations from a sequence and
replaces them with others from the initial list of possible refactorings. An example is
shown in Figure 6, where the mutation operator is applied at two random positions,
modifying the third and fifth dimensions of the vector (j = 3 and k = 5).
After applying the genetic operators (mutation and crossover), we verify the feasibility
of the generated sequence of refactorings by checking the pre- and post-conditions.
Each refactoring operation that is not feasible due to unsatisfied preconditions will
be removed from the generated refactoring sequence. The new sequence is considered
valid in our NSGA-II adaptation if the number of rejected refactorings is less than 5%
of the total sequence size; we found this threshold value by trial and error after several
executions of our algorithm. The rejected refactorings are no longer considered in the
solution.
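Mutation and the post-operator validity check can be sketched together (the refactoring pool and the precondition checker are stand-ins; the real checker evaluates the pre- and post-conditions of the refactoring types in Table II):

```python
import random

def mutate(solution, refactoring_pool, n_points=2):
    """Replace operations at randomly chosen positions with operations drawn
    from the initial list of possible refactorings (two positions in Figure 6)."""
    mutant = list(solution)
    for j in random.sample(range(len(mutant)), n_points):
        mutant[j] = random.choice(refactoring_pool)
    return mutant

def repair(sequence, precondition_ok, max_rejected_ratio=0.05):
    """Drop operations whose preconditions fail; the sequence is considered
    valid only if fewer than 5% of its operations were rejected."""
    kept = [op for op in sequence if precondition_ok(op)]
    rejected = len(sequence) - len(kept)
    return kept, rejected / len(sequence) < max_rejected_ratio
```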
5. VALIDATION AND EXPERIMENTATION DESIGN
In order to evaluate the feasibility and the efficiency of our approach for generating
good refactoring suggestions, we conducted an experiment based on different versions of
open-source systems. We start by presenting our research questions. Then, we describe
and discuss the obtained results. All experimentation materials are available online.3
5.1. Research Questions
In our study, we assess the performance of our refactoring approach by determining
whether it can generate meaningful sequences of refactorings that fix design defects
while minimizing the number of code changes, preserving the semantics of the design,
and reusing, as much as possible, a base of recorded refactoring operations applied
in the past in similar contexts. Our study aims at addressing the research questions
outlined below.
The first four research questions evaluate the ability of our proposal to find a com-
promise between the four considered objectives that can lead to good refactoring rec-
ommendation solutions.
—RQ1.1: To what extent can the proposed approach fix different types of design
defects?
—RQ1.2: To what extent does the proposed approach preserve design semantics when
fixing defects?
—RQ1.3: To what extent can the proposed approach minimize code changes when
fixing defects?
—RQ1.4: To what extent can the use of previously applied refactorings improve the
effectiveness of the proposed refactorings?
—RQ2: How does the proposed multi-objective approach based on NSGA-II perform
compared to other existing search-based refactoring approaches and other search
algorithms?
—RQ3: How does the proposed approach perform compared to existing approaches not
based on heuristic search?
—RQ4: Is our multi-objective refactoring approach useful for software engineers in a
real-world setting?
To answer RQ1.1, we validate the proposed refactoring operations to fix design
defects by calculating the defect correction ratio (DCR) on a benchmark composed
of six open-source systems. DCR is given by Equation (1), which corresponds to the
complement of the ratio of the number of design defects after refactoring (detected
using bad smells detection rules) over the total number of defects that are detected
before refactoring.
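Under that description, DCR reduces to a one-liner (a sketch of our reading of the prose; the defect counts come from the bad-smell detection rules):

```python
def dcr(defects_before, defects_after):
    """Defect Correction Ratio: the complement of the ratio of defects still
    detected after refactoring over the defects detected before refactoring."""
    return 1 - defects_after / defects_before

# e.g., if 16 of 64 detected defects remain after refactoring:
print(dcr(64, 16))  # 0.75
```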
To answer RQ1.2, we use two different validation methods: manual validation and
automatic validation to evaluate the efficiency of the proposed refactorings. For the
manual validation, we asked groups of potential users of our refactoring tool to eval-
uate, manually, whether the suggested refactorings are feasible and make sense se-
mantically. We define the metric “refactoring precision” (RP), which corresponds to the
number of meaningful refactoring operations (low level and high level), in terms of
3 http://www-personal.umd.umich.edu/∼marouane/tosemref.html.
use heuristic search techniques in terms of DCR, change score, and RP. The current
version of JDeodorant [Fokaefs et al. 2012] is implemented as an Eclipse plug-in that
identifies some types of design defects using quality metrics and then proposes a list
of refactoring strategies to fix them.
To answer RQ4, we asked six software engineers (two groups of three developers
each) to manually refactor some of the design defects and then compared the results
with those proposed by our tool. We thus define the following precision metric:
Precision = |Rt ∩ Rm| / |Rm| ∈ [0, 1],   (16)
where Rt is the set of refactorings suggested by our tool and Rm is the set of refactorings
suggested manually by software engineers. We calculated an exact matching score
when comparing the parameters (i.e., actors, as described in Table II) of the refactorings
suggested by our approach with those identified by the developers. However, we do not
consider the order of the parameters in the comparison.
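The order-insensitive matching can be sketched by keying each refactoring on its type plus a set of its parameters (the refactoring encodings below are hypothetical):

```python
def precision(tool_refactorings, manual_refactorings):
    """Eq. (16): fraction of the manually suggested refactorings also suggested
    by the tool, matching parameters exactly but ignoring their order."""
    def key(refactoring):
        kind, params = refactoring
        return (kind, frozenset(params))
    rt = {key(r) for r in tool_refactorings}
    rm = {key(r) for r in manual_refactorings}
    return len(rt & rm) / len(rm)

tool = [("MoveMethod", ["draw", "Shape", "Canvas"])]
manual = [("MoveMethod", ["Shape", "draw", "Canvas"]),   # same actors, other order
          ("ExtractClass", ["Blob", "NewClass"])]
print(precision(tool, manual))  # 0.5
```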
5.2. Experimental Setting and Instrumentation
The goal of the study is to evaluate the usefulness and the effectiveness of our refactor-
ing tool in practice. We conducted an evaluation with potential users of our tool. Thus,
refactoring operations should not only remove design defects but also be meaningful
from a developer’s point of view.
5.2.1. Subjects. Our study involved a total of 24 subjects divided into 8 groups
(3 subjects each). All the subjects are volunteers familiar with Java development; their
experience with Java programming ranged from 2 to 15 years. The participants who
evaluated the open-source systems have good knowledge of these systems and had
carried out similar experiments on the same systems in the past. We also selected the
groups based on their familiarity with the studied systems.
The first six groups are drawn from several diverse affiliations: the University of
Michigan (USA), University of Montreal (Canada), Missouri University of Science and
Technology (USA), University of Sousse (Tunisia), and a software development and
web design company. The groups include four undergraduate students, seven master
students, eight Ph.D. students, one faculty member, and four junior software developers.
Three of the master students also work at General Motors as senior software
engineers. All subjects were familiar with the practice of refactoring.
5.2.2. Systems Studied and Data Collection. We applied our approach to a set of six
well-known and well-commented industrial open-source Java projects: Xerces-J,4
JFreeChart,5 GanttProject,6 Apache Ant,7 JHotDraw,8 and Rhino.9 Xerces-J is a fam-
ily of software packages for parsing XML. JFreeChart is a powerful and flexible Java
library for generating charts. GanttProject is a cross-platform tool for project schedul-
ing. Apache Ant is a build tool and library specifically conceived for Java applications.
JHotDraw is a GUI framework for drawing editors. Finally, Rhino is a JavaScript inter-
preter and compiler written in Java and developed for the Mozilla/Firefox browser. We
selected these systems for our validation because they are medium- to large-sized
open-source projects that have been actively developed since the mid-2000s,
4 http://xerces.apache.org/xerces-j/.
5 http://www.jfree.org/jfreechart/.
6 www.ganttproject.biz.
7 http://ant.apache.org/.
8 http://www.jhotdraw.org/.
9 http://www.mozilla.org/rhino/.
and their design has not caused a slowdown in their development. Table V
provides some descriptive statistics about these six programs.
To collect the refactorings applied in previous program versions, and the expected
refactorings applied to the next version of the studied systems, we used Ref-Finder
[Prete et al. 2010]. Ref-Finder, implemented as an Eclipse plug-in, can identify refactoring
operations applied between two releases of a software system. Table VI reports the
analyzed versions and the number of refactoring operations identified by Ref-Finder
between each subsequent pair of analyzed versions, after manual validation. In our
study, we consider only the refactoring types described in Table II.
5.2.3. Scenarios. We designed the study to answer our research questions. Our exper-
imental study consists of two main scenarios: (1) the first scenario is to evaluate the
quality of the suggested refactoring solutions with potential users (RQ1-3), and (2) the
second scenario is to fix manually a set of design defects and compare the manual
results with those proposed by our tool (RQ4). All the recommended refactorings are
executed using the Eclipse platform.
All the software engineers who accepted an invitation to participate in the study
received a questionnaire, a written guide to help them fill in the questionnaire, and the
source code of the studied systems, in order to evaluate the relevance of the suggested
refactorings. The questionnaire is organized in an Excel file with hyperlinks for easily
visualizing the source code of the affected code elements. The participants were able
to edit and navigate the code through Eclipse.
Scenario 1: The groups of subjects were invited to fill in a questionnaire aimed at
evaluating our suggested refactorings. The questionnaires rely on a 4-point Likert scale
[Likert 1932], in which we offered a choice of pre-coded responses for every question,
with no “neutral” option. Thereafter, we assigned to each group a set of refactoring solutions
suggested by our tool to evaluate manually. The participants were able to edit and
navigate the code through the Eclipse IDE. Table VII describes the set of refactoring
solutions to be evaluated for each studied system in order to answer our research
questions. We have three multi-objective algorithms to be tested for the refactoring
suggestion task: NSGA-II [Deb et al. 2002], MOGA [Fonseca et al. 1993], and Random
Search (RS) [Zitzler and Thiele 1998]. Moreover, we compared our results with a mono-
objective GA to assess the need for a multi-objective formulation. In addition, two
Table VII. Refactoring Solutions for Each Studied System Considering Each Objective: Quality (Q),
Semantic Coherence (S), Code Changes (CC), Recorded Refactorings (RR), CBO (Coupling
between Objects), and SDMPC (Standard Deviation of Methods Per Class)

Ref. Solution   Algorithm/Approach     # Objective Functions   Objectives Considered
Solution 1      NSGA-II                4                       Q, S, CC, RR
Solution 2      MOGA                   4                       Q, S, CC, RR
Solution 3      Random Search (RS)     4                       Q, S, CC, RR
Solution 4      Genetic Algorithm      1                       Q + S + CC + RR
Solution 5      Kessentini et al.      1                       Q
Solution 6      Harman et al.          2                       CBO, SDMPC
refactoring solutions from both state-of-the-art works (Kessentini et al. [2011] and Harman
and Tratt [2007]) are empirically evaluated in order to compare them to our approach
in terms of design coherence.
As shown in Table VII, for each system, 6 refactoring solutions have to be evaluated.
Due to the large number of refactoring operations to be evaluated (36 solutions in
total, each consisting of a large set of refactoring operations), we pick at random
a sample of 10 sequential refactorings per solution to be evaluated in our study. In
Table VIII, we summarize how we divided the subjects into groups in order to cover the
evaluation of all refactoring solutions. In addition, as illustrated in Table VIII, we use
cross-validation in the first scenario to reduce the impact of the subjects (groups A–F)
on the evaluation. Each subject evaluates different refactoring solutions for three
different systems.
Subjects (groups A–F) were aware that they would evaluate the design coherence
of refactoring operations, but did not know the particular research questions of the
experiment (the algorithms used, the different objectives, and their combinations).
Consequently, each group of subjects who agreed to participate in the study received a
questionnaire, a written guide to help them fill in the questionnaire, and the source
code of the studied systems in order to evaluate six solutions (10 refactorings per solution).
The questionnaire is organized within a spreadsheet with hyperlinks for easily
visualizing the source code of the affected code elements. Subjects were invited to select,
for each refactoring operation, one of the following options: “Yes” (coherent change), “No”
(non-coherent change), or “Maybe” (if not sure). All of the study material is available
online.3 Since the application of refactorings to fix design defects is a subjective
process, it is normal that not all programmers have the same opinion. In our case,
we considered the majority of votes to determine whether a suggested refactoring is
correct.
Scenario 2: The aim of this scenario is to compare the refactorings suggested by our
tool for fixing design defects with manual refactorings identified by developers. We
asked two groups of subjects (groups G and H) to fix a set of 72 design defect instances
randomly selected from the subject systems (12 defects per system), covering all six
defect types considered. Then we compared the sequences of refactorings that they
suggested manually with those proposed by our approach. The more similar our
refactorings are to the manual ones, the more useful and efficient our tool is assessed
to be in practice.
5.2.4. Algorithms Configuration. In our experiments, we use and compare different mono-
and multi-objective algorithms. For each algorithm, to generate an initial population,
we start by defining the maximum vector length (maximum number of operations
per solution). The vector length is proportional to the number of refactorings that are
considered, the size of the program to be refactored, and the number of detected design
defects. A higher number of operations in a solution does not necessarily mean that
the results will be better. Ideally, a small number of operations should be sufficient to
provide a good tradeoff between the fitness functions. This parameter can be specified
by the user or derived randomly from the sizes of the program and the employed
refactoring list. During the creation, the solutions have random sizes inside the allowed
range. For all algorithms (NSGA-II, MOGA, RS, and GA), we fixed the maximum vector
length to 700 refactorings, the population size to 200 individuals (refactoring
solutions), and the maximum number of iterations to 6,000. We also designed
our NSGA-II adaptation to be flexible so that we can configure the number of
objectives and which objectives to consider in the execution.
Table IX. Empirical Study Results on 31 Runs (Median & STDev). The Results Were Statistically Significant on
31 Independent Runs using the Wilcoxon Rank Sum Test with a 95% Confidence Level (p − value < 0.05) in
Terms of Defect Correction Ratio (DCR), Code Changes Score, Refactoring Precision (RP), and RP-Automatic
code before and after applying the refactorings. We believe that this required time is
quite acceptable compared to the time that the developer may spend to identify these
refactoring opportunities manually from hundreds or thousands of classes and millions
of lines of code. In addition, while the effect of refactoring clearly translates into fixing
the vast majority of design defects (84%) and significantly improving quality factors
(see Section 6.1), other effects on system quality (maintainability, extendibility,
etc.) cannot be assessed immediately.
Table X. Median Refactoring Results and Standard Deviation (STDev) of Different Objective Combinations with
NSGA-II (Average of All the Systems) on 31 Runs in Terms of Defect Correction Ratio (DCR), Refactoring
Precision (RP), Code Changes Reduction, and Recorded Refactorings (RR). The Results Were Statistically
Significant on 31 Independent Runs Using the Wilcoxon Rank Sum Test with a 95% Confidence Level
(p − value < 0.05)
Objectives combinations   DCR                Code changes       RR                 RP (empirical
                          Median   STDev     Median   STDev     Median   STDev     evaluation)
Q + CC                    75%      1.84      2591     87.12     N.A.     N.A.      45%
Q + S                     81%      1.93      4355     94.6      N.A.     N.A.      82%
Q + RC                    85%      2.16      3989     89.76     41%      2.87      54%
Q + S + RC                81%      1.56      3888     106.24    35%      3.21      84%
Q + S + RC + CC           84%      2.39      2917     97.91     36%      3.82      80%
10 Note that only for the RP metric did we not report the standard deviation, as we directly conducted the
qualitative evaluation with subjects on the suggested refactoring solution having the median DCR score
from 31 independent runs.
Fig. 7. Refactoring results of different objective combinations with NSGA-II in terms of (a) code changes
reduction (CC), (b) design preservation (RP), and (c) defect correction ratio (DCR).
To answer RQ1.4, we present the obtained results in Figure 7(b). The best RP
scores are obtained when the recorded code changes (RC) are considered (Q+S+RC),
while maintaining a good correction ratio (DCR) (Figure 7(c)). In addition, we performed
a more quantitative evaluation to investigate the effect of using recorded refactorings
on design coherence (RP). To this end, we compared the RP score with and without
recorded refactorings. In most of the systems, when recorded refactorings are combined
with semantics, the RP value improves. For example, for Apache Ant, RP is 83%
when only quality and semantics are considered; when recorded refactoring reuse is
included, RP improves to 87% (Figure 7(b)).
We also notice that when code change reduction is included along with quality, semantics,
and recorded changes, the RP and DCR scores are not significantly affected. Moreover,
we notice in Figure 7(c) that there is no significant variation in DCR across the
different objective combinations. When the four objectives are combined, the DCR
value degrades slightly, with an average of 82% over all systems, which is still
a promising result. Thus, the slight loss in the defect correction ratio is
largely compensated by the significant improvement in design coherence and code
changes reduction. Moreover, we found that the optimal refactoring solutions found by
our approach are obtained with a considerable percentage of reused refactoring history
(RR) (more than 35% as shown in Table X). Thus, the obtained results support the
claim that recorded refactorings applied in the past are useful to generate coherent and
meaningful refactoring solutions and can effectively drive the refactoring suggestion
task.
In conclusion, we found that the best compromise among the four objectives is obtained
using NSGA-II, compared to using only two or three objectives. By
default, the tool considers all four objectives when finding refactoring solutions. Thus, a
software engineer can treat the multi-objective algorithm as a black box and does
not need to configure anything related to the objectives. Based on our experimental
results, all four objectives should be considered, and there is no need for the user to
select among them.
Results for RQ2: To answer RQ2, we evaluate the efficiency of our approach compared
to the two other contributions of Harman and Tratt [2007] and Kessentini et al.
[2011]. Harman and Tratt [2007] proposed a multi-objective approach that uses two
quality metrics, CBO and SDMPC, to be improved after applying the refactoring sequence.
In Kessentini et al. [2011], a single-objective genetic algorithm is used to correct defects
by finding the best refactoring sequence that reduces the number of defects. The
comparison is performed in terms of (1) the defect correction ratio (DCR), calculated
using defect detection rules; (2) refactoring precision (RP), which represents the results of
the subject judgments (Scenario 1); and (3) the code changes needed to apply the suggested
refactorings. We adapted our technique for calculating code change scores to both
the approaches of Harman and Tratt [2007] and Kessentini et al. [2011]. Table IX summarizes
our findings and reports the median values and standard deviation (STDev)
of each of our evaluation metrics over 31 simulation runs on all projects.
As described in Table IX, after applying the proposed refactoring operations, we
found that, on average over the six studied systems, more than 84% of the detected
defects were fixed (DCR). This score is comparable to the correction score of Kessentini
et al. (89%), an approach that does not consider design coherence preservation, code
change reduction, or recorded refactoring reuse (DCR is not reported for Harman
et al., since their aim is only to improve some quality metrics).
Regarding semantic coherence, for all six studied systems, an average of
80% of the proposed refactoring operations are considered semantically feasible and do
not generate design incoherence. This score is significantly higher than the scores of
the two other approaches, which achieve only 36% and 34% RP, respectively. Thus,
our approach clearly performs better in terms of RP and code change score, at the cost of a
slight degradation in DCR compared to Kessentini et al. This slight loss in DCR is
largely compensated by the significant improvement in design coherence and
code change reduction.
We compared the three approaches in terms of automatic RErecall, as depicted in
Figure 8. We found that a considerable number of the proposed refactorings, an average
of 36% in terms of recall over all studied systems, were already applied to the next
version by the software development team. By comparison, the results for Harman
et al. and Kessentini et al. are only 4% and 9%, respectively, as reported in Figure 8(b).
This score shows that our approach is useful in practice, unlike the two other
approaches. In fact, the RErecall of Harman et al. is not significant, since only the Move
Method refactoring is considered when searching for refactoring solutions to improve
coupling and the standard deviation of methods per class. Moreover, the expected
refactorings relate not only to quality improvement but also to adding new functionality
and other maintenance tasks; this is not considered by our approach when searching for the
optimal refactoring solution that satisfies our four objectives. However, we manually
ACM Transactions on Software Engineering and Methodology, Vol. 25, No. 3, Article 23, Publication date: May 2016.
23:30 A. Ouni et al.
Fig. 8. Automatic refactoring score (RE_recall) comparison between our approach (NSGA-II), Harman et al.
and Kessentini et al.
inspected expected refactorings, and we found that they are mainly related to adding
new functionality (related to adding new packages, classes, or methods).
In conclusion, our approach produces good refactoring suggestions in terms of defect-
correction ratio, design coherence, and code change reduction from the point of view of
(1) potential users of our refactoring tool and (2) expected refactorings applied to the
next program version.
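As an illustration of the automatic evaluation above, RE_recall can be computed by intersecting the suggested refactorings with those actually applied in the next release (e.g., as mined with Ref-Finder). The tuple representation and the exact-match comparison below are simplifying assumptions; the element names are hypothetical.

```python
# Hedged sketch of the automatic RE_recall metric: the fraction of expected
# refactorings (those actually applied in the next release) that also appear
# among the suggestions. Refactorings are modeled as (type, source, target)
# tuples; the authors' matching procedure may be more elaborate.

def re_recall(suggested: set, expected: set) -> float:
    if not expected:
        return 0.0
    return len(suggested & expected) / len(expected)

expected = {("move_method", "ClassA.m1", "ClassB"),
            ("extract_class", "ClassC", "ClassD"),
            ("pull_up_method", "ClassE.m2", "ClassF")}
suggested = {("move_method", "ClassA.m1", "ClassB"),
             ("inline_class", "ClassG", "ClassH")}
print(re_recall(suggested, expected))  # → 0.333...
```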
Furthermore, to justify the use of NSGA-II, we compared the performance of our proposal to two other multi-objective algorithms, MOGA and random search (RS), and to a mono-objective genetic algorithm (GA). In random search, the change operators (crossover and mutation) are not used; populations are generated randomly and evaluated using the four objective functions. In our mono-objective adaptation, we considered a single fitness function, the normalized average score of the four objectives, using a genetic algorithm. Moreover, since in our NSGA-II adaptation we select a single solution without giving more importance to some objectives, we give equal weights to each fitness function value. As shown in Figure 9, NSGA-II significantly outperforms MOGA, random search, and the mono-objective algorithm in terms of DCR, semantic coherence preservation (RP), and code change reduction. For instance, in JFreeChart, NSGA-II performs much better than MOGA, random search, and the genetic algorithm in terms of DCR and RP scores (Figures 9(a) and 9(b), respectively). In addition, NSGA-II significantly reduces code changes for all studied systems. For example, for Rhino, the number of code changes was reduced to almost half compared to random search, as shown in Figure 9(c).
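The mono-objective adaptation described above, a normalized and equally weighted average of the four objectives, can be sketched as follows. The objective values and the normalization bound for code changes are illustrative assumptions.

```python
# Sketch of the mono-objective GA fitness described above: the normalized
# average of the four objectives with equal weights. DCR, RP, and RR are
# already in [0, 1] and maximized; CC (code changes) is minimized, so it is
# normalized and inverted before averaging. Bounds are illustrative.

def normalize(value, lo, hi):
    return (value - lo) / (hi - lo) if hi > lo else 0.0

def mono_fitness(dcr, rp, cc, rr, cc_max):
    scores = [dcr,                            # defect-correction ratio
              rp,                             # design-coherence score
              1.0 - normalize(cc, 0, cc_max), # fewer code changes is better
              rr]                             # reuse of recorded refactorings
    return sum(scores) / len(scores)          # equal weights for each objective

print(round(mono_fitness(dcr=0.84, rp=0.80, cc=300, rr=0.36, cc_max=1000), 3))  # → 0.675
```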
Furthermore, an interesting finding is that RS works about as well as the single-objective GA. We used a multi-objective version of RS, obtained by switching off individual selection based on fitness values in our original framework. The performance of RS is clearly worse than that of the other multi-objective algorithms being compared (NSGA-II and MOGA). Still, some of the RS results can be considered acceptable, which can be explained by the limited number of refactoring types considered in our experiments (a limited search space). For GA, after 2,000 generations, we noticed that the search produced entire populations with high DCR and CC values but lower S and RR values, which resulted in a relative increase in the combined fitness function and led to results comparable to those of the multi-objective RS. The comparable results between RS and GA suggest that formulating the refactoring recommendation problem as a multi-objective problem is adequate.
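The random-search baseline, with fitness-based selection switched off, can be sketched as follows. The refactoring types, solution length, and toy evaluation function are illustrative; only the non-dominated archive logic reflects the multi-objective setup described above.

```python
# Minimal sketch of the random-search baseline: candidate refactoring
# sequences are generated and evaluated on the four objectives, but no
# fitness-based selection, crossover, or mutation is applied; only the
# non-dominated solutions seen so far are retained.

import random

REFACTORING_TYPES = ["move_method", "move_field", "extract_class"]  # illustrative

def random_solution(length=5):
    return [random.choice(REFACTORING_TYPES) for _ in range(length)]

def dominates(a, b):
    """a dominates b if it is no worse on all objectives and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def random_search(evaluate, iterations=1000):
    archive = []  # list of (solution, objective_vector) pairs
    for _ in range(iterations):
        sol = random_solution()
        obj = evaluate(sol)
        if not any(dominates(o, obj) for _, o in archive):
            archive = [(s, o) for s, o in archive if not dominates(obj, o)]
            archive.append((sol, obj))
    return archive

# Toy evaluation returning four objective values to maximize.
front = random_search(lambda s: tuple(random.random() for _ in range(4)))
print(len(front) >= 1)  # → True
```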
Another interesting observation from the results in Figure 9 is that MOGA has fewer
code changes and higher RP values than NSGA-II in Apache Ant. By looking at the
produced results, we noticed that none of the blob design defects was fixed in Apache Ant using MOGA. Indeed, the blob design defect is known as one of the most difficult design defects to fix and typically requires a large number of refactoring operations and code changes (several extract class, move method, and move field refactorings). This also explains the higher RP score, as it is also complicated for developers to approve such refactorings.
Fig. 9. Refactoring results of different algorithms NSGA-II, MOGA, GA, and RS in terms of (a) defects correction ratio, (b) refactoring precision, and (c) code changes reduction.
For all experiments, we observed a large difference between the NSGA-II results and the mono-objective approaches (Harman et al., Kessentini et al., GA, and random search) across all evaluation metrics. However, when comparing NSGA-II against MOGA, we found the following: (a) on small- and medium-scale software systems (JFreeChart, Rhino, and GanttProject), NSGA-II is better than MOGA on most systems with a small to medium effect size, and (b) on large-scale software systems (Xerces-J, Apache Ant, and JDI-Ford), NSGA-II is better than MOGA on most systems with a high effect size.
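The article does not state which effect-size statistic was used; a common choice in search-based software engineering is Vargha and Delaney's A12, sketched here under that assumption with illustrative per-run scores.

```python
# Assumed illustration: the Vargha-Delaney A12 effect-size statistic, a
# common choice in search-based software engineering (the article does not
# specify which statistic it used). A12 estimates the probability that a
# run of algorithm A beats a run of algorithm B; 0.5 means no difference.

def a12(a_runs, b_runs):
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a_runs for y in b_runs)
    return wins / (len(a_runs) * len(b_runs))

nsga2_dcr = [0.86, 0.84, 0.88, 0.85]  # illustrative per-run DCR values
moga_dcr = [0.80, 0.82, 0.79, 0.81]
print(a12(nsga2_dcr, moga_dcr))  # → 1.0 (NSGA-II better in every pairing)
```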
Fig. 10. Comparison results of our approach (NSGA-II) with JDeodorant in terms of defects correction ratio
(DCR), design coherence (RP), and code changes score (CC) for each system. For NSGA-II, we report the
average DCR and CC scores and standard deviation obtained through 31 independent runs. Note that for
RP score, we did not report the standard deviation, as we directly conducted the qualitative evaluation with subjects on the suggested refactoring solution that has the median DCR score.
Results for RQ3: JDeodorant uses only structural information to detect and fix design defects and does not handle all six design defect types considered in our experiments. Thus, to make the comparison fair, we restricted it to the two design defects that can be fixed by both tools: blob and feature envy. Figure 10 summarizes our findings for the blob (Figure 10(a)) and feature envy (Figure 10(b)). Our proposal outperforms JDeodorant, on average, on all the systems in terms of the number of fixed defects, with a minimum number of changes and semantically coherent refactorings. The average number of fixed code smells is comparable between the two tools; however, our approach is clearly better in terms of semantically coherent refactorings. This can be explained by the fact that JDeodorant uses only structural metrics to evaluate the impact of suggested refactorings on the detected code smells. In addition, our proposal supports more types of refactorings than JDeodorant, which also explains its better performance. However, one advantage of JDeodorant is that its suggested refactorings are easier to apply than those proposed by our technique: it provides an Eclipse plugin that suggests and then automatically applies a total of four types of refactorings, while the current version of our tool requires developers to apply the refactorings, including more complex types, manually using the Eclipse IDE.
Fig. 11. Comparison of our refactoring results with manual refactorings in terms of Precision.
Results for RQ4: To evaluate the relevance of our suggested refactorings with our subjects, we compared the refactoring strategies proposed by our technique to those proposed manually by groups G and H (six subjects) to fix several defects in the six systems. Figure 11 shows that most of the refactorings suggested by NSGA-II are similar to those applied by developers, with an average of more than 73%. Some defects can be fixed by different refactoring strategies, and the same solution can be expressed in different ways (complex and atomic refactorings). We therefore consider that an average precision of more than 73% confirms the efficiency of our tool in automating the refactoring recommendation process for real developers. In the next section, we discuss in more detail the relevance of our automated refactoring approach for software engineers.
6. DISCUSSIONS
The results obtained in Section 5.3 suggest that our approach performs better than two existing approaches. We also compared different objective combinations and found that the best compromise is obtained with the four objectives using NSGA-II, compared to using only two or three objectives. Therefore, our four objectives are effective for providing “good” refactoring suggestions. Moreover, we found that the results achieved by NSGA-II outperform those achieved by the other multi-objective algorithms, MOGA and random search, and the mono-objective algorithm, GA.
We now provide more quantitative and qualitative analyses of our results and discuss
some observations drawn from our empirical evaluation of our refactoring approach.
We aim at answering the following research questions:
—RQ5: What is the effect of suggested refactorings on the overall quality of systems?
—RQ6: What is the effect of multiple executions on the refactoring results?
—RQ7: What is the distribution of the suggested refactoring types?
In the following subsections we answer each of these research questions.
the advantage of defining six high-level design quality attributes (reusability, flexibility,
understandability, functionality, extendibility, and effectiveness) that can be calculated
using 11 lower-level design metrics [Bansiya and Davis 2002]. In our study we consider
the following quality attributes:
—Reusability: The degree to which a software module or other work product can be
used in more than one computer program or software system.
—Flexibility: The ease with which a system or component can be modified for use in
applications or environments other than those for which it was specifically designed.
—Understandability: The properties of a design that enable it to be easily learned and comprehended. This directly relates to the complexity of the design structure.
—Effectiveness: The degree to which a design is able to achieve the desired functionality
and behavior using OO design concepts and techniques.
We did not assess the issue of functionality because we assume that, by definition,
refactoring does not change the behavior/functionality of systems; instead, it changes
the internal structure. We have also excluded the extendibility factor because it is, to
some extent, a subjective quality factor and using a model of merely static measures
to evaluate extendibility is inadequate. Tables XI and XII summarize the QMOOD
formulation of these quality attributes [Bansiya and Davis 2002].
The improvement in quality can be assessed by comparing the quality before and after refactoring, independently of the number of fixed design defects. Hence, the total gain in quality G for each of the considered QMOOD quality attributes q_i before and after refactoring can be estimated as:
G_{q_i} = q'_i − q_i (17)
where q'_i and q_i denote the value of quality attribute i after and before refactoring, respectively.
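Equation (17) translates directly into code; the attribute values below are purely illustrative.

```python
# Direct transcription of Equation (17): the gain for each QMOOD quality
# attribute is its value after refactoring minus its value before.
# All attribute values here are illustrative.

def quality_gain(before: dict, after: dict) -> dict:
    return {attr: after[attr] - before[attr] for attr in before}

before = {"reusability": 0.42, "flexibility": 0.31,
          "understandability": -1.10, "effectiveness": 0.55}
after = {"reusability": 0.53, "flexibility": 0.44,
         "understandability": -0.76, "effectiveness": 0.64}
gains = quality_gain(before, after)
print(round(gains["understandability"], 2))  # → 0.34
```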
In Figure 12, we show the obtained gain values (in terms of absolute value) that
we calculated for each QMOOD quality attribute before and after refactoring for each
studied system. We found that system quality increases across the four QMOOD
Fig. 12. The impact of best refactoring solutions on QMOOD quality attributes.
quality factors, much more than with existing approaches. Understandability is the quality factor with the highest gain value, whereas Effectiveness has the lowest. This is mainly due to two reasons: (1) the majority of fixed design
defects (blob, spaghetti code) are known to increase the coupling (DCC) within classes,
which heavily affects the quality index calculation of the Effectiveness factor, and
(2) the vast majority of suggested refactoring types were move method, move field,
and extract class (Figure 12), which are known to have a high impact on coupling
(DCC), cohesion (CAM), and the design size in classes (DSC) that serves to calculate
the understandability quality factor. Furthermore, we noticed that JHotDraw produced the lowest quality increase across the four quality factors. This is explained by the fact that JHotDraw is known for its good design and implementation practices [Kessentini et al. 2010] and contains a small number of design defects compared to the five other studied systems.
To summarize, we can conclude that our approach succeeded in improving code quality not only by fixing the majority of detected design defects but also by improving the understandability, reusability, flexibility, and effectiveness of the refactored program.
Fig. 13. Results of multiple executions on different projects in terms of defect correction ratio (DCR), code changes (CC), reused refactorings (RR), and execution time (Time).
Finally, it is worth noting that since the application of refactorings to fix design defects is a subjective process, it is normal that not all programmers have the same opinion. Thus it is important to study the level of agreement between subjects. To address this issue, we evaluated the level of agreement using Cohen's Kappa coefficient κ [Cohen et al. 1960], which measures the extent to which the subjects agree when voting for a recommended refactoring operation. The Kappa coefficient was 0.78, which is characterized as “substantial agreement” by Landis and Koch [1977]. This score makes us more confident that our suggested refactorings are meaningful from a software engineer's perspective.
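Cohen's kappa can be computed as follows for two raters voting on each recommended refactoring; the votes shown are illustrative, and the study's exact rater pairing is not reproduced here.

```python
# Sketch of Cohen's kappa for two raters voting accept/reject on each
# recommended refactoring: kappa = (p_o - p_e) / (1 - p_e), where p_o is
# the observed agreement and p_e the agreement expected by chance.
# The votes below are illustrative.

from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return (p_o - p_e) / (1 - p_e)

r1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
r2 = ["yes", "yes", "no", "yes", "yes", "yes", "yes", "no"]
print(round(cohens_kappa(r1, r2), 2))  # → 0.71
```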
Fig. 14. Impact of the number of design defects on the size of the refactoring solution (number of
refactorings).
Fig. 15. Comparison of the execution time for each of the search techniques: NSGA-II, Harman et al.,
Kessentini et al., MOGA, GA, and RS.
In addition, Figure 15 reports the execution time for each of the following search
algorithms: NSGA-II, Harman et al., Kessentini et al., MOGA, GA, and RS. As shown
in the figure, the execution time of our NSGA-II approach was very similar to MOGA,
with an average of less than 48 minutes per system. The execution time of random search was half the time spent by NSGA-II and MOGA, but the quality of the random search solutions is much lower. The performance of NSGA-II is slightly better than that of MOGA based on the different evaluation metrics; however, adapting NSGA-II to our refactoring problem is more complex than adapting MOGA. It is expected that the execution time of the remaining mono-objective approaches is almost half that of NSGA-II for the following reasons: (1) they consider just one objective function; (2) the time-consuming semantic and history functions of our approach, which require additional processing, filtering, and comparison of the identifiers within classes, are not used by the existing mono-objective approaches; and (3) existing mono-objective approaches are limited to a few types of refactorings.
Since our refactoring problem is not a real-time one, the execution time of NSGA-II was considered acceptable by all the programmers in our experiments. In fact, they mentioned that they do not need to use the tool daily; they can execute it at the end of the day and check the results the next day.
The software engineers from Ford manually evaluated all the refactorings recommended by our tool for JDI using the RP metric described in the previous section, based on their knowledge of the system, since they are some of its original developers. We also evaluated the relevance of some of the suggested refactorings for the developers. In addition, we asked 4 of the 10 software engineers from Ford to manually refactor some code fragments of poor quality, and then we compared their suggested refactorings with those recommended by our approach. To make decisions about the quality of a code fragment, we used the domain knowledge of the 10 programmers from Ford (since they are part of the original developers of the system) along with quality metrics and the detected design defects (to guide developers in identifying a list of refactoring opportunities). We then defined a metric, ER, that represents the ratio of the number of good refactoring recommendations to the number of expected refactorings. The four selected software engineers are part of the original developers of the JDI system and thus easily provided different refactoring suggestions.
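Under the simplifying assumption that both the expected refactorings (written by the engineers) and the recommendations are available as comparable tuples (the element names below are hypothetical), the ER metric can be sketched as:

```python
# Sketch of the ER metric described above: the ratio of good refactoring
# recommendations (those matching an expected refactoring) to the number of
# expected refactorings. Tuples and element names are illustrative.

def er(recommended: set, expected: set) -> float:
    if not expected:
        return 0.0
    good = recommended & expected
    return len(good) / len(expected)

expected = {("extract_class", "ClassA"),
            ("move_method", "ClassB.m1"),
            ("extract_method", "ClassC.run")}
recommended = {("extract_class", "ClassA"),
               ("move_method", "ClassB.m1"),
               ("move_field", "ClassD.f1")}
print(er(recommended, expected))  # → 0.666...
```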
In this section, we aim to answer the following two questions:
(1) To what extent can our approach propose correct refactoring recommendations?
(2) To what extent are the suggested refactorings relevant and useful for software engineers?
We first describe the subjects who participated in our study. Second, we give details about the questionnaire, the instructions, and the conducted pilot study. Finally, we describe and discuss the obtained results.
7.1. Subjects
Our study involved 10 software engineers from the Ford Motor Company. All the subjects are familiar with Java development and software maintenance activities, including refactoring. Their experience in Java programming ranged from 4 to 17 years. They were selected, as part of a project funded by Ford, based on having similar development skills, their motivation to participate in the project, and their availability. They are part of the original development team of the JDI system.
7.2. Pilot Study
Subjects were first asked to fill out a pre-study questionnaire containing six questions. The questionnaire helped collect background information such as their role within the company, their contribution to the development of JDI, their programming experience, and their familiarity with quality assurance and software refactoring.
We divided the subjects into five groups (two developers per group) to evaluate
the correctness and the relevance of the recommended refactorings according to the
number of refactorings to evaluate and the results of the pre-study questionnaire. All
the groups are similar, on average, in terms of programming experience and familiarity
with the system and used tools, and have almost the same refactoring and code smells
background. The study consists of two parts:
(1) The first part of the questionnaire includes questions to evaluate the correctness of
the recommended refactoring using the following options: 1. Not correct; 2. Maybe
Correct; and 3. Correct.
(2) The second part of the questionnaire includes questions around the relevance of
the recommended refactorings using the following scale: 1. Not at all relevant; 2.
Slightly relevant; 3. Moderately relevant; and 4. Extremely relevant.
The questionnaire was completed anonymously, thus ensuring confidentiality, and this study was approved by the IRB at the University of Michigan: “Research involving the collection or study of existing data, documents, records, pathological specimens, or
Fig. 18. Quality improvements on JDI-Ford after applying the recommended refactorings.
In fact, the incorrect refactorings are due to conflicts arising from combining both complex and atomic refactorings in our solutions. Although our repair operator eliminates detected identical redundant refactorings within one solution, it is challenging to detect such issues when dealing with both complex and atomic refactorings. For example, an extract class is composed of several atomic refactorings, such as create new class, move method, move attribute, and redirect method calls. Thus, it is challenging to eliminate some conflicts between atomic and complex refactorings when redundancy is involved. A possible solution is to convert all complex refactorings to atomic ones and then perform the comparison to detect redundancy. However, the conversion process is not straightforward, since one complex refactoring can be translated into atomic refactorings in different ways.
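The conversion idea sketched above might look as follows; the complex-to-atomic mapping is illustrative, and a real implementation would also have to match source and target elements, not just refactoring types.

```python
# Sketch of the possible solution mentioned above: expanding complex
# refactorings into atomic ones so that redundancy between a complex
# refactoring and an overlapping atomic one can be detected. The mapping
# is illustrative; as noted in the text, a complex refactoring can be
# expanded in more than one way.

COMPLEX_TO_ATOMIC = {
    "extract_class": ["create_class", "move_method", "move_field",
                      "redirect_method_calls"],
}

def expand(refactoring: str) -> list:
    """Expand a complex refactoring; atomic ones expand to themselves."""
    return COMPLEX_TO_ATOMIC.get(refactoring, [refactoring])

def redundant_ops(solution: list) -> list:
    """Report atomic operations that already occur inside an earlier refactoring."""
    seen = set()
    conflicts = []
    for ref in solution:
        for atomic in expand(ref):
            if atomic in seen:
                conflicts.append(atomic)
            seen.add(atomic)
    return conflicts

sol = ["extract_class", "move_method"]  # move_method duplicated inside extract_class
print(redundant_ops(sol))  # → ['move_method']
```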
To better investigate the relevance of the recommended refactorings, we evaluated their impact on the quality of JDI-Ford based on QMOOD. Figure 18 depicts the quality attribute improvements of the JDI system after applying (i) all recommended refactorings (104 refactorings) and (ii) only the refactorings selected by the developers (87 refactorings). The obtained results show that our approach succeeded in improving different aspects of software quality, including reusability (an improvement of 0.11), flexibility (0.13), understandability (0.34), and effectiveness (0.09). An interesting point here is that the results achieved by the selected refactorings (87 of 104) outperform those achieved by all the recommended refactorings (104) in terms of understandability and reusability. This finding provides evidence that although developers seek
to improve the overall quality of their code, they prioritize understandability and reusability over other quality aspects. Indeed, we expected that developers would
mainly apply refactorings that improve the readability and understandability of their
code.
Moreover, we asked the developers to evaluate the relevance of the recommended refactorings for the JDI-Ford system. Less than 5% of the recommended refactorings were considered not at all relevant by the software engineers, 7% slightly relevant, and 19% moderately relevant, while 69% were considered extremely relevant. Moreover, the assessment of Cohen's Kappa coefficient κ [Cohen et al. 1960], which measures the extent to which the developers agree when voting for a recommended refactoring, indicates a score of κ = 0.79, characterized as “substantial agreement” by Landis and Koch [1977]. This confirms that developers consider the recommended refactorings important and worth applying to improve the quality of their systems.
To get more insight into the 5% of refactorings voted “not at all relevant,” we asked the developers to comment on some particular cases. We noticed that most of these rejected refactorings were related to utility classes in JDI, where move method refactorings were suggested to move some utility methods to the classes that call them. Developers mentioned that this kind of refactoring tends to be meaningless.
To better evaluate the relevance of the recommended refactorings, we investigated
the types of refactorings that developers may consider more or less important than
others. Figure 19 shows that move method is considered as one of the most extremely
relevant refactorings. In addition, extract method is also considered as another very
important and useful refactoring. This can be explained by the fact that the developers
are more interested in fixing quality issues that are related to the size of classes
or methods. Overall, the different types of refactorings are considered relevant. One
reason can be that our approach provides a sequence of refactorings.
It was clear to our participants that our tool can quickly provide results similar to those they would suggest manually; fixing several quality issues in large-scale systems via refactoring can be time-consuming. The participants provided some suggestions to make our refactoring tool better and more efficient. First, the tool does not provide any
ranking to prioritize the suggested refactorings. In fact, developers do not have enough time to apply all the suggested refactorings and prefer to fix the most severe quality issues first. Second, our technique does not provide support for revising refactoring solutions when developers do not approve part of the suggested refactorings. Finally, the software engineers would prefer that our tool provide a feature to automatically apply regression-testing techniques to generate test cases for the code fragments modified by refactoring. Such a feature would be very interesting to include in our tool to automatically test the Java refactoring engine, similarly to SafeRefactor [Soares et al. 2013].
8. THREATS TO VALIDITY
Some potential threats can affect the validity of our experiments. We now discuss these
potential threats and how we deal with them.
Construct validity concerns the relation between the theory and the observation. In our experiments, the design defect detection rules [Ouni et al. 2012a] we used to measure DCR could be questionable. To mitigate this threat, we manually inspected and validated each detected defect. Moreover, our refactoring tool configuration is flexible and can support other state-of-the-art detection rules. In addition, different threshold values were used in our experiments based on trial and error; however, these values can be configured once and then used independently of the system being evaluated. Another threat concerns the data about the actual refactorings of the studied systems. In addition to the documented refactorings, we used Ref-Finder, which is known to be effective [Prete et al. 2010]. Indeed, Ref-Finder was able to detect refactoring operations with an average recall of 95% and an average precision of 79% [Prete et al. 2010]. To ensure precision, we manually inspected the refactorings found by Ref-Finder. We identify three threats to internal validity: selection, learning and fatigue, and diffusion.
For the selection threat, subject diversity in terms of profile and experience could affect our study. First, all subjects were volunteers. We also mitigated the selection threat by providing written guidelines and examples of refactorings already evaluated, with arguments and justification. Additionally, each group of subjects evaluated different refactorings from different systems for different techniques/algorithms. Furthermore, the selection of refactorings to be evaluated for each refactoring solution was completely random.
Randomization also helps prevent learning and fatigue threats. For the fatigue threat, specifically, we did not limit the time to fill out the questionnaire for the open-source systems: we sent the questionnaires to the subjects by email and gave them enough time to complete the tasks. Finally, only 10 refactorings per system were randomly picked for the evaluation; however, all refactoring solutions were evaluated for the industrial system.
The diffusion threat is limited in our study because most of the subjects were geographically located in three different universities and a company, and the majority did not know each other. The few who were in the same location were instructed not to share information about the experiment prior to the study.
Conclusion validity deals with the relation between the treatment and the outcome. To account for the heterogeneity of subjects and their differences, we took special care to diversify them in terms of professional status, university/company affiliation, gender, and years of experience, and we organized subjects into balanced groups. That said, we plan to test our tool with Java development companies to draw stronger conclusions. Moreover, the automatic evaluation also limits the threats related to subjects, as it helps ensure that our approach is efficient and
useful in practice. Indeed, we compare our suggested refactorings with the expected
ones that are already applied to the next releases and detected using Ref-Finder.
Another potential threat relates to parameter selection. We selected the parameters of our NSGA-II algorithm, such as the population size, the maximum number of iterations, the mutation and crossover probabilities, and the solution length, based on trial and error, depending on the size of the evaluated systems, the initial number of detected design defect instances, and the number of refactoring types implemented in our tool (11 types, Table II). However, as these parameters are independent of each other, they can easily be configured according to the preferences of the developers, for example, if they want to reduce the execution time (e.g., by reducing the number of iterations) at the possible cost of some solution quality.
Also, when comparing the different approaches, some of them use fewer types of refactorings. We believe that this is one of the limitations of these approaches; thus, it is interesting to show that considering the 11 types of refactorings of our approach may improve the results (even if programmers may apply some of them less frequently). Furthermore, when comparing the different approaches from the effort perspective, the code change score is relative to the DCR level, since not all design defects require the same amount of code changes. The process prioritizes the correction of design defects that require fewer changes in order to achieve a higher DCR score. In addition, our results were consistent across all the DCR levels for all the systems.
External validity refers to the generalizability of our findings. In this study, we
performed our experiments on different open-source and industrial Java systems belonging to different application domains and of different sizes. However, we cannot assert that our results can be generalized to other programming languages or to other practitioners.
The industrial validation section was checked by the Ford Motor Company. Our industrial partner agreed to include only the results mentioned in the current validation section, for several reasons. As in most collaborations with industry, we are not allowed to mention the names of code elements or to provide examples from the source code. One of our motivations for using open-source systems in our validation was the hard constraint of not sharing the industrial data. Thus, readers can at least check different examples of suggested refactorings on the open-source systems on the website provided with this article.
9. RELATED WORK
Several studies have focused on software refactoring in recent years. In this section, we survey those works, which can be classified into three broad categories: (i) manual and semi-automated approaches, (ii) search-based approaches, and (iii) semantics-based approaches.
9.1. Manual and Semi-Automated Approaches
The first book on the topic was written by Fowler [1999]; it provides a non-
exhaustive list of low-level design problems in source code. For each design problem (i.e.,
design defect), a particular list of possible refactorings is suggested, to be applied
manually by software maintainers. After Fowler’s book was published, several approaches
have pursued the goal of leveraging refactoring to improve the quality metrics
of software systems. Sahraoui et al. [2000] proposed an approach to detect opportu-
nities for code transformations (i.e., refactorings) based on the study of the correlation
between certain quality metrics and refactoring changes. Consequently, different rules
are manually defined as combinations of metrics and thresholds to be used as indicators
for detecting refactoring opportunities. For each code smell, a pre-defined and stan-
dard list of transformations should be applied. In contrast, in our approach, we do not
ACM Transactions on Software Engineering and Methodology, Vol. 25, No. 3, Article 23, Publication date: May 2016.
23:46 A. Ouni et al.
classes. A similar technique was suggested by Fokaefs et al. [2012] to detect extract
class opportunities by analyzing dependencies between methods and classes. However,
such approaches are local; that is, they focus on a specific code fragment. In contrast,
our approach is generic and considers the effect of refactoring on the whole system.
Furthermore, several empirical studies [Kim et al. 2014; Negara et al. 2013; Franklin
et al. 2013; Alves et al. 2014] have recently been performed to understand the benefits
and risks of refactoring. These studies showed that the main risk refactorings
could introduce is the creation of bugs, but that several benefits could be
obtained, such as reducing the time programmers spend understanding existing
implemented features.
More details about current literature related to manual or semi-automated software
refactoring can be found in the following two recent surveys: Bavota et al. [2014b] and
Al Dallal [2015].
9.2. Search-Based Approaches
To automate refactoring activities, new approaches have emerged where search-based
techniques have been used. These approaches cast the refactoring problem as an opti-
mization problem, where the goal is to improve the design quality of a system based
mainly on a set of software metrics. After formulating the refactoring as an optimiza-
tion problem, several techniques can be applied for automating refactoring, for exam-
ple, genetic algorithms, simulated annealing, and Pareto optimality. Hence, we classify
these approaches into two main categories: (1) mono-objective and (2) multi-objective
optimization.
In the first category, the majority of existing work combines several metrics into
a single fitness function to find the best sequence of refactorings. Seng et al. [2006]
propose a single-objective search-based approach that uses a genetic algorithm to suggest a
list of refactorings to improve software quality. The search process uses a single fitness
function to maximize a weighted sum of several quality metrics. The employed metrics
are mainly related to class-level properties such as coupling, cohesion, complexity, and
stability, while a set of pre-conditions must be satisfied for each refactoring. These
conditions serve to preserve the program behavior (refactoring feasibility). However, in
contrast to our approach, this work does not consider the design coherence of the
refactored program and is limited to the move method refactoring only. Another similar
work [O’Keeffe and Cinnéide 2008] uses different local search-based techniques, such
as hill climbing and simulated annealing, to provide an automated refactoring support
based on the QMOOD metrics suite. Interestingly, they also found that the understand-
ability function yielded the greatest quality gain, in keeping with our observation in
Section 6.2.
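As a sketch of how such mono-objective formulations work: each candidate refactoring is scored by the metric deltas it causes, and a single weighted-sum fitness drives a local search such as hill climbing. The candidate names, metric deltas, and weights below are illustrative assumptions, not values taken from any of the cited approaches:

```python
import random

# Hypothetical per-refactoring metric deltas; in a real setting these come
# from recomputing coupling/cohesion metrics on the refactored design.
CANDIDATES = {
    "move_method_A":   {"coupling": -2.0, "cohesion": +1.0},
    "move_method_B":   {"coupling": -0.5, "cohesion": +2.5},
    "extract_class_C": {"coupling": +1.0, "cohesion": +3.0},
}
# Reward lower coupling and higher cohesion.
WEIGHTS = {"coupling": -1.0, "cohesion": +1.0}

def fitness(sequence):
    """Weighted sum of metric deltas over an ordered refactoring sequence."""
    return sum(WEIGHTS[m] * delta
               for ref in sequence
               for m, delta in CANDIDATES[ref].items())

def hill_climb(sequence, steps=100, seed=0):
    """Local search: mutate one refactoring at a time, keep improvements."""
    rng = random.Random(seed)
    best, best_fit = list(sequence), fitness(sequence)
    for _ in range(steps):
        neighbor = list(best)
        neighbor[rng.randrange(len(neighbor))] = rng.choice(sorted(CANDIDATES))
        if fitness(neighbor) > best_fit:
            best, best_fit = neighbor, fitness(neighbor)
    return best, best_fit
```

Collapsing all objectives into one weighted sum is precisely what the multi-objective formulations in the next category avoid, since the weights impose a fixed trade-off a priori.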
Qayum and Heckel [2009] considered the problem of refactoring scheduling as a
graph transformation problem. They expressed refactorings as a search for an optimal
path, using Ant Colony Optimization, in the graph where nodes and edges represent,
respectively, refactoring candidates and dependencies between them. However, the use
of graphs is limited to structural and syntactic information and considers neither the
design semantics nor the runtime behavior. Fatiregun et al. [2004] show how
search-based transformations could be used to reduce code size and construct amor-
phous program slices. However, they used small atomic-level transformations, and
their aim was to reduce program size rather than to improve its structure/quality.
Otero et al. [2010] introduced an approach that explores the addition of a refactoring
step into the genetic programming iteration. It consists of an additional loop in which
refactoring steps, drawn from a catalog, are applied to individuals of the population.
Jensen and Cheng [2010] have proposed an approach that supports composition of
design changes and makes the introduction of design patterns a primary goal of the
refactoring process. They used genetic programming and software metrics to identify
the most suitable set of refactorings to apply to a software design. Furthermore, Kilic
et al. [2011] explored the use of a variety of population-based approaches to search-
based parallel refactoring, finding that local beam search could find the best solutions.
However, these approaches still focus on specific refactoring types and do not consider
the design semantics.
Zibran and Roy [2011] formulated the scheduling of code clone refactoring
activities as a constraint satisfaction optimization problem to fix known duplicate
code smells. The proposed approach consists of applying a constraint programming
technique that aims to maximize benefits while minimizing refactoring efforts. An
effort model is used to estimate the effort required to refactor code clones in an object-
oriented codebase. Although there is a slight similarity between the proposed effort
model and our code changes score model [Ouni et al. 2012a], the proposed approach
does not ensure the design coherence of the refactored program.
In the second category, the first multi-objective approach was introduced by Harman
and Tratt [2007], as described earlier. More recently, Ó Cinnéide et al. [2012] proposed a
multi-objective search-based refactoring approach to conduct an empirical investigation
assessing some structural metrics and exploring the relationships between them. To this
end, they used a variety of search techniques (Pareto-optimal search, semi-random search)
guided by a set of cohesion metrics. The main weakness of all these approaches is that
they address neither design preservation, which is needed to obtain correct and meaningful
refactorings, nor the effort required to apply the refactorings; both are addressed in our
approach.
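The core mechanism shared by these multi-objective approaches, including the NSGA-II algorithm used in our work, is Pareto dominance: one solution dominates another only if it is at least as good on every objective and strictly better on at least one. A minimal sketch, with objective vectors such as (number of defects, code changes score), both to be minimized; the example vectors are illustrative only:

```python
def dominates(a, b):
    """True if vector a Pareto-dominates b (all objectives minimized):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep only the non-dominated objective vectors."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical solutions as (number of defects, code changes score) pairs:
# (3, 12) is dominated by (3, 10), so it is excluded from the front.
front = pareto_front([(3, 10), (2, 15), (4, 5), (3, 12)])
```

Rather than collapsing objectives into one weighted score, the search returns the whole non-dominated front, and the developer picks the trade-off between defect correction and change effort afterwards.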
ACKNOWLEDGMENTS
The authors thank the anonymous reviewers for their relevant feedback and useful comments that helped
them to improve this work. Also, the authors are grateful to the subject students from the University of
Michigan-Dearborn and developers who participated in their empirical study.
REFERENCES
Jehad Al Dallal. 2015. Identifying refactoring opportunities in object-oriented code: A systematic literature
review. Inform. Softw. Technol. 58 (2015), 231–249.
Everton L. G. Alves, Myoungkyu Song, and Miryung Kim. 2014. RefDistiller: A refactoring aware code
review tool for inspecting manual refactoring edits. In 22nd ACM SIGSOFT International Symposium
on Foundations of Software Engineering (FSE). ACM, New York, NY, 751–754.
Raquel Amaro, Rui Pedro Chaves, Palmira Marrafa, and Sara Mendes. 2006. Enriching wordnets with new
relations and with event and argument structures. In Computational Linguistics and Intelligent Text
Processing. Springer, Berlin, 28–40.
Nicolas Anquetil and Jannik Laval. 2011. Legacy software restructuring: Analyzing a concrete case. In 15th
European Conference on Software Maintenance and Reengineering (CSMR). 279–286.
Thomas Baar and Slaviša Marković. 2007. A graphical approach to prove the semantic preservation of
UML/OCL refactoring rules. In Perspectives of Systems Informatics. Springer, Berlin, 70–83.
Jagdish Bansiya and Carl G. Davis. 2002. A hierarchical model for object-oriented design quality assessment.
IEEE Trans. Softw. Eng. 28, 1 (2002), 4–17.
Gabriele Bavota, Andrea De Lucia, Andrian Marcus, and Rocco Oliveto. 2014a. Automating extract class
refactoring: An improved method and its evaluation. Empir. Softw. Eng. 19, 6 (2014), 1617–1664.
Gabriele Bavota, Andrea De Lucia, Andrian Marcus, and Rocco Oliveto. 2014b. Recommending refactoring
operations in large software systems. In Recommendation Systems in Software Engineering. Springer,
Berlin, 387–419.
Gabriele Bavota, Andrea De Lucia, Andrian Marcus, Rocco Oliveto, and Fabio Palomba. 2012. Supporting
extract class refactoring in Eclipse: The ARIES project. In 34th International Conference on Software
Engineering (ICSE). IEEE Press, Los Alamitos, CA, 1419–1422.
Gabriele Bavota, Andrea De Lucia, and Rocco Oliveto. 2011. Identifying extract class refactoring opportuni-
ties using structural and semantic cohesion measures. J. Syst. Softw. 84, 3 (2011), 397–414.
Gabriele Bavota, Malcom Gethers, Rocco Oliveto, Denys Poshyvanyk, and Andrea de Lucia. 2014. Improving
software modularization via automated analysis of latent topics and dependencies. ACM Trans. Softw.
Eng. Methodol. 23, 1 (2014), 4.
Gabriele Bavota, Rocco Oliveto, Andrea De Lucia, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. 2010. Play-
ing with refactoring: Identifying extract class opportunities through game theory. In 26th International
Conference on Software Maintenance (ICSM). 1–5.
Gabriele Bavota, Rocco Oliveto, Malcom Gethers, Denys Poshyvanyk, and Andrea De Lucia. 2014a. Method-
book: Recommending move method refactorings via relational topic models. IEEE Trans. Softw. Eng. 40,
7 (2014), 671–694.
Gabriele Bavota, Sebastiano Panichella, Nikolaos Tsantalis, Massimiliano Di Penta, Rocco Oliveto, and
Gerardo Canfora. 2014b. Recommending refactorings based on team co-maintenance patterns. In 29th
International Conference on Automated Software Engineering (ASE). 337–342.
William H. Brown, Raphael C. Malveau, and Thomas J. Mowbray. 1998. AntiPatterns: Refactoring Software,
Architectures, and Projects in Crisis. Wiley, New York, NY.
Mel Ó Cinnéide. 2001. Automated Application of Design Patterns: A Refactoring Approach. Ph.D. Disserta-
tion. Trinity College Dublin.
Norman Cliff. 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 114,
3 (1993), 494–509.
Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educat. Psychol. Measure. 20,
1 (1960), 37–46.
Anna Corazza, Sergio Di Martino, and Valerio Maggio. 2012. LINSEN: An efficient approach to split iden-
tifiers and expand abbreviations. In 28th International Conference on Software Maintenance (ICSM).
233–242.
Anna Corazza, Sergio Di Martino, Valerio Maggio, and Giuseppe Scanniello. 2011. Investigating the use of
lexical information for software system clustering. In 15th European Conference on Software Mainte-
nance and Reengineering (CSMR). 35–44.
Marco D’Ambros, Alberto Bacchelli, and Michele Lanza. 2010. On the impact of design flaws on software
defects. In 10th International Conference on Quality Software (QSIC). 23–31.
Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and Tamt Meyarivan. 2002. A fast and elitist multiobjective
genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6, 2 (2002), 182–197.
Ignatios Deligiannis, Martin Shepperd, Manos Roumeliotis, and Ioannis Stamelos. 2003. An empirical inves-
tigation of an object-oriented design heuristic for maintainability. J. Syst. Softw. 65, 2 (2003), 127–139.
K. Dhambri, H. Sahraoui, and P. Poulin. 2008. Visual detection of design anomalies. In 12th European
Conference on Software Maintenance and Reengineering (CSMR). 279–283.
Bart Du Bois, Serge Demeyer, and Jan Verelst. 2004. Refactoring-improving coupling and cohesion of existing
code. In 11th Working Conference on Reverse Engineering (WCRE). 144–151.
Len Erlikh. 2000. Leveraging legacy system dollars for e-business. IT Professional 2, 3 (2000), 17–23.
Deji Fatiregun, Mark Harman, and Robert M. Hierons. 2004. Evolving transformation sequences using
genetic algorithms. In 4th International Workshop on Source Code Analysis and Manipulation (SCAM).
65–74.
Deji Fatiregun, Mark Harman, and Robert M. Hierons. 2005. Search-based amorphous slicing. In 12th
Working Conference on Reverse Engineering (WCRE). IEEE, Los Alamitos, CA.
Norman E. Fenton and Shari Lawrence Pfleeger. 1998. Software Metrics: A Rigorous and Practical Approach
(2nd ed.). PWS, Boston, MA.
Marios Fokaefs, Nikolaos Tsantalis, Eleni Stroulia, and Alexander Chatzigeorgiou. 2011. JDeodorant: Iden-
tification and application of extract class refactorings. In 33rd International Conference on Software
Engineering (ICSE). 1037–1039.
Marios Fokaefs, Nikolaos Tsantalis, Eleni Stroulia, and Alexander Chatzigeorgiou. 2012. Identification and
application of extract class refactorings in object-oriented systems. J. Syst. Softw. 85, 10 (2012), 2241–
2260.
Carlos M. Fonseca and Peter J. Fleming. 1993. Genetic algorithms for multiobjective optimization:
Formulation, discussion and generalization. In 5th International Conference on Genetic Algorithms.
416–423.
Martin Fowler. 1999. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman, Boston,
MA.
Lyle Franklin, Alex Gyori, Jan Lahoda, and Danny Dig. 2013. LAMBDAFICATOR: From imperative to
functional programming through automated refactoring. In 35th International Conference on Software
Engineering (ICSE). 1287–1290.
Xi Ge and Emerson Murphy-Hill. 2014. Manual refactoring changes with automated refactoring validation.
In 36th International Conference on Software Engineering (ICSE). 1095–1105.
Mohamed Salah Hamdi. 2011. SOMSE: A semantic map based meta-search engine for the purpose of web
information customization. Appl. Soft Comput. 11, 1 (2011), 1310–1321.
Mark Harman, S. Afshin Mansouri, and Yuanyuan Zhang. 2012. Search-based software engineering: Trends,
techniques and applications. ACM Computing Surveys (CSUR) 45, 1 (2012), 11.
Mark Harman and Laurence Tratt. 2007. Pareto optimal search based refactoring at the design level. In 9th
Annual Conference on Genetic and Evolutionary Computation (GECCO). 1106–1113.
Paul Jaccard. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura.
Bull. Soc. Vaud. Sci. Natur. 37 (1901), 547–579.
Adam C. Jensen and Betty H. C. Cheng. 2010. On the use of genetic programming for automated refactor-
ing and the introduction of design patterns. In 12th Annual Conference on Genetic and Evolutionary
Computation (GECCO). 1341–1348.
Padmaja Joshi and Rushikesh K. Joshi. 2009. Concept analysis for class cohesion. In 13th European Confer-
ence on Software Maintenance and Reengineering (CSMR). Washington, DC, USA, 237–240.
Yoshio Kataoka, David Notkin, Michael D. Ernst, and William G. Griswold. 2001. Automated support for
program refactoring using invariants. In International Conference on Software Maintenance (ICSM).
IEEE Computer Society, Los Alamitos, CA, 736.
Marouane Kessentini, Wael Kessentini, Houari Sahraoui, Mounir Boukadoum, and Ali Ouni. 2011. Design
defects detection and correction by example. In 19th International Conference on Program Comprehen-
sion (ICPC). 81–90.
Marouane Kessentini, Stéphane Vaucher, and Houari Sahraoui. 2010. Deviance from perfection is a bet-
ter criterion than closeness to evil when identifying risky code. In 25th International Conference on
Automated Software Engineering (ASE). 113–122.
Hurevren Kilic, Ekin Koc, and Ibrahim Cereci. 2011. Search-based parallel refactoring using population-
based direct approaches. In 3rd International Symposium on Search Based Software Engineering (SS-
BSE). Springer, Berlin, 271–272.
Miryung Kim, Thomas Zimmermann, and Nachiappan Nagappan. 2014. An empirical study of refactoring
challenges and benefits at Microsoft. IEEE Trans. Softw. Eng. 40, 7 (2014), 633–649.
Giri Panamoottil Krishnan and Nikolaos Tsantalis. 2014. Unification and refactoring of clones. In Software
Evolution Week - IEEE Conference on Software Maintenance, Reengineering and Reverse Engineering
(CSMR-WCRE). 104–113.
J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data.
Biometrics 33, 1 (1977), 159–174.
Wei Li and Raed Shatnawi. 2007. An empirical study of the bad smells and class error probability in the
post-release object-oriented system evolution. J. Syst. Softw. 80, 7 (2007), 1120–1128.
Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology (1932).
Francesco Logozzo and Agostino Cortesi. 2006. Semantic hierarchy refactoring by abstract interpretation.
In Verification, Model Checking, and Abstract Interpretation. Springer, Berlin, 313–331.
Mika Mäntylä, Jari Vanhanen, and Casper Lassenius. 2003. A taxonomy and an initial empirical study of
bad smells in code. In International Conference on Software Maintenance (ICSM). IEEE, Los Alamitos,
CA, 381–384.
Radu Marinescu. 2004. Detection strategies: Metrics-based rules for detecting design flaws. In 20th Interna-
tional Conference on Software Maintenance (ICSM). 350–359.
Tom Mens and Tom Tourwé. 2004. A survey of software refactoring. IEEE Trans. Softw. Eng. 30, 2 (2004),
126–139.
Wiem Mkaouer, Marouane Kessentini, Adnan Shaout, Patrice Koligheu, Slim Bechikh, Kalyanmoy Deb, and
Ali Ouni. 2015. Many-objective software remodularization using NSGA-III. ACM Trans. Softw. Eng.
Methodol. 24, 3 (2015), 17.
Naouel Moha, Yann-Gaël Guéhéneuc, Laurence Duchien, and Anne-Francoise Le Meur. 2010. DECOR: A
method for the specification and detection of code and design smells. IEEE Trans. Softw. Eng. 36, 1 (Jan
2010), 20–36.
Naouel Moha, Amine Mohamed Rouane Hacene, Petko Valtchev, and Yann-Gaël Guéhéneuc. 2008. Refactor-
ings of design defects using relational concept analysis. In Formal Concept Analysis, Raoul Medina and
Sergei Obiedkov (Eds.). Lecture Notes in Computer Science, Vol. 4933. Springer, Berlin, 289–304.
Emerson Murphy-Hill and Andrew P. Black. 2008. Breaking the barriers to successful refactoring. In 30th
International Conference on Software Engineering (ICSE). 421–430.
Emerson Murphy-Hill and Andrew P. Black. 2010. An interactive ambient visualization for code smells. In
5th International Symposium on Software Visualization (VISSOFT). ACM, New York, NY, 5–14.
Emerson Murphy-Hill and Andrew P. Black. 2012. Programmer-friendly refactoring errors. IEEE Trans.
Softw. Eng. 38, 6 (2012), 1417–1431.
Emerson Murphy-Hill, Chris Parnin, and Andrew P. Black. 2012. How we refactor, and how we know it.
IEEE Trans. Softw. Eng. 38, 1 (2012), 5–18.
Stas Negara, Nicholas Chen, Mohsen Vakilian, Ralph E. Johnson, and Danny Dig. 2013. A comparative study
of manual and automated refactorings. In 27th European Conference on Object-Oriented Programming
(ECOOP). 552–576.
Mel Ó Cinnéide, Laurence Tratt, Mark Harman, Steve Counsell, and Iman Hemati Moghadam. 2012. Ex-
perimental assessment of software metrics using automated refactoring. In International Symposium
on Empirical Software Engineering and Measurement (ESEM). 49–58.
Mark O’Keeffe and Mel Ó Cinnéide. 2008. Search-based refactoring for software maintenance. J. Syst. Softw.
81, 4 (2008), 502–516.
William F. Opdyke. 1992. Refactoring: A Program Restructuring Aid in Designing Object-Oriented Application
Frameworks. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.
Fernando E. B. Otero, Colin G. Johnson, Alex A. Freitas, and Simon J. Thompson. 2010. Refactoring in auto-
matically generated programs. In 2nd International Symposium on Search Based Software Engineering
(SSBSE), Massimiliano Di Penta, Simon Poulding, Lionel Briand, and John Clark (Eds.). Benevento,
Italy.
Ali Ouni, Marouane Kessentini, and Houari Sahraoui. 2013. Search-based refactoring using recorded code
changes. In 17th European Conference on Software Maintenance and Reengineering (CSMR). 221–230.
Ali Ouni, Marouane Kessentini, Houari Sahraoui, and Mounir Boukadoum. 2012a. Maintainability defects
detection and correction: A multi-objective approach. Automat. Softw. Eng. 20, 1 (2012), 47–79.
Ali Ouni, Marouane Kessentini, Houari Sahraoui, and Mohamed Salah Hamdi. 2012b. Search-based refac-
toring: Towards semantics preservation. In 28th International Conference on Software Maintenance
(ICSM). 347–356.
Ali Ouni, Marouane Kessentini, Houari Sahraoui, and Mohamed Salah Hamdi. 2013. The use of develop-
ment history in software refactoring using a multi-objective evolutionary algorithm. In 15th Annual
Conference on Genetic and Evolutionary Computation (GECCO). 1461–1468.
Kyle Prete, Napol Rachatasumrit, Nikita Sudan, and Miryung Kim. 2010. Template-based reconstruction of
complex refactorings. In 26th International Conference on Software Maintenance (ICSM). 1–10.
Fawad Qayum and Reiko Heckel. 2009. Local search-based refactoring as graph transformation. In 1st
International Symposium on Search Based Software Engineering (SSBSE). 43–46.
Donald Bradley Roberts and Ralph Johnson. 1999. Practical Analysis for Refactoring. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.
Arnon Rotem-Gal-Oz, Eric Bruno, and Udi Dahan. 2012. SOA Patterns. Manning.
Houari Sahraoui, Robert Godin, Thierry Miceli, and others. 2000. Can metrics help to bridge the gap between
the improvement of OO design quality and its automation? In International Conference on Software
Maintenance (ICSM). 154–162.
Vicent Sales, Ricardo Terra, Luis Fernando Miranda, and Marco Tulio Valente. 2013. Recommending move
method refactorings using dependency sets. In 20th Working Conference on Reverse Engineering (WCRE).
232–241.
Giuseppe Scanniello, Anna D’Amico, Carmela D’Amico, and Teodora D’Amico. 2010. Using the Kleinberg
algorithm and vector space model for software system clustering. In 18th International Conference on
Program Comprehension (ICPC). 180–189.
Olaf Seng, Johannes Stammel, and David Burkhart. 2006. Search-based determination of refactorings for
improving the class structure of object-oriented systems. In 8th Annual Conference on Genetic and
Evolutionary Computation (GECCO). ACM, New York, NY, 1909–1916.
Raed Shatnawi and Wei Li. 2011. An empirical assessment of refactoring impact on software quality using
a hierarchical quality model. Int. J. Softw. Eng. Appl. 5, 4 (2011), 127–149.
Danilo Silva, Ricardo Terra, and Marco Tulio Valente. 2014. Recommending automated extract method
refactorings. In 22nd International Conference on Program Comprehension (ICPC). 146–156.
Gustavo Soares, Rohit Gheyi, and Tiago Massoni. 2013. Automated behavioral testing of refactoring engines.
IEEE Trans. Softw. Eng. 39, 2 (2013), 147–162.
Ladan Tahvildari and Kostas Kontogiannis. 2003. A metric-based approach to enhance design quality
through meta-pattern transformations. In 7th European Conference on Software Maintenance and
Reengineering (CSMR). 183–192.
Frank Tip and Jens Palsberg. 2000. Scalable propagation-based call graph construction algorithms. In 15th
Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA). 281–
293.
Nikolaos Tsantalis and Alexander Chatzigeorgiou. 2009. Identification of move method refactoring opportu-
nities. IEEE Trans. Softw. Eng. 35, 3 (2009), 347–367.
Nikolaos Tsantalis and Alexander Chatzigeorgiou. 2011. Identification of extract method refactoring oppor-
tunities for the decomposition of methods. J. Syst. Softw. 84, 10 (2011), 1757–1782.
Raja Vallée-Rai, Etienne Gagnon, Laurie Hendren, Patrick Lam, Patrice Pominville, and Vijay Sundaresan.
2000. Optimizing Java bytecode using the Soot framework: Is it feasible? In Compiler Construction, David
A. Watt (Ed.). Lecture Notes in Computer Science, Vol. 1781. Springer, Berlin, 18–34.
Atsushi Yamashita and Leon Moonen. 2013. Do developers care about code smells? An exploratory survey.
In 20th Working Conference on Reverse Engineering (WCRE). IEEE, Los Alamitos, CA, 242–251.
Minhaz F. Zibran and Chanchal K. Roy. 2011. A constraint programming approach to conflict-aware optimal
scheduling of prioritized code clone refactoring. In 11th International Working Conference on Source
Code Analysis and Manipulation (SCAM). 105–114.
Eckart Zitzler and Lothar Thiele. 1998. Multiobjective optimization using evolutionary algorithms–a com-
parative case study. In Parallel Problem Solving from Nature–PPSN V. Springer, Berlin, 292–301.