
A Multiobjective Genetic Algorithm To Find Communities in Complex Networks


418 IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 16, NO. 3, JUNE 2012

Clara Pizzuti

Manuscript received February 17, 2010; revised June 8, 2010, October 12, 2010, and January 17, 2011; accepted March 30, 2011. Date of current version May 24, 2012.
The author is with the Institute of High Performance Computing and Networking, National Research Council of Italy, Rende, Cosenza 87036, Italy (e-mail: pizzuti@icar.cnr.it).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TEVC.2011.2161090
1089-778X/$31.00 © 2012 IEEE

Abstract: A multiobjective genetic algorithm to uncover community structure in complex networks is proposed. The algorithm optimizes two objective functions able to identify densely connected groups of nodes having sparse inter-connections. The method generates a set of network divisions at different hierarchical levels in which solutions at deeper levels, consisting of a higher number of modules, are contained in solutions having a lower number of communities. The number of modules is automatically determined by the better tradeoff values of the objective functions. Experiments on synthetic and real life networks show that the algorithm successfully detects the network structure and that it is competitive with state-of-the-art approaches.

Index Terms: Complex networks, multiobjective clustering, multiobjective evolutionary algorithms.

I. Introduction

Complex networks constitute an efficacious formalism to represent the relationships among objects composing many real-world systems. Collaboration networks, the Internet, the world-wide-web, biological networks, communication and transport networks, and social networks are just some examples. Networks are modeled as graphs, where nodes represent the objects and edges represent the interactions among these objects.

An important problem in the study of complex networks is the detection of community structure [25], also referred to as clustering [21], i.e., the division of a network into groups of nodes, called communities or clusters or modules, having dense intra-connections and sparse inter-connections. This problem, as pointed out in [21], is meaningful only if the graph modeling the network is sparse, i.e., the number of edges is much less than the possible number of edges; otherwise it becomes similar to data clustering [31]. Clustering on graphs differs from data clustering since clusters in graphs are based on edge density, while in data clustering they are groups of points close with respect to a distance or similarity measure. The concept of community in a network, however, is not rigorously defined since its definition is influenced by the application domain of interest. Thus, the intuitive notion that the number of edges inside the same community should be much higher than the number of edges connecting to the remaining nodes of the graph constitutes general guidance for defining a community. This intuitive definition pursues two different objectives: maximizing the internal links and minimizing the external links.

Multiobjective optimization is a problem solving technique that successfully finds a set of solutions when multiple and conflicting objectives must be optimized. These solutions are obtained through the use of Pareto optimality theory [15] and constitute global optimum solutions satisfying all the objectives as best as possible. Evolutionary algorithms have proved successful in solving multiobjective optimization problems because of their population-based nature, which allows the simultaneous production of multiple optima and a good approximation of the Pareto front [5].

Community detection, thus, could be formulated as a multiobjective optimization problem, and the framework of Pareto optimality can provide a set of solutions corresponding to the best compromise among the objectives to optimize. In fact, there is a tradeoff between the two above-mentioned objectives because when the community structure is constituted by the overall network the number of external links is null, thus it is minimized; however, the cluster density is not high.

In the last few years, many approaches have been proposed to employ multiobjective techniques for data clustering. Most of these proposals cluster objects in metric spaces [14], [17], [18], [28], [38], [39], [49], [51], though a method for partitioning graphs has been presented in [8] and a graph clustering algorithm of web user sessions is described in [12].

In this paper, a multiobjective approach, named multiobjective genetic algorithms for networks (MOGA-Net), to discover communities in networks by employing genetic algorithms is proposed. The method optimizes two objective functions introduced in [32] and [44] that both proved effective in detecting modules in complex networks. The first objective function employs the concept of community score to measure the quality of the division in communities of a network. The higher the community score, the more dense the clustering obtained. The second defines the concept of fitness of the nodes belonging to a module and iteratively finds modules having the highest sum of node fitness, in the following referred to as community fitness. When this sum reaches its maximum value, the number of external links is minimized. Both the objective functions have a positive real-valued parameter controlling the size of the communities. The higher the value of the parameter, the smaller the size of the communities found. MOGA-Net

exploits the benefits of these two functions and obtains the communities present in the network by selectively exploring the search space, without the need to know in advance the exact number of groups. This number is automatically determined by the optimal compromise values of the two objectives.

An interesting result of the multiobjective approach is that it returns not a single partitioning of the network, but a set of solutions. Each of these solutions corresponds to a different tradeoff between the two objectives and thus to diverse partitionings of the network consisting of various numbers of clusters. Experiments on synthetic and real life networks showed that the set of Pareto optimal solutions uncovers the hierarchical organization of the network, where solutions with a higher number of clusters are included in solutions having a lower number of communities. This peculiarity of the multiobjective approach gives a great chance to analyze the network at different hierarchical levels and study communities with different modular levels.

This paper is organized as follows. In the next section, the concept of community is defined and the community detection problem is formalized. Section III describes the main approaches to community detection. Section IV formulates the community detection problem as a multiobjective optimization problem. Section V describes the method, the genetic representation adopted, and the variation operators used. In Section VI, the results of the method on synthetic and real life networks and a comparison with some of the state-of-the-art approaches are reported. Section VII, finally, discusses the advantages of the multiobjective approach and concludes this paper.

II. Community Definition

A network N can be modeled as a graph G = (V, E), where V is a set of objects, called nodes or vertices, and E is a set of links, called edges, that connect two elements of V. A community (also called cluster or module) in a network is a group of vertices (i.e., a sub-graph) having a high density of edges within them, and a lower density of edges between groups. This definition of community is rather vague and there is no general agreement on the concept of density. A more formal definition has been introduced in [48] by considering the degree $k_i$ of a generic node i, defined as $k_i = \sum_j A_{ij}$, where A is the adjacency matrix of G. A is such that an entry at position (i, j) is 1 if there is an edge from node i to node j, 0 otherwise. Let S ⊂ G be the subgraph to which node i belongs; the degree of i with respect to S can be split as

$$k_i(S) = k_i^{in}(S) + k_i^{out}(S)$$

where

$$k_i^{in}(S) = \sum_{j \in S} A_{ij}$$

is the number of edges connecting i to the other nodes in S, and

$$k_i^{out}(S) = \sum_{j \notin S} A_{ij}$$

is the number of edges connecting i to the rest of the network.

A subgraph S is a community in a strong sense if

$$k_i^{in}(S) > k_i^{out}(S), \quad \forall i \in S.$$

A subgraph S is a community in a weak sense if

$$\sum_{i \in S} k_i^{in}(S) > \sum_{i \in S} k_i^{out}(S).$$

Thus, in a strong community, each node has more connections within the community than with the rest of the graph. In a weak community, the sum of the degrees within the subgraph is larger than the sum of degrees toward the rest of the network. In the following, we adopt the concept of weak community, thus a community is interpreted as a set of nodes having a total number of intra-connections higher than the number of inter-connections among different clusters.

III. Related Work

Many different algorithms, coming from different fields such as physics, statistics, data mining, and evolutionary computation, have been proposed to detect communities in complex networks. The approaches adopted can broadly be classified into three different types: divisive hierarchical methods, agglomerative hierarchical methods [31], and optimization methods. The divisive hierarchical methods start from the complete network, detect the edges that connect different communities, and remove them. Examples of these approaches can be found in [3], [25], [35], [41], [42], and [48]. Agglomerative approaches consider each node a cluster and then merge similar communities recursively until the whole graph is obtained [4], [34], [40], [45], [47], [58]. Optimization methods define an objective function that allows the division of a graph in subgraphs, and try to maximize this objective in order to obtain the best partitioning of the network [1], [32], [53]. Among the optimization methods, several approaches have been developed by using evolutionary techniques. In particular, [18], [20], [26], [29], [34], [44], [55] applied genetic algorithms. Many other proposals employ multiobjective evolutionary algorithms to partition graphs or cluster objects in metric spaces [8], [12], [14], [17], [28], [38], [39], [49], [51].

In the following, we first review the main proposals coming from physics and data mining fields, and then a description of the multiobjective evolutionary clustering approaches is reported.

A. Community Detection in Networks

The community detection problem has been studied by several researchers, and a complete description of the state-of-the-art proposals is beyond the scope of this paper. Extensive and detailed overviews of community identification methods in complex networks can be found in [6], [21], and [23].

One of the most famous algorithms to detect communities has been presented by Newman and Girvan [25]. The method iteratively splits the network by removing edges. The edges to be removed are chosen by using the betweenness measure. The idea underlying the edge betweenness comes from the

observation that if two communities are joined by a few inter-community edges, then all the paths from vertices in one community to vertices in the other must pass through these edges. Paths determine the betweenness score to compute for the edges. By counting all the paths passing through each edge, and removing the edge scoring the maximum value, the connections inside the network are broken. This process is repeated, thus dividing the network into smaller components until no edges remain.

The same authors in [42] proposed a divisive hierarchical method based on different betweenness measures. In this paper, Newman and Girvan point out the need for a measure of the quality of the network division found by an algorithm. To this end, they introduce the concept of modularity. Informally, the modularity is the fraction of edges inside communities minus the expected value of the fraction of edges, if edges fall at random without regard to the community structure (a formal definition of modularity is given in the next section). Values approaching 1 indicate strong community structure. Thus, the algorithm computes the modularity for each split of the network in communities, and the authors show that, when community structure is known a priori, high values of modularity closely correspond to the expected network division.

Newman [40] argued that since high values of modularity correspond to good network division, an approach to find the best possible partitioning of a network could be to simply optimize it. Thus, he presented an agglomerative hierarchical method that searches for optimal values of modularity. Newman observed that an exhaustive search of all the possible divisions to obtain the optimal value of modularity is unfeasible for networks constituted by more than 20 vertices, thus approximation methods are needed. He proposed a greedy approach that joins communities producing the greatest increase in modularity value. A faster version of the method, based on the same strategy, was described in [4] by Clauset, Newman, and Moore.

Blondel et al. [3] presented a method that partitions large networks based also on the modularity optimization. The algorithm consists of two phases that are repeated iteratively until no further improvement can be obtained. At the beginning, each node of the network is considered a community. Then, for each node i, all its neighbors j are considered, and the gain in modularity for removing i from its community and adding it to the community of j is computed. The node is placed in the community for which the gain is positive and maximum. If no community has positive gain, i remains in its original group. This first phase is repeated until no node move can improve the modularity. The second phase builds a network where the communities obtained are considered as the new nodes, and a link between two communities a, b exists if there is an edge between a node belonging to a and a node belonging to b. The network can be weighted; in such a case the weight of the edge between a and b is the sum of the weights of the links between nodes of the corresponding communities. At this point, the method can be reiterated until no more changes can be done to improve modularity. The algorithm returns all the clusterings found at different hierarchical levels.

Pons and Latapy [45] introduced an agglomerative hierarchical algorithm, named Walktrap, to compute the community structure of a network. The approach is based on the concept of random walk on a graph and on the idea that random walks tend to get trapped in densely connected parts of the graph. A new definition of distance between two nodes is introduced by exploiting the properties of random walks, and this definition is generalized to compute the distance between communities. The algorithm thus starts from a partition of the graph in which each node is a community, and then merges the two adjacent communities (i.e., having at least a common edge) that minimize the mean of the squared distances between each vertex and its community. The distances between communities are recomputed and the previous step is repeated until all the nodes belong to the same community. In order to decide the best partitioning to choose, the modularity criterion of Newman and Girvan is adopted.

Pujol et al. [47] proposed an agglomerative hierarchical method that combines spectral analysis and modularity optimization to obtain efficiency and accuracy in clustering a network. They used the same concept of random walk adopted by Pons and Latapy [45] to produce an initial partition of the network, then an agglomerative hierarchical method that iteratively joins two communities is applied. In order to merge two clusters, the group of nodes that gives the least contribution to the total modularity is selected and it is joined with the group that maximizes the increment of modularity.

Lancichinetti et al. [32] proposed a method to detect overlapping and hierarchical community structure based on the concept of community fitness of a module S. Let $k_i^{in}(S)$ and $k_i^{out}(S)$ be the internal and external degrees of the nodes belonging to a community S. The community fitness P(S) of S is then defined as follows:

$$P(S) = \sum_{i \in S} \frac{k_i^{in}(S)}{(k_i^{in}(S) + k_i^{out}(S))^{\alpha}}$$

where α, called resolution parameter, is a positive real-valued parameter controlling the size of the communities. When $k_i^{out}(S) = 0 \; \forall i$, P(S) reaches its maximum value for a fixed α. The community fitness has been used by [32] to find communities one at a time. The authors introduced the concept of node fitness with respect to a community S as the variation of the community fitness of S with and without the node i, that is

$$P_i(S) = P(S \cup \{i\}) - P(S - \{i\}).$$

The method starts by picking a node at random, and considering it as a community S. Then a loop over all the neighbor nodes of S not included in S is performed, in order to choose the neighbor node to be added to S. The choice is done by computing the node fitness for each node, and augmenting S with the node having the highest value of fitness. At this point the fitness of each node is recomputed, and if a node turns out to have a negative fitness value it is removed from S. The process stops when all the not-yet-included neighboring nodes of the nodes in S have a negative fitness. Once a community has been obtained, a new node is picked and the process restarts until all the nodes have been assigned to at

least one group. The authors found that the partitions obtained for the resolution parameter α = 1 are relevant. However, they introduced a criterion to choose a partition based on the concept of stability. A partition is considered stable if it is delivered for a range of values of α. The length of this range determines the more stable partition, which is deemed the best result.

B. Multiobjective Clustering Methods

The application of multiobjective optimization to clustering data has recently obtained an increasing interest [14], [17], [28], [38], [39], [49], [51], though few proposals regard the partitioning of networks [8], [12].

A reference approach to multiobjective clustering algorithms for numerical and categorical data is that proposed by Handl and Knowles [28], and named multiobjective clustering with automatic K-determination (MOCK). The first objective of MOCK is to minimize the overall deviation of a partitioning, i.e., the summed distances between data items and the center of the cluster to which they have been assigned. The second objective is the minimization of the cluster connectedness, which evaluates for each cluster data point how many of its nearest neighbors have been placed in the same cluster. The algorithm adopts the locus-based adjacency representation proposed by Park and Song [43], described in the next sections and employed also by MOGA-Net, and uses a special initialization of the solutions based on the minimum spanning tree that reduces execution times. MOCK contains also a final step for selecting the best solution from the Pareto front approximations that automatically delivers the optimal number of clusters.

MOCK is not specialized for partitioning networks, though it can be adapted to clustering on graphs by considering the adjacency matrix of a network as a (dis)similarity matrix.

A proposal for graph partitioning that optimizes three different objectives was proposed by Datta et al. [8]. The objectives minimize the net loss in edge values when two connected nodes are placed in different groups, the difference in size of the groups, and the spread of clusters. The authors emphasized the concept of zone in the graph, intended as a group of adjacent nodes. Thus a chromosome is a collection of nodes, where each node is specified by its location in the graph. The algorithm is able to divide the graph in a variable number of zones; however, the range of zones and of the number of nodes per zone must be fixed as input parameters.

More recently, a multiobjective evolutionary algorithm, specialized for clustering Web user sessions, has been proposed by Nildem et al. [12]. The clusters obtained are then used in a Web recommendation system for representing usage patterns. The sequences of Web pages visited by a user are represented as a weighted undirected graph where each sequence is a node, and the weight of an edge connecting two sequences is the computed similarity between the two nodes. Their algorithm, named GraSC, uses the same representation of MOCK, but the conflicting objectives to optimize are the min-max cut [13] and the silhouette index [50]. The former tries to optimize the intra-cluster similarity and to minimize the inter-sub-graph similarity, the latter computes the average silhouette index of vertices belonging to the same cluster. The silhouette index of a node i is the normalized value of the difference between the minimum average dissimilarity between node i and the nodes of the other clusters, and the average dissimilarity among i and the vertices in the same cluster.

In the next section, the community detection problem is formalized as a multiobjective optimization problem.

IV. Community Detection as a Multiobjective Optimization Problem

Many problems in different fields are naturally formulated with multiple objectives. In particular, the division of a network in subgroups of nodes having dense intra-connections and sparse interconnections has two competing objectives. The first is to maximize the links among the nodes belonging to the same module, the second is to minimize the number of connections between the communities. Thus, the problem of community detection cannot adequately be represented as a single objective augmented with constraints to try to implicitly satisfy the other. A more suitable approach is to formalize this problem as a multiobjective clustering problem [19], [28].

A multiobjective clustering problem $(\Omega, F_1, F_2, \ldots, F_t)$ is defined as

$$\min F_i(S), \quad i = 1, \ldots, t, \quad \text{subject to } S \in \Omega$$

where $\Omega = \{S_1, \ldots, S_k\}$ is the set of feasible clusterings of a network, and $F = \{F_1, F_2, \ldots, F_t\}$ is a set of t single criterion functions. Each $F_i : \Omega \to \mathbb{R}$ is a different objective function that determines the feasibility of the clustering obtained.

Since F is a vector of competing objectives that must be simultaneously optimized, there is not one unique solution to the problem, but a set of solutions is found through the use of Pareto optimality theory [15]. Given two solutions $S_1$ and $S_2 \in \Omega$, solution $S_1$ is said to dominate solution $S_2$, denoted as $S_1 \prec S_2$, if and only if

$$\forall i : F_i(S_1) \le F_i(S_2) \;\wedge\; \exists i \text{ s.t. } F_i(S_1) < F_i(S_2).$$

A dominated solution is not interesting because an improvement can be attained in all the objectives. Instead, a nondominated solution is one in which an improvement in one objective requires a degradation of another. Multiobjective optimization aims at the generation and selection of nondominated solutions; these solutions are called Pareto-optimal. The goal is therefore to construct the Pareto optima. More formally, the set of Pareto-optimal solutions $\Pi$ is defined as

$$\Pi = \{ S \in \Omega : \nexists S' \in \Omega \text{ with } S' \prec S \}.$$

The vector F maps the solution space into the objective function space. When the nondominated solutions are plotted in the objective space, they are called the Pareto front. Thus, the Pareto front represents the better compromise solutions satisfying all the objectives as best as possible. It is worth noting that the Pareto-optimal solutions, as outlined in [28], always include the optimal solutions of the clustering problems with a single objective to optimize.
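As an illustration of the dominance relation and of the Pareto-optimal set defined in this section, the following minimal Python sketch filters a list of objective vectors, all assumed to be minimized. The function names `dominates` and `pareto_front` and the sample vectors are hypothetical and are not taken from the paper; the quadratic-time filter is only meant to mirror the definition, not to suggest how MOGA-Net ranks individuals.

```python
from typing import List, Sequence


def dominates(f1: Sequence[float], f2: Sequence[float]) -> bool:
    """True if objective vector f1 dominates f2 (every objective is minimized)."""
    no_worse = all(a <= b for a, b in zip(f1, f2))
    strictly_better = any(a < b for a, b in zip(f1, f2))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical objective vectors (F1, F2) of four candidate clusterings.
objective_vectors = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(objective_vectors)
```

Here the third vector is dominated by the second, while the other three are mutually nondominated and represent different tradeoffs between the two objectives.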

A. Objective Functions

Our aim is to partition a network in groups of vertices $\{S_1, \ldots, S_k\}$ such that the density of edges within them is higher than the density of edges between the groups. To this end, we need an objective function that maximizes the number of connections inside each community, and another objective function that minimizes the number of links between the modules found.

A quality measure of a community S that maximizes the in-degree of the nodes belonging to S has been introduced in [44]. On the other hand, a criterion that minimizes the out-degree of a community is defined in [32]. Both the approaches adopt the definition of weak community described above. We now first recall the definitions of these measures, and then show how they can be exploited in a multiobjective approach to find communities. In the following, without loss of generality, the graph modeling a network is assumed to be undirected.

Let $\mu_i$ denote the fraction of edges connecting node i to the other nodes in S. More formally

$$\mu_i = \frac{1}{|S|} k_i^{in}(S)$$

where |S| is the cardinality of S.

The power mean of S of order r, denoted as M(S), is defined as

$$M(S) = \frac{\sum_{i \in S} (\mu_i)^r}{|S|}.$$

Notice that, in the computation of M(S), since $0 \le \mu_i \le 1$, the exponent r increases the weight of nodes having many connections with other nodes belonging to the same module, and diminishes the weight of those nodes having few connections inside S.

The volume $v_S$ of a community S is defined as the number of edges connecting vertices inside S, i.e., the number of 1 entries in the adjacency sub-matrix of A corresponding to S

$$v_S = \sum_{i,j \in S} A_{ij}.$$

The score of S is defined as $score(S) = M(S) \times v_S$. Thus, the score takes into account both the fraction of interconnections among the nodes (through the power mean) and the number of interconnections contained in the module S (through the volume). The community score of a clustering $\{S_1, \ldots, S_k\}$ of a network is defined as

$$CS = \sum_{i=1}^{k} score(S_i).$$

The first objective to maximize is then the community score CS.

As described in Section III, Lancichinetti et al. [32] introduced the concept of community fitness of a module S as

$$P(S) = \sum_{i \in S} \frac{k_i^{in}(S)}{(k_i^{in}(S) + k_i^{out}(S))^{\alpha}}.$$

The second objective is thus carried out by the community fitness by summing up the fitnesses of all the $S_i$ modules. The parameter α, which tunes the size of the communities, has been set to 1 because, as the authors observed, in most cases the partitionings found for this value are relevant. The second objective to minimize is thus

$$\sum_{i=1}^{k} P(S_i).$$

In the next section, we propose a multiobjective community detection approach that optimizes both these objectives.

V. Algorithm Description

In this section, we give a description of the multiobjective algorithm MOGA-Net, the representation adopted for partitioning the network, and the variation operators used. In the last few years many efforts have been devoted to the application of evolutionary computation to the development of multiobjective optimization algorithms. Evolutionary algorithms, in fact, proved to be very successful in solving multiobjective optimization problems because of the population-based nature of the approach, which allows the generation of several elements of the Pareto set in a single run [5], [10].

A. Genetic Representation

Our clustering algorithm uses the locus-based adjacency representation proposed in [43] and employed by [28] and [38] for multiobjective clustering. In this graph-based representation, an individual of the population consists of N genes $g_1, \ldots, g_N$ and each gene can assume allele value j in the range {1, ..., N}. Genes and alleles represent nodes of the graph G = (V, E) modeling a network N, and a value j assigned to the ith gene is interpreted as a link between the nodes i and j of V. This means that in the clustering solution found i and j will be in the same cluster. A decoding step, however, is necessary to identify all the separate components of the corresponding graph. The nodes participating in the same component are assigned to one cluster. As observed in [28], the decoding step can be done in linear time. A main advantage of this representation is that the number k of clusters is automatically determined by the number of components contained in an individual and determined by the decoding step. Fig. 1(a) shows a network of ten nodes partitioned in two groups. The nodes of the two partitions are depicted as circles and squares, respectively. Among the possible encoded genotypes, the one shown in Fig. 1(b) is decoded into the two connected components reported in Fig. 1(c). These two components correspond to the partitioning of the graph.

B. Initialization

The initialization process takes into account the effective connections of the nodes in the network. A random individual is generated. However, if in the ith position there is an allele value j, but the edge (i, j) does not exist, the value j is substituted with one of the neighbors of i. For example, in Fig. 2(a) in the positions 3 and 10 the corresponding allele values are 9 and 5, respectively. However, the edges (3, 9) and (10, 5) are not present in the network shown in Fig. 1(a), thus 9 is substituted by 4, and 5 is substituted by 7.
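The locus-based representation and the biased initialization described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the toy graph `GRAPH` is hypothetical (it is not the network of Fig. 1), a genotype is stored as a dictionary mapping each node i to its allele j, and the decoding uses a union-find structure, which is near-linear rather than strictly linear. The names `repair`, `random_individual`, and `decode` are illustrative only.

```python
import random
from typing import Dict, List, Set

# Hypothetical toy graph (NOT the network of Fig. 1): node -> set of neighbors.
GRAPH: Dict[int, Set[int]] = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}


def repair(genotype: Dict[int, int], graph: Dict[int, Set[int]],
           rng: random.Random) -> Dict[int, int]:
    """Replace allele j of gene i by a random neighbor of i when (i, j) is not an edge."""
    return {i: (j if j in graph[i] else rng.choice(sorted(graph[i])))
            for i, j in genotype.items()}


def random_individual(graph: Dict[int, Set[int]], rng: random.Random) -> Dict[int, int]:
    """Draw random alleles, then repair those that do not correspond to an edge."""
    nodes = sorted(graph)
    raw = {i: rng.choice(nodes) for i in nodes}
    return repair(raw, graph, rng)


def decode(genotype: Dict[int, int]) -> List[Set[int]]:
    """Clusters = connected components of the graph with one link i - genotype[i] per gene."""
    parent = {i: i for i in genotype}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in genotype.items():
        parent[find(i)] = find(j)
    groups: Dict[int, Set[int]] = {}
    for i in genotype:
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)


clusters = decode({1: 2, 2: 3, 3: 1, 4: 5, 5: 6, 6: 4})
```

The number of clusters is simply `len(clusters)`, which reflects the point made above that k does not have to be fixed in advance.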

Fig. 3. Example of uniform crossover.

the neighbors of gene i. For example, considering the network


of Fig. 1(a), the allowed allele values of the gene in the third
position are 2, 4, 5, 6. This mutation guarantees the generation
of a mutated child in which each node is linked only with one
of its neighbors.

E. Model Selection
Multiobjective clustering returns the set of Pareto-optimal
solutions. Each of these solutions corresponds to a different
tradeoff between the two objectives and thus to diverse parti-
tioning of the network consisting of various numbers of clus-
ters. This gives a great chance to analyze several clusterings
at different hierarchical levels. However, a criterion should be
established to automatically select one solution with respect
to another. To this end, we adopt the concept of modularity,
introduced by Newman and Girvan [42]. Modularity is the most widely used and best-known function for assessing the quality of a partitioning obtained by a clustering method. Let k be the number of modules found inside a network. The modularity is defined as

$$Q = \sum_{s=1}^{k} \left[ \frac{l_s}{m} - \left( \frac{d_s}{2m} \right)^2 \right]$$

where $l_s$ is the total number of edges joining vertices inside the module s, $d_s$ is the sum of the degrees of the nodes of s, and m is the total number of edges in the network. The first term of each summand of the modularity Q is the fraction of edges inside a community, the second one is the expected value of the fraction of edges that would be inside it if edges fell at random without regard to the community structure. Values approaching 1 indicate strong community structure. We thus select, among the solutions found on the Pareto front, the one having the highest value of modularity.
Fig. 4 reports the pseudo-code of MOGA-Net. Given a network N and the graph G modeling it, MOGA-Net starts with a population initialized at random. Every individual generates a graph structure in which each component is a connected subgraph of G. For a fixed number of generations the multiobjective genetic algorithm evaluates the objective values, assigns a rank to each individual according to Pareto dominance, and sorts them. Then a new population is generated by applying the specialized variation operators (uniform crossover and mutation). At the end of the procedure, MOGA-Net returns, among the set of solutions contained in the Pareto front, the one having the highest value of modularity. In the next section, experimental results will prove the ability of MOGA-Net in partitioning a network, and we show that the Pareto optimal solutions exhibit a hierarchical structure in which solutions with a higher number of communities are contained in solutions having a lower number of modules.

Fig. 1. (a) Network of ten nodes partitioned in two communities {1, 2, 3, 4, 5, 6} and {7, 8, 9, 10}. (b) Locus-based representation of a genotype. (c) Graph-based structure of the genotype.

Fig. 2. (a) Genotype where the couples (3, 9) and (10, 5) are not edges of the graph reported in Fig. 1(a). (b) Modified genotype.

C. Uniform Crossover
MOGA-Net uses a standard uniform crossover operator. First, a crossover mask of length N, i.e., the number of nodes, is randomly generated. Each value on the mask is either 0 or 1. An offspring is generated by selecting from the first parent the genes where the mask is a 0, and from the second parent the genes where the mask is a 1. The main motivation for using uniform crossover is that it guarantees the maintenance of the effective connections of the nodes in the network in the child individual. In fact, because of the biased initialization, each individual in the population is such that if a gene i contains a value j, then the edge (i, j) exists. Since the child at each position i contains a value j coming from one of the two parents, the edge (i, j) exists. Fig. 3 shows an example of uniform crossover.

D. Mutation
A mutation operator that randomly changes the value j of the ith gene causes a useless exploration of the search space, because of the same observations on node connections made above. Thus, the possible values an allele can assume are restricted to the neighbors of gene i.
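The two variation operators of Sections C and D can be sketched as follows. This is a minimal illustration, not the MATLAB implementation used in the paper: the 6-node graph, the 0-based node numbering, the helper names (`uniform_crossover`, `neighbor_mutation`, `is_safe`), and the choice to apply the mutation rate per gene are all assumptions made for the example.

```python
import random

def uniform_crossover(parent_a, parent_b, rng):
    """Build a child gene by gene from a random 0/1 mask of length N."""
    mask = [rng.randint(0, 1) for _ in parent_a]
    # mask 0 -> gene from the first parent, mask 1 -> gene from the second
    return [b if bit else a for a, b, bit in zip(parent_a, parent_b, mask)]

def neighbor_mutation(genotype, adjacency, rate, rng):
    """With probability `rate` (assumed per gene), relink node i to one of its neighbors."""
    child = list(genotype)
    for i in range(len(child)):
        if rng.random() < rate:
            child[i] = rng.choice(adjacency[i])
    return child

def is_safe(genotype, adjacency):
    """True if every gene i holds a value j such that edge (i, j) exists."""
    return all(j in adjacency[i] for i, j in enumerate(genotype))

# Hypothetical graph: two triangles joined by the edge (2, 3).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
parent_a = [1, 2, 0, 4, 5, 3]  # locus-based genotypes: gene i stores a neighbor of i
parent_b = [2, 0, 3, 2, 3, 4]

rng = random.Random(0)
child = uniform_crossover(parent_a, parent_b, rng)
mutant = neighbor_mutation(child, adjacency, rate=0.2, rng=rng)
print(child, is_safe(child, adjacency))
print(mutant, is_safe(mutant, adjacency))
```

Because both parents only store existing edges, any mask yields a child that also stores only existing edges, and restricting mutated alleles to neighbors preserves the same property.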

Fig. 4. Pseudo-code of the MOGA-Net algorithm.
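The final model-selection step, picking the Pareto-front solution with the highest modularity Q, can be illustrated with a short sketch. The toy graph, the candidate partitions standing in for a Pareto front, and the function names are assumptions for the example; only the formula for Q follows the definition given in Section E.

```python
def modularity(edges, communities):
    """Q = sum over modules s of l_s/m - (d_s/(2m))^2 for an undirected graph."""
    m = len(edges)
    label = {v: s for s, nodes in enumerate(communities) for v in nodes}
    inner = [0] * len(communities)   # l_s: edges with both endpoints in module s
    degree = [0] * len(communities)  # d_s: sum of the degrees of the nodes of s
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inner[label[u]] += 1
    return sum(inner[s] / m - (degree[s] / (2 * m)) ** 2
               for s in range(len(communities)))

def select_by_modularity(edges, candidate_partitions):
    """Return the candidate partition with the highest modularity."""
    return max(candidate_partitions, key=lambda p: modularity(edges, p))

# Hypothetical graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
candidates = [
    [{0, 1, 2, 3, 4, 5}],               # one module
    [{0, 1, 2}, {3, 4, 5}],             # two modules
    [{0}, {1}, {2}, {3}, {4}, {5}],     # all singletons
]
best = select_by_modularity(edges, candidates)
print(best, modularity(edges, best))
```

For the two-triangle partition each module has $l_s = 3$ and $d_s = 7$ with m = 7, so Q = 2(3/7 − 1/4) = 5/14, which is the largest value among the three candidates.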

VI. Experimental Results
In this section, we study the effectiveness of our approach on a synthetic data set. Then we compare the results obtained by MOGA-Net with other state-of-the-art approaches on some real-world networks for which the partitioning in communities is known. In both cases, we show that our algorithm successfully detects the network structure and is competitive with the other approaches.
The MOGA-Net algorithm has been written in MATLAB, using the Genetic Algorithms and Direct Search Toolbox 2. The multiobjective genetic algorithm (MOGA) we used is the nondominated sorting genetic algorithm (NSGA-II) proposed by Deb et al. [11] and implemented in the GA Toolbox of MATLAB. NSGA-II builds a population of competing individuals and ranks them on the basis of nondominance (for a detailed description of the approach see [10]). It is known that setting parameter values is a challenging research problem in evolutionary algorithms [16]. Recently, Smit and Eiben [54] found that it is possible to find good parameter values for a set of problems, but general tuning that allows for good performance on a wide range of problems raises specific difficulties. As regards MOGA-Net, we employed a trial-and-error procedure and then selected the parameter values giving good results for the benchmark data sets. Thus, we set the crossover rate to 0.8, the mutation rate to 0.2, elite reproduction to 10% of the population size, and used the roulette selection function. The population size was 300 and the number of generations 100.

A. Evaluation Metrics
Community detection methods are supposed to identify good partitions [21]. In order to determine what a good partition means, validity indices must be defined to assess the quality of the results obtained by an algorithm. A validity index, also called quality function, is a function that assigns a score to each partition of a network. The higher the score, the better the partition obtained. Validity indices can be internal, i.e., they rely on the connections and separation between the communities, or external, through the use of additional domain knowledge to assess the clustering outcomes. The most popular internal quality function is the modularity of Newman and Girvan, described in the previous section, thus it has been used as internal validity index. On the other hand, the normalized mutual information (NMI) is an external measure to estimate the similarity between the true partitions and the detected ones, which has been shown by Danon et al. [7] to be more appropriate for network partitioning.
The normalized mutual information is a well-known entropy measure in information theory [37]. Given two partitions A and B of a network in communities, let C be the confusion matrix whose element $C_{ij}$ is the number of nodes of community i of the partition A that are also in the community j of the partition B. The normalized mutual information I(A, B) is defined as follows:

$$I(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} C_{ij} \log\left( C_{ij} N / (C_{i.} C_{.j}) \right)}{\sum_{i=1}^{c_A} C_{i.} \log\left( C_{i.} / N \right) + \sum_{j=1}^{c_B} C_{.j} \log\left( C_{.j} / N \right)}$$

where $c_A$ ($c_B$) is the number of groups in the partition A (B), $C_{i.}$ ($C_{.j}$) is the sum of the elements of C in row i (column j), and N is the number of nodes. If A = B, I(A, B) = 1. If A and B are completely different, I(A, B) = 0.

B. Synthetic Data Set
In order to check the ability of our approach to successfully detect the community structure of a network, we use the benchmark proposed by Lancichinetti et al. [33], which is an extension of the classical benchmark proposed by Girvan and Newman [25]. The network consists of 128 nodes divided into four communities of 32 nodes each. Every node has an average degree of 16 and shares a fraction 1 − γ of links with the nodes of its community, and γ with the other nodes of the network. γ is called the mixing parameter. When γ < 0.5 the neighbors of a node inside its group are more than the neighbors belonging to the other three groups. We generated ten different networks for values of γ ranging from 0.1 to 0.5, and used the normalized mutual information to measure the similarity between the true partitions and the detected ones.
Figs. 5 and 6 show the normalized mutual information and modularity, averaged over the ten runs, for different values of the exponent r when the mixing parameter γ increases from 0.1 to 0.5. Fig. 5 points out that, independently of the value of r, MOGA-Net is able to recover more than 80%

of the community structure when, for each node, the number of neighbors toward other groups is lower than that inside its group (γ ≤ 0.2). However, when the mixing parameter increases, higher values of r help in the retrieval of the true community structure. Notice that for γ = 0.5, each node has half of the links inside its community and the other half with the rest of the network, thus it is very difficult to identify the hidden groups, since the communities are mixed with each other. As expected, the modularity values of the communities obtained reflect the corresponding normalized mutual information.

Fig. 5. Normalized mutual information obtained by MOGA-NET on the synthetic network for different values of the exponent r when the mixing parameter varies from 0.1 to 0.5.

Fig. 6. Modularity obtained by MOGA-NET on the synthetic network for different values of the exponent r when the mixing parameter varies from 0.1 to 0.5.

C. Real-Life Data Sets
We now show the application of MOGA-Net on four real-world networks, the Zachary's Karate Club, the Bottlenose Dolphins, the American College Football, and the Krebs' books on American politics, well studied in the literature (see http://www-personal.umich.edu/~mejn/netdata/), and compare our results with those obtained by three algorithms coming from network analysis, Blondel et al. [3] (referred to as BGLL), Clauset et al. [4] (referred to as CNM), and Pons and Latapy [45] (referred to as PL), and two others coming from the evolutionary computation field that apply multiobjective optimization, Handl and Knowles [28] (MOCK), and Nildem et al. [12] (GraSC). In the following, we first report a brief description of each data set used.
The Zachary's Karate Club network was generated by Zachary, who studied the friendship of 34 members of a karate club over a period of two years. During this period, because of disagreements, the club split into two groups of almost the same size.
Bottlenose Dolphins is a social network of 62 bottlenose dolphins living in Doubtful Sound, New Zealand, compiled by Lusseau [36] from seven years of dolphin behavior. A tie between two dolphins was established by their statistically significant frequent association. The network split naturally into two large groups, the number of ties being 159.
The American College Football network [25] comes from United States college football. The network represents the schedule of Division I games during the 2000 season. Nodes in the graph represent teams and edges represent the regular season games between the two teams they connect. The teams are divided in conferences. The teams, on average, played four inter-conference matches and seven intra-conference matches, thus teams tend to play against members of the same conference. The network consists of 115 nodes and 616 edges grouped in 12 conferences.
Krebs' books on American politics is a network of political books compiled by V. Krebs. The nodes represent 105 recent books on American politics bought from Amazon.com, and edges join pairs of books frequently purchased by the same buyer [41]. Books were divided by Newman [41] according to their political alignment (conservative or liberal), except for a small number of books (13) having no clear affiliation.
All the algorithms have been executed ten times. As regards the algorithms of Clauset et al. [4], Blondel et al. [3], and Pons and Latapy [45], at each run the solution having the best modularity value is selected and the corresponding NMI value is computed. As regards MOCK [28], GraSC [12], and MOGA-Net, each run generates a set of solutions, those of the Pareto front. Among these optimal solutions we adopted the same selection criterion, thus the solution having the maximum modularity value is chosen and the corresponding NMI computed. The average and standard deviation values over these ten runs of both modularity and normalized mutual information are calculated and reported in Tables I and II. In MOGA-Net the value of the parameter r for the computation of the community score has been set to 2 because we observed experimentally that the communities found are relevant. However, it is worth noting that the multiobjective approach implicitly explores the search space by finding solutions that could be obtained for different values of r.
The tables clearly show the very good performance of MOGA-Net with respect to the other approaches. In fact,

TABLE I
Best Modularity Results and Corresponding Normalized Mutual Information Obtained by MOGA-Net and the Other Algorithms for the Real-Life Networks Zachary's Karate Club and Bottlenose Dolphins (Standard Deviations in Parentheses)

| Method | Karate Club: Modularity | Karate Club: NMI | Dolphins: Modularity | Dolphins: NMI |
|---|---|---|---|---|
| MOGA-Net | 0.416 (0.740e-16) | 0.602 (0.117e-15) | 0.505 (0.0095) | 0.506 (0.0468) |
| BGLL (Blondel et al. [3]) | 0.415 | 0.707 | 0.495 | 0.450 |
| CNM (Clauset et al. [4]) | 0.380 | 0.692 | 0.495 | 0.573 |
| PL (Pons and Latapy [45]) | 0.394 | 0.562 | 0.517 | 0.675 |
| MOCK (Handl and Knowles [28]) | 0.326 (0.0347) | 0.549 (0.1203) | 0.419 (0.0271) | 0.437 (0.0805) |
| GraSC (Nildem et al. [12]) | 0.120 (0.0292) | 0.198 (0.0217) | 0.073 (0.0106) | 0.096 (0.0333) |

TABLE II
Best Modularity Results and Corresponding Normalized Mutual Information Obtained by MOGA-Net and the Other Algorithms for the Real-Life Networks American College Football and Krebs' Books (Standard Deviations in Parentheses)

| Method | Football: Modularity | Football: NMI | Krebs' Books: Modularity | Krebs' Books: NMI |
|---|---|---|---|---|
| MOGA-Net | 0.515 (0.0161) | 0.775 (0.0234) | 0.518 (0.0044) | 0.537 (0.0251) |
| BGLL (Blondel et al. [3]) | 0.601 | 0.926 | 0.515 | 0.442 |
| CNM (Clauset et al. [4]) | 0.577 | 0.762 | 0.502 | 0.530 |
| PL (Pons and Latapy [45]) | 0.602 | 0.879 | 0.515 | 0.543 |
| MOCK (Handl and Knowles [28]) | 0.454 (0.0608) | 0.721 (0.0648) | 0.437 (0.0081) | 0.302 (0.1393) |
| GraSC (Nildem et al. [12]) | 0.285 (0.2900) | 0.447 (0.3866) | 0.036 (0.0391) | 0.078 (0.0192) |
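
The NMI columns of Tables I and II rely on the definition of I(A, B) given in Section VI-A. A minimal sketch of that formula is shown below, assuming each partition is given as a list of community labels, one per node. The function name `nmi`, the toy label vectors, and the convention of returning 1 when both partitions consist of a single community (where the denominator vanishes) are assumptions of this example.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information I(A, B) computed from the confusion matrix C."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # C_ij, nonzero entries only
    row = Counter(labels_a)                   # C_i. (row sums)
    col = Counter(labels_b)                   # C_.j (column sums)
    num = -2.0 * sum(c * math.log(c * n / (row[i] * col[j]))
                     for (i, j), c in joint.items())
    den = (sum(c * math.log(c / n) for c in row.values())
           + sum(c * math.log(c / n) for c in col.values()))
    if den == 0:
        return 1.0  # assumed convention: both partitions are a single community
    return num / den

true_partition = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
independent_a = [0, 0, 1, 1]
independent_b = [0, 1, 0, 1]

print(nmi(true_partition, same_up_to_renaming))
print(nmi(independent_a, independent_b))
```

Entries with $C_{ij} = 0$ contribute nothing to the numerator, which is why iterating over the nonzero entries alone is sufficient.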

though the algorithm of Pons and Latapy [45] obtains a slightly better modularity value on the Dolphins network (0.517 versus 0.505) and American College Football (0.602 versus 0.536), the solutions found by MOGA-Net are comparable on these two data sets and better on the other two. It is worth noting that the multiobjective methods MOCK and GraSC are not able to reveal the community structure. However, this is understandable, since the objectives they optimize are not very relevant to the problem of community detection.
Often the best modularity does not correspond to the true network partition. To show that MOGA-Net is effective in discovering the actual network structure, over the ten runs, instead of choosing the partitioning having the best modularity value, we selected that having the best NMI value, and computed the corresponding modularity. The average values over these ten runs are reported in Table III. The table reports the average of the best NMI (avg best NMI) and its standard deviation (std best NMI), the average modularity value (avg Mod) corresponding to the solutions having the best NMI and its standard deviation (std Mod).
The table shows that on the Zachary's Karate Club and Bottlenose Dolphins MOGA-Net found the exact solution for all the ten runs with a modularity value of 0.371 and 0.373, respectively. On the Krebs' books network again MOGA-Net obtained the partitioning more similar to the true one, while on the Football network the average best NMI is lower with respect to BGLL and PL.

TABLE III
Best NMI Results Obtained by MOGA-Net on the Real-Life Data Sets

| Network | Avg Best NMI | Std Best NMI | Avg Mod | Std Mod |
|---|---|---|---|---|
| Zachary's Karate Club | 1 | 0 | 0.371 | 0 |
| Bottlenose Dolphins | 1 | 0 | 0.373 | 0 |
| American College Football | 0.795 | 0.016 | 0.497 | 0.027 |
| Krebs' books | 0.597 | 0.014 | 0.470 | 0.021 |

D. Comparing the Multiobjective Solutions
When dealing with multiobjective optimization, an important aspect to consider is the evaluation of the solutions obtained by an algorithm. In this section, the performances of MOGA-Net and MOCK are compared with respect to a metric specialized to assess the quality of the outcomes produced by multiobjective optimization methods. Zitzler et al. [59] argued that results of a multiobjective method should meet three main goals. The distance of the Pareto front generated by the algorithm from the optimal Pareto front should be minimized, the solutions should be uniformly distributed over the solution space, and the number of elements of the Pareto optimal set should be maximized. Metrics that try to measure the first goal, like error rate [57] and generational distance [56], or all three goals, like space covered [60], assume the knowledge of the Pareto optimal front, which may not be available for real-life problems. Zitzler and Thiele [60] also proposed a metric, named coverage metric, that evaluates whether the outcomes of an algorithm dominate the results of another algorithm. This metric is not apt to compare MOGA-Net and MOCK since the objectives optimized by the two methods are not the same. Schott [52] introduced a metric called spacing that measures the distribution of the solutions over the nondominated front. Spacing between solutions is computed as

$$S = \sqrt{\frac{1}{|Q|} \sum_{i=1}^{|Q|} \left( d_i - \bar{d} \right)^2}$$

Fig. 7. (a) Pareto front of one run. (b) Network corresponding to the exact solution [node number (3) on the Pareto front]. (c) Network corresponding to
(6). (d) Network corresponding to (8).


where $d_i = \min_{k \in Q,\, k \neq i} \sum_{m=1}^{M} | f_m^i - f_m^k |$ and $f_m^i$ ($f_m^k$, resp.) is the mth objective value of the ith (kth) solution in the nondominated solution set Q. $\bar{d}$ is the mean value of all the $d_i$. The nearer the value of S to zero, the more uniformly distributed the solutions found over the Pareto-optimal front. When the values of the objectives vary widely, a normalization of these values is necessary to avoid wrong results. To this end, the term $| f_m^i - f_m^k |$ is divided by $| F_m^{max} - F_m^{min} |$, where $F_m^{max}$ and $F_m^{min}$ are the maximum and minimum values of the mth objective.
This measure fails to capture the distribution when there is a large gap between two nondominated solutions. To overcome this problem, Bandyopadhyay et al. [2] defined a modified measure, called minimal spacing (in the following referred to as MS), that considers the distance from a solution to the nearest neighbor not already considered.
Table IV shows the average minimal spacing values and the corresponding standard deviation over the ten runs obtained by MOGA-Net and MOCK. The table points out that the nondominated solutions found by MOGA-Net are distributed more uniformly than those obtained by MOCK. In fact, the average MS is much lower than that computed for MOCK.

TABLE IV
Minimal Spacing Values Obtained by MOGA-Net and MOCK on the Real-Life Data Sets

| Network | MOGA-Net Avg MS | MOGA-Net Std MS | MOCK Avg MS | MOCK Std MS |
|---|---|---|---|---|
| Zachary's Karate Club | 0.0201 | 0.0032 | 0.0788 | 0.0231 |
| Bottlenose Dolphins | 0.0096 | 0.0016 | 0.01903 | 0.0039 |
| American College Football | 0.0075 | 0.0014 | 0.0179 | 0.0043 |
| Krebs' books | 0.0128 | 0.0042 | 0.0188 | 0.0073 |

E. Hierarchical Pareto Front Solutions
As already observed, the solutions of the Pareto front have a hierarchical structure that allows the analysis of the network at different organization levels. To show this characteristic, Fig. 7(a) displays the Pareto front in one out of the ten runs for the Zachary's Karate Club, and the networks (3), (6), and (8) corresponding to the best value of NMI [solution (3)] and the best two values of modularity [(6), (8)]. Network (3), visualized in Fig. 7(b), corresponds to the true partitioning of the Zachary's Karate Club in two groups. These two main groups, actually, could be split into tighter sub-groups.
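
The containment relation between Pareto-front solutions, where each community of a finer partition lies inside a single community of a coarser one, can be checked with a few lines of code. The sketch below uses the node lists reported for Fig. 7(c); the function name `is_refinement` is hypothetical, and the second community of the two-group partition is assumed to be simply the remaining nodes of the 34-member network.

```python
def is_refinement(fine, coarse):
    """True if every community of `fine` is contained in one community of `coarse`."""
    coarse_sets = [set(c) for c in coarse]
    return all(any(set(f) <= c for c in coarse_sets) for f in fine)

# Sub-groups reported for network (6) of Fig. 7(c).
blue_squares = {1, 2, 3, 4, 8, 12, 13, 14, 18, 20, 22}
pink_triangles = {5, 6, 7, 11, 17}

# Two-group partition of network (3): the left community is the union of the
# two sub-groups; the right one is assumed to be the complement among 34 nodes.
left = blue_squares | pink_triangles
right = set(range(1, 35)) - left

two_groups = [left, right]
three_groups = [blue_squares, pink_triangles, right]

print(is_refinement(three_groups, two_groups))
print(is_refinement(two_groups, three_groups))
```

The check is deliberately one-directional: the three-community solution refines the two-community one, while the reverse does not hold.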

Fig. 8. 85 communities obtained by MOGA-Net for the Director Boards network. Different colors identify different modules.

Network (6), shown in Fig. 7(c), for example, contains three communities, obtained by the division of the community on the left of Fig. 7(b) in two subgraphs identified by blue squares (nodes 1, 2, 3, 4, 8, 12, 13, 14, 18, 20, 22) and pink triangles (nodes 5, 6, 7, 11, 17). Network (8), displayed in Fig. 7(d), consists of four modules obtained by the split of each of the two main groups of Fig. 7(b) in two subgroups. This division has the highest value of modularity found (0.4020). Notice the small group constituted by only three nodes (25, 26, 32).
These results show that the multiobjective approach is effective in dealing with community identification in networks and has the great advantage, with respect to single objective methods, of providing at the same time a set of optimal solutions, those contained in the Pareto front, thus allowing the exploration of the modular organization of the network.

F. Results on Large Networks
In this section, we further analyze the algorithm MOGA-Net by considering three other networks modeling different complex systems, and compare the results with those obtained by the method proposed by Pujol et al. [47], referred to as PBD after the initials of the authors, and Newman's algorithm described in [40], referred to as Newman. The three networks are the Erdös collaboration network [46], the citation Scientometrics network [30], and the affiliation network among the Spanish top directors board [24]. In Table V, the network size, the number of communities found (NC), and the modularity values (Mod) obtained by MOGA-Net, PBD and Newman algorithms, respectively, are reported. The values of the last two methods are those published in [47]. The table points out that when the size of the network is large, the number of communities found by MOGA-Net is much higher than the number of communities found by PBD and Newman. Furthermore, the modularity values of the last two methods are higher for the Erdös and Scientometrics networks, while as regards the Directors Board network MOGA-Net reaches almost the same value as PBD and a higher value than Newman. It is worth noting that both these methods, as described in Section III, are agglomerative hierarchical methods that merge groups of nodes when the modularity value is optimized.

TABLE V
Comparison Between Best Modularity Value and Number of Communities Obtained by MOGA-Net, PBD, and Newman Algorithms

| Network | Size | MOGA-Net Mod | MOGA-Net NC | PBD Mod | PBD NC | Newman Mod | Newman NC |
|---|---|---|---|---|---|---|---|
| Erdös | 6927 | 0.5502 | 302 | 0.6817 | 20 | 0.6723 | 57 |
| Scientometrics | 2678 | 0.2879 | 148 | 0.5629 | 10 | 0.5555 | 24 |
| Directors Board | 1130 | 0.8253 | 85 | 0.8273 | 16 | 0.8046 | 21 |

Recently, Fortunato and Barthélemy [22] proved that the optimization of modularity has a resolution limit that depends on the total size of the network and the interconnections of the modules. This implies that partitions obtained by the

maximization of modularity could fail to obtain modules below this scale, even if tightly connected. Thus, important structures at small scales, hidden within large groups having higher modularity value, could not be discovered. This problem is further discussed by Good et al. [27], where it is argued that optimal modularity partitions may not coincide with the intuitive partition that correctly detects the modular structure of a network. In particular, they state that high modularity values mean that the partitioning obtained is very different from a random graph with the same degree sequence, and not necessarily that the partitioning is highly modular.
Since MOGA-Net does not optimize the modularity value, the partitioning it finds differs from those obtained by the other two methods. Consider Fig. 8 where the Director Boards network is depicted. Different colors of the nodes indicate the 85 different communities obtained by MOGA-Net.¹ It is clear from the figure that the low number of groups obtained by PBD (16) and Newman (21) indicates that the two algorithms suffer from the resolution limit problem, since the many small intuitive groups present in the network are merged together in few large communities. MOGA-Net, instead, though for some networks it obtains a partitioning of lower modularity value, has no scale problems and allows the analysis of the network at local level.

¹ The figure has been realized by using Pajek [9]. It is worth noting that this visualization program uses at most 40 colors. When the number of clusters is above 40, Pajek cycles through the first forty colors again.

VII. Discussion and Conclusion
This paper proposed the formalization of the problem of community detection in complex networks as a multiobjective clustering problem, and presented an evolutionary multiobjective approach to uncover community structure. The method maximizes the intra-connections inside each community and minimizes inter-connections between different communities. A main characteristic of the algorithm is that it automatically provides a network partitioning without the need of knowing a priori the precise number of clusters. This is particularly useful in all those applications where no information about the group division is available. The approach has been tested on synthetic and real life networks, showing that it is able to correctly detect communities and to be competitive with state-of-the-art methods. The multiobjective approach has the advantage, with respect to single objective approaches, of simultaneously optimizing multiple criteria and of providing, not a single partitioning, but a set of solutions, each corresponding to a different number of clusters, constituting the best tradeoff between the competing objectives. Experiments showed that the nondominated solutions contained in the Pareto front are meaningful and allow the analysis of the community structure at different hierarchical levels. The investigation of the network properties at various resolution levels is very important since often organizations are arranged in a hierarchical form, where small groups aggregate to produce larger communities. The choice of one model with respect to another can be done by adopting an internal criterion of quality, like that adopted by the approaches described in this paper, i.e., selecting the partitioning with the highest modularity value, or it can be delegated to an expert on the basis of the application domain.
It is known that genetic algorithms can require high execution times when large populations of individuals are used. Though fitness computation of the two objectives can be done in linear time with respect to the number of network nodes, the multiobjective approach employed has a time complexity quadratic in the population size [11]. On the other hand, genetic algorithms are naturally suited to be implemented on parallel architectures. In order to deal with very large networks and make the approach proposed competitive with the state-of-the-art methods that detect communities, we are planning to realize an implementation of MOGA-Net on a parallel machine.

References
[1] A. Arenas and A. Díaz-Guilera, “Synchronization and modularity in complex networks,” Eur. Phys. J., vol. 143, no. 1, pp. 19–25, Apr. 2007.
[2] S. Bandyopadhyay and S. K. Pal, “Multiobjective GAs, quantitative indices, and pattern classification,” IEEE Trans. Syst. Man Cybern. B Cybern., vol. 34, no. 5, pp. 2088–2099, Oct. 2004.
[3] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefevre, “Fast unfolding of communities in large networks,” J. Statist. Mech. Theory Exp., vol. 2008, no. 10, p. P10008, Oct. 2008.
[4] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Phys. Rev. E, vol. 70, no. 6, p. 066111, 2004.
[5] C. A. Coello Coello, G. B. Lamont, and D. A. Van Veldhuizen, Evolutionary Algorithms for Solving Multiobjective Problems. Berlin, Germany: Springer, 2007.
[6] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, “Community structure identification,” Large Scale Structure and Dynamics of Complex Networks: From Information Technology to Finance and Natural Science. Singapore: World Scientific, 2007, pp. 93–113.
[7] L. Danon, J. Duch, A. Arenas, and A. Díaz-Guilera, “Comparing community structure identification,” J. Stat. Mech., vol. 2005, p. P09008, Sep. 2005.
[8] D. Datta, J. R. Figueira, C. M. Fonseca, and F. Tavares-Pereira, “Graph partitioning through a multiobjective evolutionary algorithm: A preliminary study,” in Proc. GECCO, 2007, pp. 625–632.
[9] W. de Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge, MA: Cambridge University Press, 2005.
[10] K. Deb, Multi-Objective Optimization Using Evolutionary Algorithms. Chichester, U.K.: Wiley, 2001.
[11] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: NSGA-II,” IEEE Trans. Evol. Comput., vol. 6, no. 2, pp. 182–197, Apr. 2002.
[12] G. Nildem Demir, A. Sima Uyar, and S. Gündüz Öğüdücü, “Multiobjective evolutionary clustering of web user sessions: A case study in web page recommendation,” Soft Comput., vol. 14, no. 6, pp. 579–597, 2010.
[13] C. H. Q. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, “A min-max cut algorithm for graph partitioning and data clustering,” in Proc. IEEE ICDM, Dec. 2001, pp. 107–114.
[14] J. Du, E. E. Korkmaz, R. Alhajj, and K. Barker, “Novel clustering that employs genetic algorithm with new representation scheme and multiple objectives,” in Proc. 6th Int. Conf. DAWAK, 2004, pp. 219–228.
[15] M. Ehrgott, Multicriteria Optimization, 2nd ed. Berlin, Germany: Springer, 2005.
[16] Á. E. Eiben, R. Hinterding, and Z. Michalewicz, “Parameter control in evolutionary algorithms,” IEEE Trans. Evol. Comput., vol. 3, no. 2, pp. 124–141, Jul. 1999.
[17] K. Faceli, A. C. P. L. F. de Carvalho, and M. C. P. de Souto, “Multiobjective clustering ensemble,” Int. J. Hybrid Intell. Syst., vol. 4, no. 3, pp. 145–156, 2007.
[18] Z. Feng, X. Xu, N. Yuruk, and T. A. J. Schweiger, “A novel similarity-based modularity function for graph partitioning,” in Proc. 9th Int. Conf. DAWAK, 2007, pp. 385–396.

[19] A. Ferligoj and V. Batagelj, “Direct multicriterion clustering,” J. Classification, vol. 9, no. 1, pp. 43–61, Jan. 1992.
[20] A. Firat, S. Chatterjee, and M. Yilmaz, “Genetic clustering of social networks using random walk,” Comput. Statist. Data Anal., vol. 51, no. 12, pp. 6285–6294, Aug. 2007.
[21] S. Fortunato, “Community detection in graphs,” Phys. Rep., vol. 486, pp. 75–174, 2010.
[22] S. Fortunato and M. Barthélemy, “Resolution limit in community detection,” Proc. Natl. Acad. Sci. USA, vol. 104, no. 1, pp. 36–41, Jan. 2007.
[23] S. Fortunato and C. Castellano, “Community structure in graphs,” in Encyclopedia of Complexity and Systems Science, R. A. Meyers, Ed. Berlin, Germany: Springer, 2009, pp. 1141–1163.
[24] Data from the Project “Small Worlds of Corporate Networks,” IESE Business School, Univ. Navarra, Pamplona, Spain, 2005.
[25] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proc. Natl. Acad. Sci. USA, vol. 99, no. 12, pp. 7821–7826, Jun. 2002.
[26] A. Gog, D. Dumitrescu, and B. Hirsbrunner, “Community detection in complex networks using collaborative evolutionary algorithms,” in Proc. 9th ECAL, 2007, pp. 886–894.
[27] B. H. Good, Y.-A. de Montjoye, and A. Clauset, “The performance of modularity maximization in practical contexts,” Phys. Rev. E, vol. 81, no. 4, p. 046106, 2010.
[28] J. Handl and J. Knowles, “An evolutionary approach to multiobjective clustering,” IEEE Trans. Evol. Comput., vol. 11, no. 1, pp. 56–76, Feb. 2007.
[29] D. He, Z. Wang, B. Yang, and C. Zhou, “Genetic algorithm with ensemble learning for detecting community structure in complex networks,” in Proc. 4th Int. Conf. Comput. Sci. Convergence Inform. Technol., Nov. 2009, pp. 702–707.
[30] HISTCITE [Online]. Available: http://www.garfield.library.upenn.edu/histcomp
[31] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[32] A. Lancichinetti, S. Fortunato, and J. Kertész, “Detecting the overlapping and hierarchical community structure of complex networks,” New J. Phys., vol. 11, p. 033015, Mar. 2009.
[33] A. Lancichinetti, S. Fortunato, and F. Radicchi. (2008). “New benchmark in community detection,” arXiv:0805.4770v2 [physics.soc-ph] [Online]. Available: http://arxiv.org/pdf/0805.4770v2
[34] M. Lipczak and E. Milios, “Agglomerative genetic algorithm for clustering in social networks,” in Proc. GECCO, 2009, pp. 1243–1250.
[35] S. Lozano, J. Duch, and A. Arenas, “Analysis of large social datasets by community detection,” Eur. Phys. J. Special Top., vol. 143, no. 1, pp. 257–259, 2007.
[36] D. Lusseau, “The emergent properties of dolphin social network,” Biol. Lett. Proc. R. Soc. Lond. B, vol. 270, pp. S186–S188, Nov. 2003.
[46] Erdös Number Project [Online]. Available: http://www.oakland.edu/enp/thedata
[47] J. M. Pujol, J. Béjar, and J. Delgado, “Clustering algorithm for determining community structure in large networks,” Phys. Rev. E, vol. 74, no. 1, p. 016107, Jul. 2006.
[48] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, “Defining and identifying communities in networks,” Proc. Natl. Acad. Sci. USA, vol. 101, no. 9, pp. 2658–2663, 2004.
[49] R. Romero-Záliz, C. Rubio-Escudero, J. P. Cobb, F. Herrera, O. Cordón, and I. Zwir, “A multiobjective evolutionary conceptual clustering methodology for gene annotation within structural databases: A case of study on the gene ontology database,” IEEE Trans. Evol. Comput., vol. 12, no. 6, pp. 679–701, Dec. 2008.
[50] P. J. Rousseeuw, “Silhouettes: A graphical aid to the interpretation and validation of cluster analysis,” J. Comput. Appl. Math., vol. 20, no. 1, pp. 53–65, 1987.
[51] S. Saha and S. Bandyopadhyay, “A new multiobjective clustering technique based on the concept of stability and symmetry,” Knowl. Inform. Syst., vol. 23, no. 1, pp. 1–27, 2010.
[52] J. R. Schott, “Fault tolerant design using single and multicriteria genetic algorithm optimization,” M.S. thesis, Dept. Aeronautics Astronautics, Massachusetts Instit. Technol., Cambridge, 1995.
[53] P. Schuetz and A. Caflisch, “Multistep greedy algorithm identifies community structure in real-world and computer-generated networks,” Phys. Rev. E, vol. 78, no. 2, p. 026112, Aug. 2008.
[54] S. K. Smit and Á. E. Eiben, “Parameter tuning of evolutionary algorithms: Generalist versus specialist,” in Applications of Evolutionary Computation. Berlin, Germany: Springer, 2010, pp. 542–551.
[55] M. Tasgin, A. Herdagdelen, and A. Bingol. (2007). “Communities detection in complex networks using genetic algorithms,” arXiv.org:0711.0491v1 [physics.soc-ph] [Online]. Available: http://arxiv.org/pdf/0711.0491v1
[56] D. A. van Veldhuizen and G. B. Lamont, “Multiobjective evolutionary algorithm research: A history and analysis,” Dept. Electr. Comput. Eng., Graduate School Eng., Air Force Instit. Technol., Wright-Patterson AFB, OH, Tech. Rep. TR-98-03, 1998.
[57] D. A. van Veldhuizen and G. B. Lamont, “Multiobjective evolutionary algorithm test suites,” in Proc. ACM SAC, 1999, pp. 551–557.
[58] K. Wakita and T. Tsurumi. (2007). “Finding community structure in mega-scale social networks,” arXiv:cs/0702048v1 [Online]. Available: http://arxiv.org/pdf/cs.CY/0702048
[59] E. Zitzler, K. Deb, and L. Thiele, “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evol. Comput., vol. 8, no. 2, pp. 173–195, 2000.
[60] E. Zitzler and L. Thiele, “Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach,” IEEE Trans. Evol. Comput., vol. 3, no. 4, pp. 257–271, Nov. 1999.
[37] D. J. C. MacKay, “Information theory,” Inference and Learning Algo-
rithms. Cambridge, U.K.: Cambridge University Press, 2002.
[38] N. Matake, T. Hiroyasu, M. Miki, and T. Senda, “Multiobjective
clustering with automatic k-determination for large-scale data,” in Proc.
Int. GECCO, 2007, pp. 861–868. Clara Pizzuti received the Laurea degree in math-
[39] A. Mukhopadhyay, U. Maulik, and S. Bandyopadhyay, “Multiobjective ematics from the University of Calabria, Cosenza,
genetic algorithm-based fuzzy clustering of categorical attributes,” IEEE Italy.
Trans. Evol. Comput., vol. 13, no. 5, pp. 991–1005, Oct. 2009. She is currently a Senior Researcher with the Insti-
[40] M. E. J. Newman, “Fast algorithm for detecting community structure in tute of High Performance Computing and Network-
networks,” Phys. Rev. E, vol. 69, no. 6, p. 066133, 2004. ing, National Research Council of Italy, Rende, Italy.
[41] M. E. J. Newman, “Modularity and community structure in networks,” Since 1995, she has been a Contract Professor with
Proc. Natl. Acad. Sci. USA, vol. 103, no. 23, pp. 8577–8582, Jun. 2006. the Department of Computer Science, University of
[42] M. E. J. Newman and M. Girvan, “Finding and evaluating community Calabria. In the past, she worked in the research divi-
structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, 2004. sion of a software company on deductive databases,
[43] Y. J. Park and M. S. Song, “A genetic algorithm for clustering problems,” advanced logic based systems, and abduction. She
in Proc. 3rd Annu. Conf. Genet. Algorithms, 1989, pp. 2–9. has published more than 70 papers in conference proceedings and journals.
[44] C. Pizzuti, “GA-NET: A genetic algorithm for community detection in Her current research interests include knowledge discovery in databases,
social networks,” in Proc. 10th Int. Conf. PPSN, 2008, pp. 1081–1090. data mining, data streams, bioinformatics, e-health, social network analysis,
[45] P. Pons and M. Latapy, “Computing communities in large networks using evolutionary computation, genetic algorithms, and genetic programming.
random walks,” J. Graph Algorithms Applicat., vol. 10, no. 2, pp. 191– Ms. Pizzuti is serving as a program committee member of international
218, 2006. conferences and as a reviewer for several international journals.
