Abstract
Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a strong need for multi-purpose techniques able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure of partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs. Several applications on real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks.
Citation: Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding Statistically Significant Communities in Networks. PLoS ONE 6(4): e18961. https://doi.org/10.1371/journal.pone.0018961
Editor: Eshel Ben-Jacob, Tel Aviv University, Israel
Received: December 9, 2010; Accepted: March 14, 2011; Published: April 29, 2011
Copyright: © 2011 Lancichinetti et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: A.L. and S.F. gratefully acknowledge ICTeCollective. The project ICTeCollective acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 238597. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The analysis and modeling of networked datasets are probably the hottest research topics within the modern science of complex systems [1]–[7]. The main reason is that, despite its simplicity, the network representation can disclose some relevant features of the system at large, involving its structure, its function, as well as the interplay between structure and function. The elementary units of the system are reduced to simple points, called vertices (or nodes), while their pairwise relationships/interactions are pictured as edges (or links). It is fairly easy to spot the two main ingredients of a graph in many instances. Therefore networks can be found everywhere: in biology (e. g., proteins and their interactions), ecology (e. g., species and their trophic interactions), society (e. g., people and their acquaintanceships). Other noteworthy examples include the Internet (routers/autonomous systems and their physical and/or wireless connections), the World Wide Web (URLs and their hyperlinks), etc.
The structure of most networks, beneath the intrinsic disorder due to the stochastic character of their generation mechanisms, reveals a high degree of organization. In particular, vertices with similar properties or function have a higher chance to be linked to each other than random pairs of vertices and tend to form highly cohesive subgraphs, which are called communities (also modules or clusters). Examples of communities are groups of mutual acquaintances in social networks [8]–[10], subsets of Web pages on the same subject [11], compartments in food webs [12], [13], functional modules in protein interaction networks [14], biochemical pathways in metabolic networks [15], [16], etc.
Detecting communities in graphs may help to identify functional subunits of the system and to uncover similarities among vertices that are not apparent in the absence of detailed (non-topological) information. Vertices belonging to the same community may be classified according to their structural position within the cluster, which may be correlated to their role. Vertices in the core of the cluster may have a function of control and stability within the module, whereas boundary vertices are likely to be mediators between different parts of the graph. The community structure of a network can also be a powerful visual representation of the system: instead of visualizing all the vertices and edges of the network (which is impossible on large systems), one could display its communities and their mutual connections, obtaining a far more compact and understandable description of the graph as a whole. It is thus not surprising that community detection in graphs has been so extensively investigated over the last few years [17]. A huge variety of different methods have been designed by a truly interdisciplinary community of scholars, including physicists, computer scientists, mathematicians, biologists, engineers and social scientists.
However, most algorithms currently available cannot handle important network features. Many methods are designed to find clusters in undirected graphs, and cannot be easily (or not at all) extended to directed graphs. However, there are many datasets for which edge directedness is an essential feature. Citation networks, food webs and the Web graph are but a few examples. Similar problems arise when edges carry weights, indicating the strength of the interaction/affinity between vertices, although extensions are generally easier in this case.
Likewise, the great majority of algorithms cannot deal with the peculiar features of community structure. For example, each vertex is typically assigned to a single cluster, while in several instances, as in social networks, vertices are shared between two or more clusters. In such cases communities overlap (and partitions become covers); very few methods account for this possibility [18]–[25], which considerably increases the complexity of the problem. Furthermore, community structure is very often hierarchical, i.e. it consists of communities which include (or are included by) other communities. Hierarchies are common in human societies and are crucial for the efficient management of large organizations. Simon pointed out that hierarchy gives robustness and stability to complex systems, yielding an evolutionary advantage in the long run [26]. However, most community finding methods look for the “best” partition of a network, disregarding the possible existence of hierarchical structure. Instead, a method should be able to recognize whether there is hierarchical structure and, if so, identify the corresponding levels [27]–[29].
It is also very important for a method to distinguish communities from pseudo-communities. The existence of clusters indicates a preference of some groups of vertices to link to each other. But if the linking probability is the same for all pairs of vertices, as in random graphs, no communities are expected. In this case, concentrations of edges within groups of vertices are simply the result of random fluctuations and do not represent potentially non-trivial structures. Many algorithms cannot see this difference and find clusters in random graphs as well, although such clusters are not meaningful. Scholars have just begun to address the issue of the significance of clusters [30], [31].
Finally, given the recent availability of time-stamped networked datasets, it is now possible to carry out quantitative studies on the dynamics of community structure, about which very little is known [32]–[37]. A simple way to treat dynamic datasets is to analyze snapshots of the system at different times separately, and then map communities of different snapshots onto each other, so that one can follow the dynamics of each cluster in time. However, focusing on individual snapshots means disregarding the information on the system at previous times. Ideally a partition/cover of the system at time $t$ should be faithful both to its structure at time $t$ and to its history [34], [37].
In this paper we propose the first method able to meet all requirements listed above, the Order Statistics Local Optimization Method (OSLOM). It is a method that optimizes locally the statistical significance of clusters, defined with respect to a global null model. The concept of statistical significance is inspired by recent work of some of the authors [31], [38]. The paper is structured as follows. After introducing the method, we test its performance on artificial benchmark graphs, comparing it with the performances of the best algorithms currently available. Next, we pass to the analysis of real networks, followed by a final discussion on the work. Some of the tests on artificial and real networks are reported in the Supporting Information S1.
Methods
Statistical significance of clusters
In this section we explain how to estimate the statistical significance of a given cluster. OSLOM uses the significance as a fitness measure to evaluate the clusters. Following our previous work [31], we define it as the probability of finding the cluster in a random null model, i.e. in a class of graphs without community structure. We choose the configuration model [39] as our null model. This is a model designed to build random networks with a given distribution of the number of neighbors of a vertex (degree). The networks are generated by randomly joining vertices under the constraint that each vertex has a fixed number of neighbors, taken from the pre-assigned degree distribution. This is basically the same null model adopted by Newman and Girvan to define modularity [40].
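The null model just described can be sketched with a stub-matching construction, in which every vertex receives as many edge ends ("stubs") as its degree and the stubs are paired at random. This is only an illustration of the idea, not the OSLOM implementation: the function name `configuration_model` and the toy degree sequence are assumptions, and the sketch allows self-loops and multiple edges.

```python
import random
from collections import Counter


def configuration_model(degrees, seed=None):
    """Pair edge ends (stubs) uniformly at random.

    Returns a list of edges; self-loops and multi-edges are allowed,
    which keeps the sketch short at the price of realism.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sequence must have an even sum")
    rng = random.Random(seed)
    # Vertex v appears degrees[v] times in the list of stubs.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs of the shuffled list form the edges.
    return list(zip(stubs[0::2], stubs[1::2]))


degrees = [3, 2, 2, 1, 1, 1]            # toy degree sequence (illustrative)
edges = configuration_model(degrees, seed=1)
realized = Counter(v for edge in edges for v in edge)
```

Whatever the random pairing, every vertex keeps the degree assigned to it, which is the only constraint of the null model.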
We start from a graph $G$ with $N$ vertices and $E$ edges. The framework for the analysis is sketched in Fig. 1. We are given a subgraph $C$, whose significance is to be assessed, a vertex $i$ and the degree of the vertices of the rest of the graph $G \setminus (C \cup \{i\})$. The degree of subgraph $C$ is $m_C$, $k_i$ is the degree of $i$, and the rest of the vertices have a total degree $M$. We can separate the above quantities into the contributions internal or external to $C$ (superscripts $in$ and $out$); the internal degree of $C$ is $m_C^{in}$ (Fig. 1).
The subgraph $C$ is embedded within a random graph generated by the configuration model. The degrees of all vertices of the network are fixed; in the figure we have highlighted the degrees of $C$ ($m_C$), of the vertex $i$ at the center of the analysis ($k_i$) and of the rest of the graph ($M$). These quantities are expressed as sums of contributions which are internal to their own set of vertices (as $m_C^{in}$) or related to subgraph $C$ (in or out). This notation is used in the distribution of Eq. 1.
Let us suppose that $C$ is a subgraph of graphs generated by the configuration model, where each vertex maintains the degree it has on the graph $G$ at study. We assume that the internal degree $m_C^{in}$ of the subgraph is fixed. If all the other edges of the network are randomly drawn, the probability that $i$ has $k_i^{in}$ neighbors in $C$ can be written as [38]
$$p(k_i^{in} \mid i, C, G) = A \, \frac{2^{-k_i^{in}}}{k_i^{out}! \; k_i^{in}! \; (m_C^{out} - k_i^{in})! \; (M^{*}/2)!} \qquad (1)$$
This equation enumerates the possible configurations of the network with $k_i^{in}$ connections between $i$ and $C$. The factorials of the formula express the multiplicity of configurations with fixed values of $k_i^{in}$, $k_i^{out}$, $m_C^{out}$ and $M^{*}$, whereas the power of $2$ in the numerator stands for the multiplicity coming from the permutation of the extremes of edges lying within the rest of the graph. Here $m_C^{out}$ is the external degree of $C$ and $M^{*}$ is the part of $M$ carried by edges internal to the rest of the graph. Several of the terms in the expression can actually be written as a function of constants and $k_i^{in}$, such as $k_i^{out} = k_i - k_i^{in}$ and $M^{*} = M - k_i^{out} - (m_C^{out} - k_i^{in})$. The normalization factor $A$ includes terms not depending on $k_i^{in}$ and ensures that
$$\sum_{k_i^{in}} p(k_i^{in} \mid i, C, G) = 1 \qquad (2)$$
Further details on the numerical implementation of the formula in Eq. 1, as well as on the different approximations taken and their limits, are included in the Supporting Information S1.
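A probability built from factorials, like that of Eq. 1, is usually evaluated in log space to avoid overflow. The sketch below is not the authors' implementation (which is described in the Supporting Information S1). It assumes a stub-counting form of the null probability, $p(k^{in}) \propto 2^{-k^{in}} / [k^{out}! \, k^{in}! \, (m^{out} - k^{in})! \, (M^{*}/2)!]$, where $m^{out}$ is the number of edge ends leaving the subgraph and $M^{*}$ the number of edge ends paired inside the rest of the graph. The function and parameter names are hypothetical.

```python
from math import exp, lgamma, log


def log_factorial(n):
    """log(n!) computed through the log-gamma function."""
    return lgamma(n + 1.0)


def internal_degree_pmf(k_i, m_out, big_m):
    """Normalized null probability that vertex i has k_in neighbours in C.

    Assumed inputs: k_i is the degree of i, m_out the number of edge ends
    leaving C, big_m the total degree of all remaining vertices.
    """
    log_terms = {}
    for k_in in range(min(k_i, m_out) + 1):
        k_out = k_i - k_in                # edge ends of i pointing to the rest
        c_rest = m_out - k_in             # edge ends of C pointing to the rest
        m_star = big_m - k_out - c_rest   # edge ends paired inside the rest
        if m_star < 0 or m_star % 2:
            continue                      # no valid configuration for this k_in
        log_terms[k_in] = (-k_in * log(2.0)
                           - log_factorial(k_out) - log_factorial(k_in)
                           - log_factorial(c_rest) - log_factorial(m_star // 2))
    if not log_terms:
        raise ValueError("degrees are inconsistent: no valid configuration")
    top = max(log_terms.values())         # subtract the maximum for stability
    weights = {k: exp(v - top) for k, v in log_terms.items()}
    norm = sum(weights.values())          # plays the role of 1/A
    return {k: w / norm for k, w in weights.items()}


pmf = internal_degree_pmf(k_i=4, m_out=10, big_m=100)
```

With these toy values the ratio `pmf[1] / pmf[0]` is $20/44 \approx 0.45$, close to the naive expectation $k_i m^{out} / M = 0.4$ for the mean number of links between the vertex and the subgraph.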
The probability of Eq. 1 provides a tool to rank the vertices external to $C$ according to the likelihood of their topological relation with the group. If vertex $i$ shares many more edges with the vertices of subgraph $C$ than expected in the null model, we could consider the inclusion of $i$ in $C$, since the relationship between $i$ and $C$ is “unexpectedly” strong. In order to perform the ranking, the cumulative probability $r(k_i^{in}) = \sum_{j \geq k_i^{in}} p(j \mid i, C, G)$ of having a number of internal connections equal to or larger than $k_i^{in}$ is estimated, following Ref. [31]. Given that the vertex degree is a discrete variable, the cumulative distribution has a specific step-wise profile for each value of $k_i$. In order to facilitate the comparison of vertices with different degrees, we implement a bootstrap strategy by assigning to each vertex $i$ a value of $r$, $r_i$, randomly drawn from the interval $[r(k_i^{in} + 1), r(k_i^{in})]$. This choice is important for a meaningful estimate of the clusters' significance; other options (e.g., taking the middle points of the interval) could lead to the identification of meaningful clusters in random graphs. The bootstrap introduces a stochastic element in the assessment procedure, which will, in turn, lead to the use of Monte Carlo techniques.
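The bootstrap can be illustrated as follows. Given any discrete null distribution for the number of internal connections, the score of a vertex is drawn uniformly between the upper-tail probabilities at the observed value and at the next one; if the observed value itself follows the null distribution, the resulting scores are uniform on the unit interval. The toy distribution `pmf` and the function names are assumptions made for this sketch.

```python
import random


def upper_tail(pmf, k):
    """Probability of observing k or more internal connections."""
    return sum(p for j, p in pmf.items() if j >= k)


def bootstrap_score(pmf, k_obs, rng):
    """Draw a score uniformly between the tail at k_obs + 1 and at k_obs."""
    high = upper_tail(pmf, k_obs)
    low = upper_tail(pmf, k_obs + 1)
    return rng.uniform(low, high)


# Toy null distribution of the internal degree (illustrative values only).
pmf = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
rng = random.Random(0)

# Under the null model the observed internal degree follows pmf itself.
observed = rng.choices(list(pmf), weights=list(pmf.values()), k=20000)
scores = [bootstrap_score(pmf, k, rng) for k in observed]

mean_score = sum(scores) / len(scores)
fraction_below_tenth = sum(s < 0.1 for s in scores) / len(scores)
```

For a uniform variable one expects `mean_score` near 0.5 and `fraction_below_tenth` near 0.1, which is what the randomization within each step of the cumulative distribution is designed to achieve.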
The variable $r$ bears the information regarding the likelihood of the topological relation of each vertex with $C$ and has an important feature: it is a uniform random variable distributed between zero and one for vertices of our null model graphs. Calculating its order statistic distributions is thus a relatively easy task. The first candidate among the external vertices to be part of $C$ is the vertex with the lowest value of $r$, which we indicate as $r_1$. The cumulative distribution of $r_1$ in the null model is then given by
$$\Omega_1(x) = P(r_1 < x) = 1 - (1 - x)^{N - n_C} \qquad (3)$$
where $N - n_C$ is the number of vertices in $G \setminus C$, $n_C$ being the number of vertices of $C$. In general, let $r_q$ be the value of variable $r$ with rank $q$ (in increasing order of the variable $r$). Its cumulative distribution is (Fig. 2):
$$\Omega_q(x) = P(r_q < x) = \sum_{j=q}^{N - n_C} \binom{N - n_C}{j} \, x^{j} (1 - x)^{N - n_C - j} \qquad (4)$$
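The cumulative distribution of an order statistic of uniform variables is a binomial tail: the $q$-th smallest of $n$ independent uniform scores is below $x$ exactly when at least $q$ of them are below $x$. A minimal sketch, with the hypothetical function name `omega` and $n$ standing for the number of external vertices:

```python
from math import comb


def omega(q, n, x):
    """P(q-th smallest of n independent uniform(0, 1) scores is below x)."""
    return sum(comb(n, j) * x ** j * (1.0 - x) ** (n - j)
               for j in range(q, n + 1))


# The best of 10 external vertices having a score below 0.05 is not rare:
best_of_ten = omega(1, 10, 0.05)      # equals 1 - 0.95 ** 10
# The third best having a score below 0.05 is much less likely:
third_of_ten = omega(3, 10, 0.05)
```

For $q = 1$ the sum reduces to $1 - (1 - x)^n$, the distribution of the minimum.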
The score $r_q$ is the $q$-th smallest score of the external vertices. The curves in the figure are the distributions of several order statistics $r_q$ (from left to right). As an example, the shaded areas show the cumulative probability $\Omega_q(r_q)$ for a few values of $r_q$ that would correspond to the values estimated in a practical situation. In this case, the black area is the least extensive and so it gives the minimum over $q$, denoted $c_m$. If this minimum is significant, the corresponding best-ranked vertices will be added to $C$.
The reason for the use of order statistics is that we assume that clustering methods tend to include in each community those vertices which are most strongly connected to vertices of the community. Due to correlations (the vertices in the clusters tend to be connected), we cannot calculate the statistics of the internal connections to the clusters, but we can do it safely for the external vertices. The values of the different $\Omega_q(r_q)$ inform us of how much the external vertices of a group are compatible with the statistics expected in the null model. To evaluate the full group, we define $c_m = \min_q \Omega_q(r_q)$ among all the neighbors of $C$, where $r_q$ are their corresponding ranked values for the $r$ variable. The distribution of $c_m$ can be easily tabulated numerically since it only depends on $N - n_C$. The cumulative distribution will be denoted as $\phi(c_m)$. In the following, we call $\phi(c_m)$ the score of the cluster $C$.
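The numerical tabulation mentioned above can be sketched with a small Monte Carlo experiment: draw uniform scores for all external vertices, compute the minimum over the ranks of the order-statistic probabilities, and repeat. The sketch makes a simplifying assumption, namely that in the null experiment every external vertex receives a uniform score; all function names are hypothetical.

```python
import random
from bisect import bisect_right
from math import comb


def omega(q, n, x):
    """P(q-th smallest of n independent uniform(0, 1) scores is below x)."""
    return sum(comb(n, j) * x ** j * (1.0 - x) ** (n - j)
               for j in range(q, n + 1))


def min_order_probability(scores, n):
    """Minimum over the ranks q of omega(q, n, r_q) for the ranked scores."""
    ranked = sorted(scores)
    return min(omega(q, n, r) for q, r in enumerate(ranked, start=1))


def tabulate_null(n, samples, rng):
    """Sorted Monte Carlo sample of the minimum for n uniform scores."""
    return sorted(min_order_probability([rng.random() for _ in range(n)], n)
                  for _ in range(samples))


def cluster_score(table, value):
    """Empirical probability that the null minimum does not exceed value."""
    return bisect_right(table, value) / len(table)


table = tabulate_null(n=10, samples=2000, rng=random.Random(0))
strong = min_order_probability([1e-9, 0.5, 0.9], n=10)  # one very good neighbour
weak = min_order_probability([0.4, 0.7, 0.9], n=10)     # neighbours look random
```

A group with one exceptionally well connected neighbour gets a cluster score near zero, while a group whose neighbours look random gets a larger one.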
Single cluster analysis
Now that a score to evaluate the statistical significance of the clusters has been introduced, the next step is to optimize the score across the network by dividing it into proper clusters. We first describe the optimization of a single cluster score and later extend the method to deal with the full network. First of all one has to give the method a certain tolerance, in the following referred to as $P$. This parameter establishes when a given value of the score is considered significant. Our procedure consists of two phases: first, we explore the possibility of adding external vertices to the subgraph $C$; second, non-significant vertices in $C$ are pruned. They are described below and illustrated schematically in Fig. 3.
- For each vertex $i$ outside $C$ and connected to it by at least one edge the variable $r$ is computed. Then we calculate $\Omega_1(r_1)$ for the vertex with the smallest $r$, by using Eq. 3. If $\Omega_1(r_1) < P$, we add the corresponding vertex to the subgraph, which we now call $C'$. If $\Omega_1(r_1) > P$, one checks the second best vertex, the third best vertex, etc. If there is finally a vertex, say the $q$-th best vertex, for which $\Omega_q(r_q) < P$, one includes all $q$ best vertices into subgraph $C$, yielding subgraph $C'$. At this point, no other vertex outside $C'$ deserves to enter the community since all the external vertices are compatible with the statistics of the random configuration model. It may also happen that the inequality above holds for no external vertex, in which case we add no vertices to $C$ and $C' = C$. Either way, we pass to the second stage with the subgraph $C'$.
- For each vertex $i$ in $C'$ the variable $r$ with respect to the set $C' \setminus \{i\}$ is estimated. We pick the “worst” vertex $w$ of the cluster, i.e. the vertex with the highest value of $r$. To check for its significance we repeat step 1 for the subgraph $C' \setminus \{w\}$. If $w$ turns out to be significant, we keep it inside $C'$ and the analysis of the cluster is completed. Otherwise, $w$ is moved out of $C'$ and one searches for the worst internal vertex of $C' \setminus \{w\}$. At some point we end up with a cluster $C''$, whose internal vertices are all significant and the process stops.
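The first phase described above can be sketched as a scan over the ranked external scores. The sketch assumes that the scores of the neighbours have already been computed and that the test at rank $q$ compares the order-statistic probability of the $q$-th best score with the tolerance; the names `vertices_to_add`, `n_external` and `tolerance` are hypothetical.

```python
from math import comb


def omega(q, n, x):
    """P(q-th smallest of n independent uniform(0, 1) scores is below x)."""
    return sum(comb(n, j) * x ** j * (1.0 - x) ** (n - j)
               for j in range(q, n + 1))


def vertices_to_add(scores, n_external, tolerance):
    """Return the best-ranked vertices to include, or an empty list.

    scores maps each neighbouring vertex to its score; the vertices are
    scanned from best (lowest score) to worst, and the scan stops at the
    first rank whose order-statistic probability is below the tolerance.
    """
    ranked = sorted(scores, key=scores.get)
    for q, vertex in enumerate(ranked, start=1):
        if omega(q, n_external, scores[vertex]) < tolerance:
            return ranked[:q]          # include all q best vertices
    return []                          # every neighbour is compatible with noise


jointly_good = {"a": 0.20, "b": 0.21, "c": 0.22, "d": 0.90}
random_like = {"a": 0.30, "b": 0.60, "c": 0.80, "d": 0.95}
added = vertices_to_add(jointly_good, n_external=4, tolerance=0.1)
none_added = vertices_to_add(random_like, n_external=4, tolerance=0.1)
```

In the first toy example no single vertex is significant on its own, but three scores near 0.2 among four external vertices are jointly unlikely (probability about 0.036), so the three best vertices are added together. This is the situation the rank-by-rank check is meant to capture.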
The two-step procedure is a way to “clean up” $C$. A cluster is left unchanged only if all the external vertices are compatible with the null model and all the internal vertices are not. A few remarks are important here:
- There can be both good vertices outside $C$ and bad ones inside. It is important to perform the complete procedure described above, which guarantees that the final cluster is significant with respect to the present null model (see also Ref. [31]).
- The procedure is not deterministic, because of the stochastic component in the computation of the cumulative probability $r$. So one should repeat all the steps several times. The cluster analysis may deliver a cleaned subgraph, in general different from $C$, or an empty subgraph. For each vertex $i$ we compute the participation frequency $f_i$, defined as the ratio between the number of times $i$ belongs to any non-empty cleaned subgraph and the total number of iterations leading to non-empty subgraphs. In general, we consider the subgraph $C$ to be a significant cluster if the single cluster analysis yields a non-empty subgraph in more than a fixed fraction of the iterations. The final “cleaned” cluster includes those vertices for which $f_i$ exceeds a fixed threshold.
- In the worst-case scenario, the complexity of the cluster analysis scales with the number of vertices of $C$, times the number of neighbors of $C$, times the number of loops needed to have reliable values for the $f_i$'s. The situation can be considerably improved by keeping track of the order of the external vertices at each step (using suitable data structures) and by computing the score only for some reasonably good vertices. For instance, one could pick just those vertices with $r$ below a given threshold. We numerically checked that changing this threshold does not affect the results, but leads to a faster algorithm.
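The repetition of the single cluster analysis and the participation frequency can be sketched as below. The two thresholds are left as explicit parameters, and the values used in the example (0.5 for both) are purely illustrative assumptions; the function name is hypothetical.

```python
from collections import Counter


def consensus_cluster(runs, min_nonempty_fraction, min_participation):
    """Combine the outcomes of repeated single cluster analyses.

    runs is a list of vertex sets (possibly empty), one per repetition.
    The cluster is rejected unless enough repetitions are non-empty;
    otherwise the vertices with high participation frequency are kept.
    """
    nonempty = [cluster for cluster in runs if cluster]
    if not runs or len(nonempty) / len(runs) <= min_nonempty_fraction:
        return set()
    counts = Counter(v for cluster in nonempty for v in cluster)
    # Participation frequency: appearances divided by non-empty repetitions.
    return {v for v, c in counts.items()
            if c / len(nonempty) > min_participation}


runs = [{1, 2, 3}, {1, 2}, set(), {1, 2, 3, 4}]
cleaned = consensus_cluster(runs, min_nonempty_fraction=0.5,
                            min_participation=0.5)
rejected = consensus_cluster([set(), set(), {1, 2}, set()],
                             min_nonempty_fraction=0.5,
                             min_participation=0.5)
```

Here vertex 4 appears in only one of the three non-empty outcomes and is dropped, while a cluster that comes out empty in most repetitions is rejected altogether.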
Network analysis
The previous procedure deals with a single cluster $C$. It finds the external significant vertices and includes them into $C$. It also prunes those internal vertices that are not statistically relevant. Now we extend this procedure by introducing an algorithm able to analyze the full network. In order to do so, we follow the method proposed by some of the authors in Ref. [23]. The starting point is a single vertex, taken at random, in the absence of any information. Let us suppose that we start from a random vertex $i$ and that our first group is $C = \{i\}$. The method proceeds as follows:
- A random number of vertices is added to $C$, considering the most significant among the neighbors of the cluster. This number is taken from a distribution, which in principle can be arbitrary. We choose a power law with a fixed exponent.
- Perform the single cluster analysis.
- Perform the single cluster analysis.
We repeat the whole procedure starting from several vertices in order to explore different regions of the network. This yields a final set of clusters that may overlap. This type of local optimization was originally implemented in the Local Fitness Method [23] to handle overlapping communities. The algorithm stops when it keeps finding similar modules over and over.
Ideally one wishes to encounter the exact same clusters repeatedly. However, the
stochastic element introduced when calculating the vertex score can lead
vertices, whose score is close to the threshold, to change their group
assignments from one realization to another. This can be a problem when we are
trying to decide whether two groups in different instances correspond to the
same cluster. As a practical rule, we say that two groups
and
are similar if
, in which case they deserve further attention. Indeed,
it turns out that many of the clusters found are very similar or combinations of
each other. This leads to a very important question: given a set of significant
clusters, which ones should be kept?
Let us consider the problem of choosing between two clusters
and
and the union of
the two,
. A solution is to consider the subgraph
of the vertices in
and see if
and
are significant as
modules of
. Strictly speaking we consider
and
which are the
cleaned up clusters within
(i.e. with respect
to subgraph
only, neglecting the rest of the network). We discard
if
, where we set
. Otherwise we discard
and
and we keep the union
. Instead, if we
have to decide among a set of
clusters and their
union, the condition to prefer the submodules is
.
In general, we check if each cluster has significant submodules, by looking for modules in the subgraph given by the cluster and using the condition above to decide which ones to take. This leads to a set of significant minimal clusters, where minimal means that they have no significant internal cluster structure, according to the condition above. We also need to check whether unions of such minimal clusters do have internal cluster structure, according to our rule, to decide whether the clusters have to be kept separated or merged. After doing this, we still end up with many similar modules. Given a pair of similar modules (in the sense defined above), we first check if their union has significant cluster structure: if it does not, we merge the two clusters, otherwise we systematically prefer the bigger one (if they are equal-sized, we pick the cluster with smaller score).
After the completion of this procedure, the output is a cover of the network. To reduce the stochasticity introduced by the bootstrap, the procedure is repeated in order to obtain several covers. All clusters of the covers are analyzed as described above to select among them the ones which will appear in the final output.
The parameter values may affect the outcome of OSLOM. The value of the significance level $P$ plays an important role in determining the size of the clusters found by OSLOM. In general, small values of $P$ lead to the identification of large clusters, and large values of $P$ allow the identification of small clusters. Likewise, large values of the parameter which controls the internal structure of modules generally lead to the identification of large clusters. The influence of the parameter values is however relevant only when the community structure of the network is not pronounced. When modules are well defined, the results of OSLOM do not depend on the particular choice of the parameter values.
OSLOM
We have described the cleaning of a single cluster and how the full network is analyzed. In the following, all the ingredients are assembled to form the algorithm that we call OSLOM (Order Statistics Local Optimization Method). A flow diagram summarizing how it works can be seen in Fig. 4. OSLOM consists of three phases:
- First, it looks for significant clusters, until convergence;
- Second, it analyzes the resulting set of clusters, trying to detect their internal structure or possible unions thereof;
- Third, it detects the hierarchical structure of the clusters.
The levels of grey of the squares represent different loop levels. One can provide an initial partition/cover as input, from which the algorithm starts operating, or no input, in which case the algorithm will build the clusters around individual vertices, chosen at random. OSLOM first performs a cleaning procedure of the clusters, followed by a check of their internal structure and by a decision on possible cluster unions. This is repeated with different choices of random numbers in order to obtain better statistics and more reliable information. The final step is to generate a super-network for the next level of the hierarchical analysis.
To speed up the method, one can start from a given partition/cover delivered by another (fast) algorithm or from a priori information. In those cases, the first step will be to clean up the given clusters.
Once the set of minimal significant clusters has been found, the analysis of the hierarchies consists of the following steps. We construct a new network formed by clusters, where each cluster is turned into a supervertex and there are edges between supervertices if the clusters they represent are linked to each other. The resulting superedges are weighted by the number of edges between the initial clusters. There is the problem of properly assigning edges between clusters if the edges are incident on overlapping vertices. Suppose we have an edge whose endvertices $i$ and $j$ belong to $n_i$ and $n_j$ clusters, respectively. This edge lies simultaneously between any pair of clusters $A$ and $B$, with $A$ including $i$ and $B$ including $j$. The contribution of the edge to the superedge between $A$ and $B$ equals $1/(n_i n_j)$. The resulting non-integer contributions may lead to non-integer values for the weight of superedges, whereas we need integer values in order to use Eq. 1. For this reason, the weight of each superedge is rounded to the nearest integer value. We stress that the weight we deal with here indicates just how to “split” edges; it is not related to the weight that edges may carry. If the original network is weighted, the rescaled weight of an edge is $w_{ij}/(n_i n_j)$, $w_{ij}$ being the weight of the edge in the network. Once the supernetwork has been built, one applies the method again, obtaining the second hierarchical level. The latter is turned again into a supernetwork, as we explained above, and so on, until the method produces no clusters. In this way OSLOM recovers the hierarchical community structure of the original graph.
We will describe next the main features of OSLOM, and what it adds to the state of the art in community detection.
Significant clusters.
The main characteristic of OSLOM is that it is based on a fitness measure, the score, that is tightly related to the significance of the clusters in the configuration model. In fact, the single cluster analysis is designed to optimize the cluster significance as defined in Ref. [31]. Therefore the output of OSLOM consists of clusters that are unlikely to be found in an equivalent random graph with the same degree sequence. The tolerance $P$, fixed initially, determines whether such clusters are “unexpectedly unlikely”, and therefore significant, or not. So, if the method is fed with a random graph, the output will include very few clusters or even none at all.
Homeless vertices.
The vertices in a random network will be deemed as homeless. Homeless vertices are those that are not assigned to any cluster. This is a very important feature that OSLOM includes. The presence of random noise or non-significant vertices is an issue that may occur in many real systems. However, very few clustering techniques take into account this possibility. In OSLOM, it comes as a natural output. We will quantitatively analyze this feature when we test the method on benchmark graphs.
Overlapping communities.
A natural output of OSLOM is the possibility for clusters to overlap. Since each cluster is “cleaned” independently of the others, a fraction of its vertices may also belong to other clusters. We will show the efficiency of OSLOM in unveiling overlapping vertices in suitably designed benchmarks.
Cluster hierarchy.
Another relevant feature of OSLOM is the analysis of the hierarchical structure of the clusters. As mentioned above, the third phase of our method includes a procedure to take care of this issue. The results are very good on hierarchical benchmarks.
OSLOM generally finds different depths in different hierarchical branches. In fact, when the algorithm is applied not all vertices are grouped, as some of them are homeless. The coexistence of homeless vertices with proper clusters yields a hierarchical structure with branches of different depths.
Weighted networks.
OSLOM can be generalized to weighted graphs as well. We assume that the contribution to the probability of having a connection between two vertices $i$ and $j$ with a certain weight $w_{ij}$, given the vertex degrees $k_i$ and $k_j$ and their strengths, $s_i$ and $s_j$, is separable into two different terms in the configuration model: one for the topology and another for the weight [38]. The strength of a vertex is defined as the sum of the weights of all the edges
incident on it. We approximate the weight contribution
by
(5)where
$\langle w_{ij} \rangle$ is the harmonic mean of the average weights of vertices $i$ and $j$, defined as $\langle w_i \rangle = s_i / k_i$ and $\langle w_j \rangle = s_j / k_j$, respectively. The idea behind this expression is that the weight of an edge of the null model should be proportional to the average weight of its endvertices. We chose the harmonic average because it is more sensitive to the small values of the average weights.
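The role of the harmonic mean can be seen in a two-line computation. The sketch below only evaluates the typical edge weight used by the null model, taking the average weight of a vertex to be its strength divided by its degree; the function names are hypothetical.

```python
def average_weight(strength, degree):
    """Average weight of the edges of a vertex: strength over degree."""
    return strength / degree


def null_edge_weight_scale(s_i, k_i, s_j, k_j):
    """Harmonic mean of the average weights of the two endvertices."""
    w_i = average_weight(s_i, k_i)
    w_j = average_weight(s_j, k_j)
    return 2.0 * w_i * w_j / (w_i + w_j)


# A vertex with light edges (average 1) linked to one with heavy edges
# (average 10): the harmonic mean stays close to the smaller value.
harmonic = null_edge_weight_scale(s_i=10.0, k_i=10, s_j=100.0, k_j=10)
arithmetic = (average_weight(10.0, 10) + average_weight(100.0, 10)) / 2.0
```

The harmonic mean is $20/11 \approx 1.8$ while the arithmetic mean is 5.5, so an edge towards a vertex with light edges is not expected to be heavy just because its other endvertex has heavy edges.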
We use this distribution to define a new variable accounting for the probability of having a certain weight on a given edge, with the strengths of the vertices and the general weight distribution known. We combine this variable with its topological counterpart, $r$, obtaining a new combined variable. This is a non-trivial task since the two probabilities are defined on different sets of elements (see the Supporting Information S1). For the combined variable we can estimate, as before, the order statistic distributions and we proceed just as we do for unweighted graphs.
Directed graphs.
OSLOM can be easily generalized to handle directed graphs. For that, we need to define two uniformly distributed random variables $r^{out}$ and $r^{in}$. The former is based on the probability that vertex $i$ has a given number of outgoing edges ending on vertices of the given subgraph $C$, the latter is based on the probability that $i$ has a given number of incoming edges originating from vertices of $C$. These two probabilities are computed through formulas analogous to Eq. 1 or numerical approximations to it. The final score of vertex $i$ is given by the product $r^{out} r^{in}$. We are able to calculate the distribution of this product and therefore to estimate its order statistics (just as for the weighted case, see Section 1.1 of Supporting Information S1). The rest of the clustering method proceeds as explained above. If graphs have edges with both directions and weights, we have four variables for each vertex: $r^{out}$, $r^{in}$ and the corresponding versions for the weights. The final score is given again by the product of these four variables.
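The distribution of a product of independent uniform variables is known in closed form, which is what makes the product usable as a score. For $n$ independent variables uniform on the unit interval, $P(\text{product} \le z) = z \sum_{j=0}^{n-1} (-\ln z)^j / j!$; for two variables this is $z (1 - \ln z)$. The sketch below checks the formula against a simulation. Treating the variables as independent is an assumption of the sketch, and the function name is hypothetical.

```python
import random
from math import factorial, log


def product_cdf(z, n=2):
    """P(product of n independent uniform(0, 1) variables <= z)."""
    if z <= 0.0:
        return 0.0
    if z >= 1.0:
        return 1.0
    t = -log(z)
    return z * sum(t ** j / factorial(j) for j in range(n))


# Monte Carlo check for the product of two uniform scores.
rng = random.Random(1)
trials = 20000
empirical = sum(rng.random() * rng.random() <= 0.2
                for _ in range(trials)) / trials
exact = product_cdf(0.2)        # 0.2 * (1 + ln 5), about 0.52
```

Applying `product_cdf` to the product maps it back to a uniform variable, so the order statistics of the unweighted, undirected case can be reused; with `n=4` the same function covers the product of four variables.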
Dynamical networks.
Time-stamped networked datasets are usually divided into snapshots,
condensing the relational information between vertices within different time
windows. Snapshots are typically analyzed separately, whereas it would be
more informative to combine the information from different time slices. For
instance, consider two snapshots and
at times
and
, respectively. A simple idea is to find the
partition/cover of the network at time
, by applying
the method to the corresponding snapshot, and to use the result as an input
for the application of the method to the network at time
. In this way one can see how the community structure
at time
“evolves” to that at time
. This is a rather general approach, it can be
adopted for other algorithms for community detection, like greedy
optimization techniques. OSLOM has the useful property that it can start
from any initial partition/cover, which can be given as input. In this way
the clusters found in
can be used as the
initial condition for the analysis of
. With this
approach, the new partition/cover is closer to that in
and we are able to track the groups' evolution.
Naturally, if the two snapshots are very different from each other (because
they refer to times between which the system has changed considerably, for
instance), OSLOM produces a partition/cover in
that is uncorrelated with that of
.
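The chaining of snapshots described above can be sketched as follows. The function `track_covers` and its `detect` argument are hypothetical names: `detect` stands for any community detection routine that accepts an optional initial cover, and it is not the interface of the OSLOM software. Restricting the previous cover to the vertices present in the new snapshot is an assumption of this sketch.

```python
from typing import Callable, Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbors
Cover = List[Set[Hashable]]            # clusters are allowed to overlap


def track_covers(
    snapshots: List[Graph],
    detect: Callable[[Graph, Optional[Cover]], Cover],
) -> List[Cover]:
    """Analyze snapshots in time order, seeding each run with the previous cover."""
    covers: List[Cover] = []
    previous: Optional[Cover] = None
    for graph in snapshots:
        seed: Optional[Cover] = None
        if previous is not None:
            vertices = set(graph)
            # Keep only vertices that still exist, and drop emptied clusters.
            seed = [cluster & vertices for cluster in previous]
            seed = [cluster for cluster in seed if cluster]
        previous = detect(graph, seed)
        covers.append(previous)
    return covers
```

The first snapshot is analyzed without any input, and every later one starts from the result of its predecessor.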
Complexity.
The complexity of OSLOM cannot be estimated exactly, as it depends on the
specific features of the community structure under study. Therefore we carried
out a numerical study of the complexity, whose results are shown in Fig. 5. We applied the
method to the LFR benchmark [41], which we have
used extensively to test the performance of OSLOM. We used both the
standard version of the algorithm and a fast implementation, in which the
algorithm acts on the partition delivered by a quick method. For each
version we considered undirected and unweighted LFR benchmark graphs
with two different levels of mixing between the clusters
( and
, corresponding
to well separated and well mixed clusters). The other parameters needed to
build the LFR benchmark graphs are the same as for the graphs used in Fig. 6. The diagram of
Fig. 5 shows the
execution time (in seconds) as a function of the number
of vertices of the graphs. The processes were run on
an HP Z800 workstation. The time scales as a power law of
with good approximation, if the graphs are not too
small. The behavior seems to depend neither on how mixed communities are,
nor on the particular implementation of the algorithm (there seems to be
just a constant factor between the corresponding curves). Power law fits of the
large-N portion of the curves yield an exponent
, which implies that the complexity is essentially
linear in this case.
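A power law fit of this kind can be sketched as an ordinary least-squares fit of log(time) against log(N). The function name and the choice of plain least squares are assumptions made for illustration; the text does not specify the fitting procedure actually used.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], times: Sequence[float]) -> Tuple[float, float]:
    """Fit times ~ prefactor * sizes**exponent by least squares in log-log space."""
    if len(sizes) != len(times) or len(sizes) < 2:
        raise ValueError("need at least two (size, time) pairs of equal length")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all sizes are equal; the exponent is undetermined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return exponent, prefactor
```

To restrict the fit to the large-N portion of a curve, one would pass only the points above a chosen size. An exponent close to 1 indicates running time that grows essentially linearly with the number of vertices.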
The diagram shows how the execution time of two different implementations of the algorithm scales with the network size (expressed by the number of vertices), for LFR benchmark graphs.
The parameters of the graphs are: average degree
,
maximum degree
,
exponents of the power law distributions are
for
degree and
for
community size, S and B mean that community sizes are in the range
(“small”) and
(“big”), respectively. We considered two network sizes:
(top)
and
(bottom). The two curves refer to OSLOM (diamonds) and Infomap
(circles).
Results
Artificial networks
In this section we test OSLOM against artificial benchmarks, comparing its performance with those of the best algorithms currently available. We mostly adopted the LFR benchmark [41], [42], a class of graphs with planted community structure and heterogeneous distributions of vertex degree and community size. Tests on the well known Girvan-Newman (GN) benchmark [8] are shown in the Supporting Information S1. In this section we present tests on undirected and unweighted networks, with and without hierarchical structure and overlapping communities. We also show how OSLOM handles the presence of randomness in the graph structure. Tests on weighted networks and on directed networks can be found in the Supporting Information S1.
In the following sections, for each network, we combine the results of 10 iterations of the network analysis for the first hierarchical level and of 50 iterations for higher levels, if any. The single cluster analysis was repeated 100 times for each cluster.
LFR benchmark.
The LFR benchmark [41], [42], like the GN
benchmark, is a particular case of the planted $\ell$-partition model [43],
which is the simplest possible model of networks with communities. The
planted $\ell$-partition model is a class of graphs whose vertices
are divided into $\ell$ equal-sized
groups, such that the probability that two vertices of the same group are
linked is $p_{in}$, while the probability that two vertices of
different groups are linked is $p_{out}$, with $p_{in} > p_{out}$. The planted $\ell$-partition
model is too simple to describe real networks. Vertices have essentially the
same degree and communities have the same size, at odds with empirical
analysis showing that both features typically are broadly distributed [19], [44]–[48]. Therefore we
have recently proposed a generalization of the model, the LFR benchmark, by
introducing power-law distributions for the vertex degree and the community
size, with exponents
and
, respectively [41]. The LFR
benchmark poses a far harder challenge to algorithms than the benchmark by
Girvan and Newman, which is regularly used in the literature, and is more
suitable to spot their limits. We are of course aware that the communities
of the model are still too simple to match the communities of real networks.
Other features should be introduced to tailor the model graphs to
real graphs. This is certainly doable, and could be specialized to the
particular domain of application one is interested in. Still, the clusters
of the LFR benchmark are a much better proxy of real communities than the
clusters of other benchmark graphs.
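For reference, the planted partition model described above can be generated with a few lines of code. This is a minimal sketch, not the LFR generator: the function name, the argument names and the quadratic loop over vertex pairs are choices made for illustration.

```python
import itertools
import random
from typing import List, Optional, Tuple


def planted_partition_graph(
    num_groups: int,
    group_size: int,
    p_in: float,
    p_out: float,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (group of each vertex, edge list) for equal-sized planted groups."""
    rng = random.Random(seed)
    n = num_groups * group_size
    group = [v // group_size for v in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        # Same-group pairs are linked with probability p_in, others with p_out.
        p = p_in if group[u] == group[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return group, edges
```

Because every vertex has the same expected degree and all groups have the same size, graphs built this way lack the broad degree and community size distributions that the LFR benchmark introduces.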
Vertices of the LFR benchmark have a fixed degree (in this case taken from
the given power law distribution), so the two parameters $p_{in}$
and $p_{out}$ of the planted $\ell$-partition model are not independent. We therefore choose as
independent variable the mixing parameter $\mu$, which is the ratio of the number of external
neighbors of a vertex to the total degree of the vertex. Small values of $\mu$
indicate well separated clusters, whereas for higher
and higher values communities become more and more mixed with each other.
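The definition of the mixing parameter can be made concrete by measuring it on a given graph and partition. The sketch below assumes an undirected graph stored as a dictionary of neighbor sets and a hard partition; the function name is hypothetical.

```python
from typing import Dict, Hashable, Set


def mixing_parameters(
    adjacency: Dict[Hashable, Set[Hashable]],
    community: Dict[Hashable, Hashable],
) -> Dict[Hashable, float]:
    """Per-vertex ratio of external neighbors to total degree."""
    result = {}
    for v, neighbors in adjacency.items():
        if not neighbors:
            continue  # the ratio is undefined for isolated vertices
        external = sum(1 for u in neighbors if community[u] != community[v])
        result[v] = external / len(neighbors)
    return result
```

In a benchmark graph the measured values fluctuate around the nominal mixing parameter from vertex to vertex, which is relevant to the discussion of homeless vertices later in this section.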
As a term of comparison we used Infomap [49], which has proved to
be very accurate on artificial benchmark graphs [50]. Fig. 6 shows the
comparative performance of OSLOM and Infomap on the LFR benchmark, with
undirected and unweighted edges and non-overlapping clusters. As a measure
of similarity between the planted partition and that recovered by the
algorithm we adopted the Normalized Mutual Information (NMI) [51], in the
extended version proposed in Ref. [23], which enables
one to compare both partitions and covers. We used this definition also for
hard planted partitions, since modules found by OSLOM may be overlapping. In
all tests on artificial graphs each point is always an average over
realizations.
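As a point of reference, the standard normalized mutual information between two hard partitions can be computed as below. This is only the basic definition for partitions, normalized by the sum of the two entropies; it is not the extended version of Ref. [23] for covers that is used in the tests, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def normalized_mutual_information(
    labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]
) -> float:
    """NMI = 2 I(A; B) / (H(A) + H(B)) for two labelings of the same vertices."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0.0:
        return 1.0  # both labelings put every vertex in a single cluster
    return 2.0 * mutual / (h_a + h_b)
```

The value is 1 when the two partitions coincide up to a relabeling of the clusters, and 0 when they are statistically independent.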
The plots correspond to two network sizes, and
, and two ranges of community size,
(“small”) and
(“big”), that we indicate with the
letters S and B, respectively. In this way we can check how much the
performance of the algorithm is affected by the network size and the average
size of the communities. The other network parameters are given in the
caption. From the plots we conclude that OSLOM and Infomap have a basically
equivalent performance.
It is important to test the performance of the algorithms on large graphs as
well, given the increasing availability of large networked datasets. The
question is if and how their performance is affected by the network size.
Fig. 7 shows that
both OSLOM and Infomap are effective at finding communities on large LFR
graphs. We remark that the inferior accuracy of OSLOM when communities are
better defined comes from the fact that the method occasionally finds
homeless vertices, i.e. vertices that are not significantly linked to any
cluster. These are vertices that happen not to have a significant excess of
neighbors within their community with respect to the number of neighbors in
the other communities, despite the fact that the average number of internal
neighbors is high. This happens because of fluctuations, and the method
judges such vertices as not belonging to any group, which makes sense. This
issue of the homeless vertices is a general feature of OSLOM. One should not
judge it negatively, though. If a vertex happens to
have a number of external neighbors which is appreciably higher than the
expected external degree of the vertex
, the condition
of the planted $\ell$-partition
model does not hold, so in principle the vertex should not be put in its
original community. The confusion derives from the fact that the condition
holds on average.
The network sizes are (left)
and
(right), the maximum degree
and
the community size ranges from
to
. The
other parameters are the same as those used for the graphs of Fig. 6. The two
curves refer to OSLOM (diamonds) and Infomap (circles).
LFR benchmark with overlapping communities.
The LFR benchmark also accounts for overlapping communities, by assigning to each vertex an equal number of neighbors in different clusters [42]. To simplify things, we assume that each vertex belongs to the same number of communities. We cannot use Infomap for the comparison, as it delivers “hard” partitions, without overlaps between clusters. We therefore used two recent methods that perform well on LFR graphs with overlapping communities: COPRA [52], based on label propagation [53], and MOSES [54], based on stochastic block modeling [55]. COPRA and MOSES are more effective at detecting overlapping communities in LFR benchmark graphs than the popular Clique Percolation Method (CPM) [19], which is why we do not use the CPM here. In Fig. 8 we show how the performance of each method decays with the fraction of overlapping vertices, for different choices of the mixing parameter and for the small (S) and big (B) communities defined above. Since in social networks there may be many vertices belonging to several groups, we also considered the extreme situation of graphs consisting entirely of overlapping vertices. In this case, as the number of memberships of the vertices increases, communities become fuzzier and it gets harder and harder for any method to correctly identify the modules. From Fig. 8 we deduce that OSLOM significantly outperforms COPRA in both tests and MOSES in the test with both overlapping and non-overlapping vertices, while the performances of OSLOM and MOSES are quite close when all vertices are overlapping.
The parameters are: ,
,
,
,
. S and
B indicate the usual ranges of community sizes we use:
and
,
respectively. We tested OSLOM against two recent methods to find
covers in graphs: COPRA [52] and MOSES
[54]. The left panel displays the normalized
mutual information (NMI) between the planted cover and the one
recovered by the algorithm, as a function of the fraction of
overlapping vertices. Each overlapping vertex is shared between two
clusters. The four curves correspond to different values of the
mixing parameter
(
and
) and
to the community size ranges S and B. The right panel shows a test
on graphs whose vertices are all shared between clusters. Each
vertex is a member of the same number of clusters. The plot shows the
NMI as a function of the number of memberships of the vertices. Each
curve corresponds to a given value of the average degree
. The
graph parameters are
,
,
,
,
.
Community sizes are in the range
.
Hierarchical LFR benchmark.
OSLOM is capable of handling hierarchical community structure as well. To test
its performance we have designed an algorithm that produces a version of the
LFR benchmark with hierarchy. To keep things simple, we consider a two-level
hierarchical structure (Fig. 9).
The idea is to use the wiring procedure of the original
algorithm twice, first for the micro-communities and then for the
macro-communities. In order to do so, we need two mixing parameters:
, the fraction of neighbors of each vertex belonging
to different macro-communities;
, the fraction
of neighbors of each vertex belonging to the same macro-community but to
different micro-communities.
Stars indicate overlapping vertices.
The question is whether the algorithm is able to recover both planted
partitions of the benchmark, which we call Fine
(micro-communities) and Coarse (macro-communities). The
partitions found by the algorithm can be one, two or more, we call them
partition . In the test, whose results are illustrated in Fig. 10, we compare the
Fine partition with partition 1 (Fine 1), the Coarse partition with
partition 2 (Coarse 2), and the Coarse partition with partition 1 (Coarse
1). We compare OSLOM with a recent extension of Infomap to networks with
hierarchical community structure [56]. In the plots we show
how the similarity of the three pairs of partitions mentioned above varies
by increasing
but keeping
constant (we picked the values
,
,
,
). For a better
comparison of the panels we put on the x-axis the sum
, representing the fraction of neighbors of a vertex
not belonging to its micro-community. We find that, when
increases, the Fine partition becomes difficult to
resolve and, for
, it cannot be
found anymore and both algorithms can only find the Coarse partition.
Instead, for smaller values of
, the
algorithms can recover both levels. OSLOM performs better than Infomap if
is not too small.
We compare three pairs of partitions: the lowest hierarchical
partition found by the algorithm (indicated by
) with
the set of micro-communities of the benchmark (Fine); the lowest
hierarchical partition found by the algorithm with the set of
macro-communities of the benchmark (Coarse); the second lowest
hierarchical partition found by the algorithm (indicated by
) with
the set of macro-communities of the benchmark. The corresponding
similarities are plotted as a function of
, for
fixed
. There
are
vertices, the average degree
, the
maximum degree
, the
size of the macro-communities lies between
and
vertices, the size of the micro-communities lies between
and
vertices. The exponents of the degree and community size
distributions are
and
.
Random graphs and noise.
We check whether OSLOM is also able to recognize the absence, and not simply the presence, of community structure. In random graphs vertices are connected to each other at random, modulo some basic constraints such as keeping a prescribed degree distribution or sequence. In this way, there are by definition no groups of vertices that preferentially link to each other, so there are no communities. There may be subgraphs with an internal edge density higher than the average edge density of the whole network, but they originate from stochastic fluctuations (noise). A good community finding algorithm should be able to recognize that such subgraphs are false positives, and discard them. Here we want to see if OSLOM distinguishes “order” from “noise”. For this purpose, we carried out two tests.
In Fig. 11 we applied
OSLOM and Infomap to Erdős-Rényi random graphs [57] and
scale-free networks [58]. The goal is to see whether the algorithms
recognize that there are no actual communities. Good answers are the
partition with as many communities as vertices, or the partition with all
vertices in the same community. Let us call the partition
found by the algorithm at hand. Clusters in
containing at
least two vertices and smaller than the whole network indicate that the
method has been fooled. The fraction of graph vertices belonging to those
clusters is a measure of reliability: the lower this number, the better the
algorithm. In Fig. 11 we
show this variable as a function of the average degree
of the random graphs we considered. For OSLOM it
remains very low for all values of
. This is not
surprising, since OSLOM estimates the statistical significance of clusters,
and is therefore ideal to detect stochastic fluctuations. Infomap instead
finds many non-trivial clusters when
is low,
whereas it correctly recognizes the absence of community structure if
increases.
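The reliability measure used in this test is simple to compute from the output of any algorithm. In the sketch below, a cluster counts as non-trivial if it contains at least two vertices and fewer vertices than the whole network, as stated above; the function name is hypothetical.

```python
from typing import Hashable, Iterable, Set


def fraction_in_nontrivial_clusters(
    clusters: Iterable[Set[Hashable]], num_vertices: int
) -> float:
    """Fraction of vertices lying in clusters of size 2 .. num_vertices - 1."""
    if num_vertices <= 0:
        raise ValueError("num_vertices must be positive")
    covered = set()
    for cluster in clusters:
        if 2 <= len(cluster) < num_vertices:
            covered.update(cluster)  # a set, so shared vertices count once
    return len(covered) / num_vertices
```

Both acceptable answers on a random graph, all singletons or a single cluster spanning the network, give a value of 0.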
We plot the fraction of vertices belonging to non-trivial clusters
(i.e. to clusters with more than one and less than
vertices, where
is as
usual the size of the graph), as a function of the average degree of
the graph. The curves correspond to Erdős-Rényi graphs
(diamonds) and scale-free networks (circles). All graphs have
vertices. The only parameter needed to build Erdős-Rényi
graphs is the probability that a pair of vertices is connected,
which is determined by the average degree
. The
scale-free networks were built with the configuration model [39], starting from a fixed degree sequence for
the vertices obeying the predefined power law distribution. The
parameters of the distribution are: degree exponent
,
maximum degree
.
The second test deals with graphs consisting of an ordered
part, with well-defined clusters, and a noisy part,
consisting of vertices randomly attached to the rest of the network. The
ordered part is an LFR benchmark graph with vertices and
represents the starting configuration of our system. The noisy vertices (up
to
in number) are successively added in sequence, and a
newly added vertex is linked to the other ones via preferential attachment
[58].
The initial degree of the noisy vertices is drawn from a power law
distribution with
and exponent
. We measure two things, as a function of the number
of noisy vertices: the similarity between the set of noisy vertices and the
set of homeless vertices found by OSLOM, which is expressed by the Jaccard
Index [59]
(Fig. 12, left); the
similarity between the planted partition of the ordered part of the graph
and the subset of the partition found by OSLOM including (only) the vertices
of the ordered part, which is expressed by the normalized mutual information
(Fig. 12, right). We
compare OSLOM with Infomap and COPRA [52]. We find that OSLOM
correctly separates the clusters and the noise up to a number of about
noisy vertices, which represent almost a third of
the whole network. Infomap and COPRA, instead, do not recognize the noisy
vertices, no matter how small their number is. Also, they tend to mix noisy
vertices with the clusters of the planted partition of the ordered part, as
shown by the fact that the partition they recover never exactly matches the
planted partition, not even when just a few noisy vertices are present.
These results are actually understandable in the case of Infomap, which is
based on the minimization of the code length required to describe random
walks taking place on the graph: singletons (clusters consisting of single
vertices) are generally not admitted because they increase the amount of
information required to map the process, due to the high number of
transitions of the walker from the singletons to the rest of the graph and
back.
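The Jaccard index used to compare the set of noisy vertices with the set of homeless vertices is the size of the intersection of the two sets divided by the size of their union. A minimal sketch follows; the function name and the convention for two empty sets are assumptions of this sketch.

```python
from typing import Hashable, Iterable


def jaccard_index(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """|A intersect B| / |A union B| for two collections of vertices."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(set_a & set_b) / len(union)
```

A value of 1 means that the homeless vertices found by the algorithm are exactly the noisy vertices that were added to the graph.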
The communities are those of an LFR benchmark graph (undirected,
unweighted and without overlapping clusters), with
,
,
,
. The
cluster size ranges from
to
vertices. The noise comes by adding vertices which are randomly
linked to the existing vertices, via preferential attachment. The
test consists in checking whether the community finding algorithm under
study (here OSLOM, Infomap and COPRA) is able to find the
communities of the planted partition of the LFR benchmark and to
recognize as homeless the other vertices.
Real networks
In this section we discuss the application of OSLOM to networks from the real world. In Table 1 we list the networks considered in our analysis, along with some basic statistics obtained from the detection of their community structure with OSLOM.
We analyzed different types of systems: social, information, biological and infrastructural networks. Here we discuss only some of them; the rest of the analysis can be found in the Supporting Information S1.
The word association network.
This network is built on the University of South Florida Free Association
Norms [60]. Here the presence of an edge between words
and
indicates that
some people associate
to the word
. This network is considered a paradigmatic example
of a graph with overlapping communities [19], since several words may
have various meanings and belong to different groups of words. In Fig. 13 we see a few
subgraphs of the word association network, revolving around four keywords:
bright, knowledge,
music and play. We see that the
keywords are shared among several clusters, which are semantically highly
homogeneous. For instance, bright belongs to three groups,
centered on the words color, shine and
smart, respectively, which makes sense. In the same
subgraph, the words sun and dark are also
overlapping vertices, belonging to the groups of color and
shine, as one might expect. In the subgraph centered on
knowledge, one distinguishes the groups referring to
the words mind, intelligent,
expert and college/university. Here
there are many overlapping vertices, like the word
intelligence, shared between the groups of
mind and intelligent, and a bunch of
terms indicating (mostly) professional status within schools and/or
universities, like student, professor,
teacher, etc., which lie between the groups of
expert and college/university. In the
third subgraph, the word music is shared by the groups of
instrument, song/dance and
noise/sound: other overlapping vertices are the words
sing and voice, lying between
song/dance and noise/sound, and the
words bass and saxophone, belonging to the
groups of song/dance and instrument.
Finally, the word play sits between the communities of
sport, music and
youth/kid; other overlapping vertices in this subgraph
include game, children,
toy, etc.
Stars indicate overlapping vertices.
UK commuting.
This is the network of flows of commuters between areas of the United
Kingdom, and therefore it has a clearly geographic character. It is composed
of vertices, each representing a ward, i.e. a
geographical division used in the UK census for statistical purposes. The
whole territory of the United Kingdom is divided into wards. Each edge
corresponds to a flow of commuters between the ward of origin and that of
destination, with a weight accounting for the number of commuters per day.
The data were collected during the
UK census,
when the ward of residence and the ward of work/study were registered for a
sizeable part of the British population. The database can be accessed online
at the site of the Office for National Statistics http://www.ons.gov.uk/census. OSLOM finds three hierarchical
levels (Fig. 14). The
clusters of the second level delimit geographical areas typically centered
about one major town. In the highest level the areas of England, Wales,
Scotland and Northern Ireland are clearly recognizable. Interestingly,
Northern Ireland and Scotland are parts of the same community, due to the
large flow of commuters between the two regions, despite the geographical
separation. Black points represent overlapping vertices.
Black points indicate overlapping vertices.
LiveJournal and UK Web.
We also applied OSLOM to two large networks. The first is a network of
friendship relationships between users of the on-line community
LiveJournal (www.livejournal.com),
and was downloaded from the Stanford Large Network Dataset Collection
(http://snap.stanford.edu/data/). The second is a crawl of
the Web graph carried out by the Stanford WebBase Project (http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/),
within the UK domain (.uk). We recall that the Web graph is a directed graph
whose vertices are Web pages, while the edges are the hyperlinks that enable
one to surf from one page to another. These two systems are too large for
OSLOM, due to the huge variety of possible cluster sizes to explore.
Therefore we applied a two-step method: in the first step, we derived an
initial partition with the
Louvain method [61], which is able to handle large networked
datasets; in the second step, we applied OSLOM to refine the clusters of
. In principle, this procedure should yield the same
partitions/covers as applying OSLOM directly, if one repeated OSLOM's
cluster search many times. But this would make the calculations too lengthy,
so, in order to complete the analysis within a reasonable time, it is
necessary to keep the number of iterations low. In this way there is the big
advantage of drastically reducing the computational complexity, which makes
large systems tractable, even if results would be more accurate if one could
apply OSLOM from scratch. Clearly, since different iterations are
independent processes, one could considerably improve the statistics by
distributing the iterations among different processors, if available.
In Fig. 15 we present the
distribution of cluster sizes of the first two hierarchical levels found by
OSLOM. The results were obtained by performing a single iteration on an
HP Z800 workstation. For the Web graph, which is the larger system, with
nearly million vertices and
million edges
(see Table 1), the
analysis was completed in about
hours. For the
social network of LiveJournal we can compare the results
with the corresponding distributions found by Infomap and the Label
Propagation Method (LPM) proposed by Leung et al. [62], which were computed in
a recent analysis [48]. In that work the original Infomap was used, so
neither Infomap nor the LPM could detect hierarchical community structure
and there is just one cluster size distribution, corresponding to the single
partition recovered. The distributions are broad and quite similar across
different methods. Interestingly, the two hierarchical levels of
LiveJournal (OSLOM 1 and OSLOM 2) are not too
different, indicating a sort of self-similarity of the community structure.
For the Web the two levels are more dissimilar and the distributions have a
clear power law decay (with different exponents) up to a cutoff, which is
approximately the same for both curves (
vertices).
We show the distribution of cluster sizes obtained by OSLOM for the first two hierarchical levels (OSLOM 1 and OSLOM 2). For LiveJournal we can compare the distributions with those found with Infomap [49] and the Label Propagation Method (LPM) by Leung et al. [62].
Dynamic datasets: the US air transportation network.
For the last application, we used a time-stamped dataset, the US air
transportation network. The data can be downloaded from the Bureau of
Transportation Statistics (US government) (http://www.bts.gov). Vertices
are airports in the USA and edges are weighted by the number of passengers
transported along the corresponding routes. In Fig. 16 we show the geographical location
of the airports and their communities, indicated by the symbols, for three
snapshots, corresponding to the traffic in March, June and September 2009,
respectively. We recall that for dynamical datasets we usually take the
partition/cover of the system
at time
, and we use it as initial partition/cover for the
topology of the system at time
, which is then
refined by OSLOM, in order to “adapt”
to the current structure. This is done to exploit
the information of more snapshots at the same time. Since the three maps of
Fig. 16 are mostly
illustrative, communities were derived by applying OSLOM directly to the
corresponding snapshots, for simplicity. The diagram indicates the
similarity between networks and their corresponding partitions/covers in
different snapshots. Each snapshot represents the whole traffic of one
quarter, which corresponds to a season, while
year, as we want to measure the variation of the
network structure in consecutive seasons. The similarity between
partitions/covers is computed with the normalized mutual information, as
usual. The similarity of two weighted networks like the ones at study is
measured in the following way. First, one computes the distance
between the matrices
and
:
. The matrix
is derived from the standard weight matrix
by dividing each edge weight by the sum of all edge
weights. This is done because the traffic flows tend to increase steadily in
time, so comparing the original weight matrices is not appropriate. The
quantity
is a dissimilarity measure. We turn it into a
similarity index by changing its sign, adding a constant and rescaling the
resulting values. Since we wish to compare the trend of the network
similarity with that of the partition/cover similarity, the additive
constant and the rescaling factor are chosen so as to reproduce the average
and the variance of the curve of the normalized mutual information. After
this operation, the two trends are directly comparable. The diagram shows
that both measures follow a yearly periodicity, with peaks corresponding to
the winter season, which is therefore more stable than the others.
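The normalization and rescaling steps can be sketched as follows. The weight normalization and the matching of mean and variance follow the description above. The distance between matrices is an assumption: the sketch uses the sum of absolute differences as a stand-in, which may differ from the distance actually used. All function names are hypothetical.

```python
import statistics
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def normalize_weights(weights: Dict[Edge, float]) -> Dict[Edge, float]:
    """Divide each edge weight by the sum of all edge weights."""
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()}


def l1_distance(a: Dict[Edge, float], b: Dict[Edge, float]) -> float:
    """ASSUMPTION: sum of absolute differences, used here as a stand-in distance."""
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in set(a) | set(b))


def rescale_to_match(
    dissimilarities: Sequence[float], reference: Sequence[float]
) -> List[float]:
    """Flip the sign, then shift and scale to match the mean and variance of reference."""
    negated = [-d for d in dissimilarities]
    spread = statistics.pstdev(negated)
    if spread == 0.0:
        raise ValueError("dissimilarities are constant; cannot match a variance")
    scale = statistics.pstdev(reference) / spread
    shift = statistics.mean(reference) - scale * statistics.mean(negated)
    return [scale * x + shift for x in negated]
```

Here `reference` plays the role of the normalized mutual information curve. Because the scale factor is positive after the sign flip, snapshots that are closer to each other receive a higher similarity.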
The maps show the position of the airports, which are represented by
symbols, indicating the communities found by applying OSLOM directly
to the corresponding network, without exploiting the information of
previous snapshots. The diagram shows the “seasonality”
of air traffic. The normalized mutual information (diamonds) was
computed comparing the cover of the system at time
adjusted by OSLOM on the network at time
, and
the cover obtained by applying OSLOM directly to the system at time
. The
circles are estimates of the similarity of the network matrices of
snapshots separated by
(one
year). For each year we took four snapshots, by accumulating the
traffic of each quarter. The most stable networks are typically in
winter (vertical lines).
Discussion
We have introduced OSLOM, the first method that finds clusters in networks based on their statistical significance. It is a multi-purpose technique, capable of handling various types of graphs, accounting for edge direction, edge weights, overlapping communities, hierarchy and network dynamics. Therefore, it can be used for a wide variety of datasets and applications.
We have thoroughly tested OSLOM against the best algorithms currently available on
various types of artificial benchmark graphs, with excellent results. In particular,
OSLOM is superior on directed graphs and in the detection of strongly overlapping
clusters. Moreover, it is an ideal method to recognize the absence of community
structure and/or the presence of randomness in graphs. In some cases OSLOM returns
slightly less accurate results than other methods, because it finds several homeless
vertices when communities are fuzzy. This is due to the fact that, in the
realizations of benchmark graphs, it may happen that some vertices end up having the
same number of neighbors (or even more) in other communities than in their own, due
to fluctuations, even if on average this does not happen. So, the classification of
those vertices, imposed by the planted
-partition model, is not justified topologically. This is an
important general issue that needs to be assessed in the future, to avoid systematic
errors in the testing procedure.
OSLOM is a local algorithm, so it respects the nature of community structure, which
is a local feature of networks, the more so the larger the systems under study.
However, the null model adopted to estimate the statistical significance of clusters
is the configuration model, which is global. This is the same null model adopted in
modularity optimization [63], and is responsible for the serious problems of this
technique, like its well known resolution limit [64]. Therefore we perform an
iterative cluster search within the clusters found after the first application of
the method, by considering each cluster as a network on its own. In this way we
progressively limit the horizon of the part of the network under exploration, and we
are able to find the smallest significant clusters, which are the natural building
blocks of the network and the basis of its hierarchical community structure. So the
null model, originally global, gets confined to smaller and smaller portions of the
graph. The actual resolution of the method is thus not due to the null model, but to
the choice of the threshold . In this paper we have
set
, which is often used in various contexts and delivers an
excellent performance on the benchmark graphs we have adopted. Nevertheless, how
much a real graph deviates from a random graph depends on the
specific system at hand, and it would be more appropriate to estimate the threshold
case by case. This is an issue to consider for future work.
We remark that also for modularity optimization one could in principle iteratively
restrict the null model to the clusters found by the method. However, modularity is
based on the expected value of variables estimated on the null
model, neglecting random fluctuations, which is why modularity can attain large
values on specific partitions of random graphs [65]–[67]. OSLOM instead accounts for
those fluctuations, so it is far more reliable in this respect. Furthermore, OSLOM
is a local method, so it does not suffer from the severe problems coming from
modularity's global optimization [68].
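The dependence of modularity on expected values alone can be made explicit. In the standard notation of [63] (the symbols are introduced here for illustration and may differ from those used earlier in the paper), modularity reads

```latex
Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j)
```

where A_ij is the adjacency matrix, k_i is the degree of vertex i, m is the total number of edges, and C_i is the cluster of vertex i. The term k_i k_j / 2m is approximately the expected number of edges between i and j in the configuration model. No term in Q accounts for the fluctuations of that number around its expected value.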
Another important aspect to emphasize is the need to perform many iterations to get more accurate results. This is not specific to OSLOM: it should be done for all community detection techniques with a stochastic character, such as methods based on optimization (e.g., modularity optimization). The general attitude in the literature is to perform a single iteration, and to identify the complexity of an algorithm with the time required to carry out one iteration. But this is not appropriate, especially on large networks. For instance, in a single iteration, vertices lying on the border between clusters may be assigned to one specific cluster, while in many cases they are overlapping. By combining the results of several iterations, instead, one is more likely to distinguish overlapping vertices from the others. Furthermore, one can compute the strength of the membership of vertices in different clusters from the frequency with which they were classified in each cluster. One can also distinguish stable clusters from unstable ones, which are recovered only in specific iterations. So it is crucial to collect and combine the results of many iterations. Of course, the complexity of the method grows with the number of iterations, but the running time can be considerably reduced by distributing runs among many different processors, if large computer clusters are available.
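One possible way to turn the results of several iterations into membership strengths is sketched below. It is an illustrative assumption, not the procedure implemented in OSLOM: since cluster labels are not aligned across runs, each cluster of a chosen reference cover is matched to the most similar cluster of every run by Jaccard overlap, and the membership strength of a vertex is the fraction of runs in which it belongs to the matched cluster. The function names `jaccard` and `membership_strength` are hypothetical.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two vertex sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def membership_strength(runs, reference):
    """Estimate membership strengths from repeated runs.

    runs:      list of covers, each cover being a list of vertex sets.
    reference: list of vertex sets used to align clusters across runs.
    Returns one dict {vertex: fraction of runs} per reference cluster.
    """
    strengths = []
    for ref in reference:
        counts = defaultdict(int)
        for cover in runs:
            if not cover:
                continue
            # Align: pick the cluster of this run that best overlaps `ref`.
            best = max(cover, key=lambda c: jaccard(ref, c))
            if jaccard(ref, best) == 0.0:
                continue  # this run has no counterpart of `ref`
            for v in best:
                counts[v] += 1
        strengths.append({v: n / len(runs) for v, n in counts.items()})
    return strengths


# Three hypothetical runs; vertex 4 sits on the border between two clusters.
runs = [
    [{1, 2, 3, 4}, {5, 6, 7}],
    [{1, 2, 3}, {4, 5, 6, 7}],
    [{1, 2, 3, 4}, {4, 5, 6, 7}],
]
strengths = membership_strength(runs, reference=runs[0])
for ref, s in zip(runs[0], strengths):
    print(sorted(ref), {v: round(f, 2) for v, f in sorted(s.items())})
```

In this toy input, vertex 4 is found in each of the two clusters in two runs out of three, so it would be flagged as a candidate overlapping vertex, while the other vertices belong to a single cluster in every run. Because the runs are independent, they can be executed on separate processors and only the resulting covers need to be gathered for this step.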
The running time of OSLOM is dominated by the exhaustive search for significant vertices, inside and outside the clusters. This search could be carried out with greedy approaches, with a huge computational advantage, and this is an improvement we plan to implement in the near future. On the other hand, if one wishes to attack very large graphs, OSLOM could be used at a second stage, as a refinement technique, to clean the results of an initial partition delivered by a fast algorithm. In this case, since the initial clusters are usually cores or parts of the significant clusters we are looking for, OSLOM converges far more rapidly than when it is applied directly, without an input partition. We have seen in the previous section that, by combining OSLOM with the Louvain method by Blondel et al., we were able to handle systems with millions of vertices.
We have proposed a recipe to deal with the increasingly important issue of detecting communities in dynamic networks. The idea is to take advantage of the information of different snapshots at the same time, by “adapting” the partition/cover of the earlier snapshot to the topology of the other one. In this way it is possible to uncover the correlation between the structures of the system at different time stamps.
We have shown the versatility of OSLOM by applying it to various networked datasets. OSLOM provides the first comprehensive toolbox for the analysis of community structure in graphs and is an ideal complement to existing tools for network analysis. The algorithm, with all its variants (including a fast two-step procedure for the analysis of very large networks), is implemented in a freely downloadable and documented software package (http://www.oslom.org).
Author Contributions
Conceived and designed the experiments: AL FR JJR SF. Performed the experiments: AL. Analyzed the data: AL. Contributed reagents/materials/analysis tools: AL FR JJR SF. Wrote the paper: AL FR JJR SF.
References
- 1. Albert R, Barabási AL (2002) Statistical mechanics of complex networks. Rev Mod Phys 74: 47–97.
- 2. Dorogovtsev SN, Mendes JFF (2002) Evolution of networks. Adv Phys 51: 1079–1187.
- 3. Newman MEJ (2003) The structure and function of complex networks. SIAM Rev 45: 167–256.
- 4. Pastor-Satorras R, Vespignani A (2004) Evolution and Structure of the Internet: A Statistical Physics Approach. New York, NY, USA: Cambridge University Press.
- 5. Boccaletti S, Latora V, Moreno Y, Chavez M, Hwang DU (2006) Complex networks: Structure and dynamics. Phys Rep 424: 175–308.
- 6. Barrat A, Barthélemy M, Vespignani A (2008) Dynamical processes on complex networks. Cambridge, UK: Cambridge University Press.
- 7. Caldarelli G (2007) Scale-free networks. Oxford, UK: Oxford University Press.
- 8. Girvan M, Newman ME (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99: 7821–7826.
- 9. Lusseau D, Newman MEJ (2004) Identifying the role that animals play in their social networks. Proc Royal Soc London B 271: S477–S481.
- 10. Adamic LA, Glance N (2005) The political blogosphere and the 2004 U.S. election: divided they blog. LinkKDD '05: Proceedings of the 3rd international workshop on Link discovery. New York, NY, USA: ACM Press. pp. 36–43.
- 11. Flake GW, Lawrence S, Lee Giles C, Coetzee FM (2002) Self-organization and identification of web communities. IEEE Computer 35: 66–71.
- 12. Pimm SL (1979) The structure of food webs. Theor Popul Biol 16: 144–158.
- 13. Krause AE, Frank KA, Mason DM, Ulanowicz RE, Taylor WW (2003) Compartments revealed in food-web structure. Nature 426: 282–285.
- 14. Jonsson PF, Cavanna T, Zicha D, Bates PA (2006) Cluster analysis of networks generated through homology: automatic identification of important protein communities involved in cancer metastasis. BMC Bioinf 7: 2.
- 15. Holme P, Huss M, Jeong H (2003) Subnetwork hierarchies of biochemical pathways. Bioinformatics 19: 532–538.
- 16. Guimerà R, Amaral LAN (2005) Functional cartography of complex metabolic networks. Nature 433: 895–900.
- 17. Fortunato S (2010) Community detection in graphs. Physics Reports 486: 75–174.
- 18. Baumes J, Goldberg MK, Krishnamoorthy MS, Ismail MM, Preston N (2005) Finding communities by clustering a graph into overlapping subgraphs. In: Guimaraes N, Isaias PT, editors. IADIS AC. IADIS. pp. 97–104.
- 19. Palla G, Derényi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435: 814–818.
- 20. Zhang S, Wang RS, Zhang XS (2007) Identification of overlapping community structure in complex networks using fuzzy c-means clustering. Physica A 374: 483–490.
- 21. Gregory S (2007) An algorithm to find overlapping community structure in networks. Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2007). Berlin, Germany: Springer-Verlag. pp. 91–102.
- 22. Nepusz T, Petróczi A, Négyessy L, Bazsó F (2008) Fuzzy communities and the concept of bridgeness in complex networks. Phys Rev E 77: 016107.
- 23. Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11: 033015.
- 24. Evans TS, Lambiotte R (2009) Line graphs, link partitions, and overlapping communities. Phys Rev E 80: 016105.
- 25. Kovács IA, Palotai R, Szalay MS, Csermely P (2010) Community landscapes: An integrative approach to determine overlapping network module hierarchy, identify key nodes and predict network dynamics. PLoS ONE 5: e12528.
- 26. Simon H (1962) The architecture of complexity. Proc Am Phil Soc 106: 467–482.
- 27. Sales-Pardo M, Guimerà R, Moreira AA, Amaral LAN (2007) Extracting the hierarchical organization of complex systems. Proc Natl Acad Sci USA 104: 15224–15229.
- 28. Clauset A, Moore C, Newman MEJ (2007) Structural Inference of Hierarchies in Networks. In: Airoldi EM, Blei DM, Fienberg SE, Goldenberg A, Xing EP, et al., editors. Statistical Network Analysis: Models, Issues, and New Directions. Berlin, Germany: Springer, volume 4503 of Lect. Notes Comp. Sci. pp. 1–13.
- 29. Clauset A, Moore C, Newman MEJ (2008) Hierarchical structure and the prediction of missing links in networks. Nature 453: 98–101.
- 30. Bianconi G, Pin P, Marsili M (2009) Assessing the relevance of node features for network structure. Proc Natl Acad Sci USA 106: 11433–11438.
- 31. Lancichinetti A, Radicchi F, Ramasco JJ (2010) Statistical significance of communities in networks. Phys Rev E 81: 046110.
- 32. Hopcroft J, Khan O, Kulis B, Selman B (2004) Tracking evolving communities in large linked networks. Proc Natl Acad Sci USA 101: 5249–5253.
- 33. Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM. pp. 44–54.
- 34. Chakrabarti D, Kumar R, Tomkins A (2006) Evolutionary clustering. KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM. pp. 554–560.
- 35. Palla G, Barabási AL, Vicsek T (2007) Quantifying social group evolution. Nature 446: 664–667.
- 36. Asur S, Parthasarathy S, Ucar D (2007) An event-based framework for characterizing the evolutionary behavior of interaction graphs. KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM. pp. 913–921.
- 37. Mucha PJ, Richardson T, Macon K, Porter MA, Onnela J (2010) Community Structure in Time-Dependent, Multiscale, and Multiplex Networks. Science 328: 876.
- 38. Radicchi F, Lancichinetti A, Ramasco JJ (2010) Combinatorial approach to modularity. Phys Rev E 82: 026102.
- 39. Molloy M, Reed B (1995) A critical point for random graphs with a given degree sequence. Random Struct Algor 6: 161–179.
- 40. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69: 026113.
- 41. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78: 046110.
- 42. Lancichinetti A, Fortunato S (2009) Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys Rev E 80: 016118.
- 43. Condon A, Karp RM (2001) Algorithms for graph partitioning on the planted partition model. Random Struct Algor 18: 116–140.
- 44. Albert R, Jeong H, Barabási AL (2000) Error and attack tolerance of complex networks. Nature 406: 378–382.
- 45. Newman MEJ (2004) Detecting community structure in networks. Eur Phys J B 38: 321–330.
- 46. Radicchi F, Castellano C, Cecconi F, Loreto V, Parisi D (2004) Defining and identifying communities in networks. Proc Natl Acad Sci USA 101: 2658–2663.
- 47. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70: 066111.
- 48. Lancichinetti A, Kivelä M, Saramäki J, Fortunato S (2010) Characterizing the community structure of complex networks. PLoS ONE 5: e11976.
- 49. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci USA 105: 1118–1123.
- 50. Lancichinetti A, Fortunato S (2009) Community detection algorithms: A comparative analysis. Phys Rev E 80: 056117.
- 51. Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech P09008.
- 52. Gregory S (2010) Finding overlapping communities in networks by label propagation. New Journal of Physics 12: 103018.
- 53. Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76: 036106.
- 54. McDaid A, Hurley NJ (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. ASONAM 2010.
- 55. Nowicki K, Snijders TAB (2001) Estimation and Prediction for Stochastic Blockstructures. J Am Stat Assoc 96:
- 56. Rosvall M, Bergstrom CT (2010) Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. Eprint arXiv:1010.0431.
- 57. Erdös P, Rényi A (1959) On random graphs. I. Publ Math Debrecen 6: 290–297.
- 58. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286: 509–512.
- 59. Tan PN, Steinbach M, Kumar V (2005) Introduction to Data Mining. New York, USA: Addison Wesley, 1st edition.
- 60. Nelson DL, McEvoy CL, Schreiber TA (1998) The University of South Florida word association, rhyme, and word fragment norms.
- 61. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech P10008.
- 62. Leung IXY, Hui P, Liò P, Crowcroft J (2009) Towards real-time community detection in large networks. Phys Rev E 79: 066107.
- 63. Newman MEJ (2006) From the Cover: Modularity and community structure in networks. Proc Natl Acad Sci USA 103: 8577–8582.
- 64. Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Natl Acad Sci USA 104: 36–41.
- 65. Guimerà R, Sales-Pardo M, Amaral LA (2004) Modularity from fluctuations in random graphs and complex networks. Phys Rev E 70: 025101 (R).
- 66. Reichardt J, Bornholdt S (2006) When are networks truly modular? Physica D 224: 20–26.
- 67. Reichardt J, Bornholdt S (2007) Partitioning and modularity of graphs with arbitrary degree distribution. Phys Rev E 76: 015102 (R).
- 68. Good BH, de Montjoye YA, Clauset A (2010) Performance of modularity maximization in practical contexts. Phys Rev E 81: 046106.