Keywords
Disease Module Identification, Biological Networks, Community Detection, GWAS, Modularity, Conductance, PPI, Co-expression
This article is included in the Bioinformatics gateway.
Disease Module Identification, Biological Networks, Community Detection, GWAS, Modularity, Conductance, PPI, Co-expression
A variety of genomic data has been used to construct biological networks. Biological networks are scale free by nature1 and it is well-known that scale-free networks exhibit community like structure2–5. Community like structure in networks is equivalent to presence of a high degree of modularity5. In biological networks, the modules often comprise of genes or proteins that are involved in the same biological functions. Network module identification methods, commonly known as community detection4–9 and graph partitioning methods10–12, attempt to reveal these functional units2,13,14 which is key to derive biological insights from genomic networks15–18. However, the performance of different community detection methods using diverse parameter settings to uncover biologically relevant modules in myriad networks remain poorly understood because there has been no community effort to transparently evaluate module identification methods on common benchmarks and across diverse types of genomic networks. Thus, it is very difficult to objectively compare the strengths and limitations of alternative approaches. Evaluation of module identification methods typically relied either on random graphs13, which do not allow for assessment of biological relevance of modules, or on pre-annotated functional gene sets18 (e.g., gene ontology or molecular pathway databases such as KEGG), which are still primarily incomplete and biased towards well-studied pathways.
To address these issues, an open community DREAM challenge enabling comprehensive and rigorous assessment of module identification methods across a broad range of gene and protein networks was initiated. The task in sub-challenge 1 was to identify functional modules in 6 individual benchmark networks s.t. the module size satisfied the constraint: 3 ≤ modul esize ≤ 100. The predicted modules were evaluated based on data from disease-relevant genome-wide association studies (GWAS). GWAS have successfully identified thousands of genetic loci associated with a broad range of complex traits and diseases. The variants are mapped to genes allowing to ask whether specific network modules are enriched in these genes19. The DREAM challenge organizers employed a comprehensive collection of over 200 GWAS datasets, thereby, covering a broad spectrum of functional units, many of which have not been annotated previously.
In this paper, we focus on sub-challenge 1 where the goal is to predict functional modules for individual anonymized networks across a broad range of gene and protein networks. Our proposed pipeline requires a hierarchical tree from any state-of-the-art hierarchical community detection technique as input. The pipeline first identifies the optimal level of hierarchy using an F-score comprising of quality metrics like conductance13, modularity2 and connectivity1. Then it traverses the hierarchy bottom-up from the optimal level allowing to merge smaller communities based on the weighted connectivity criterion as long as they fit the size constraint. Further, it splits giant connected components (modulesize > 100).
For each giant connected component, we re-build the hierarchical tree using a linkage based agglomerative hierarchical technique and identify the optimal cut (number of clusters k) using the proposed F-score criterion. Finally, we propose a metric to indicate the confidence in each module among the final set of detected modules and develop a method to automatically select the right confidence threshold to prune less meaningful modules. Figure 1 depicts the proposed pipeline for the constrained disease module identification problem.
The disease module identification methods were evaluated using 6 benchmark networks. Details of the networks are provided in Table 1.
There are several preprocessing steps performed before the input network can be processed by the pipeline. The node IDs are mapped to a continuous set of integers starting from 1. If the aforementioned procedure is not performed, the network will end up with several isolated nodes and missing IDs. All the edge-weights in each network are normalized between 0 and 1. The input network are considered weighted and undirected in all our pipeline.
We experimented with removal of edges with a weight lower than a threshold t = 0.05 but observed that the corresponding results deteriorated. Hence, we recommend keeping all the edges in the network.
In the initial submission rounds, we ran several out-ofthe-box state-of-the-art community detection techniques including Order Statistics Local Optimization Method (OSLOM)4, Louvain5, Multi-level Hieararical Kernel Spectral Clustering (MHKSC)6,7, Dynamic Tree Cut20 and METIS10. We also tried to use the results obtained from these methods as input to consensus clustering based method PCAgglo21 and ensemble clustering based method Ensemble-Clue22 which are evaluated using complex traits and disease modules in 76 European GWAS datasets.
OSLOM is based on the local optimization of a fitness function expressing the statistical significance of communities with regards to random fluctuations, which is estimated with tools of extreme and order Statistics. The Louvain method is a greedy optimization method that attempts to optimize the modularity of a network partition. MHKSC technique uses a kernel spectral clustering formulation to random walk and exploits the structure of the projections in the eigenspace to automatically determine a set of increasing distance thresholds. It then uses these distance thresholds in a test phase to obtain multiple levels of hierarchy using principles of agglomerative hierarchical clustering. Dynamic Tree Cut method implements novel dynamic branch cutting technique for hierarchical clustering where it detects clusters in a dendogram depending on their shape. They are capable of identifying nested clusters, can identify clusters of various shape and are suitable for automation. METIS is a set of serial programs for multilevel recursive partitioning of the graph to produce fill reducing orderings for sparse matrices. PCAgglo performs logistic PCA on the concatenated node membership matrix formed from k different methods and then agglomerative hierarchical clustering is performed on the principal components. For METIS, Dynamic Tree Cut, PCAgglo and Ensemble-Clue, we selected that level of hierarchy for which the average module size was close to the best as per the exploratory data analysis provided by the DREAM Challenge organizers. The results that we obtained by direct application of out-ofthe-box state-of-the-art community detection methods is depicted in Table 2.
The Best of All result were not submitted during the preliminary rounds of the Challenge because the Best of All method depicts the maximum number of enriched modules that can be identified by a simple ‘max’ combination of these techniques at default settings. However, as per our understanding the goal of the challenge is to develop a method or a generic framework which can optimally identify disease modules from various gene and protein interaction networks at different parameter settings. We gained several insights from these preliminary results including:
Methods like OSLOM, MHKSC and PCAgglo generated a set of clusters whose cluster size distribution is nearly power law.
For most of these methods there were several giant connected components which were ignored due to the strict upper bound constraint on the module size.
For most of these methods nearly half of the nodes in each network were part of giant connected components that were removed due to size constraint.
METIS generated uniform sized clusters and included most of the nodes in each network, hence can’t be optimized further.
We didn’t get more enriched modules from a consensus (PCAgglo) or ensemble (ensemble-clue) based clustering methods.
Let G(V, E) be an undirected graph with n = |V| representing number of nodes and m = |E| representing number of edges. Let S be the set of modules (or a partition of the network), where ns is the number of nodes in a module s ∈ S; ms be the number of edges in s i.e. ms = |(u, v) ∈ E : u ∈ s, v ∈ s| and cs be the number of edges on the boundary of s i.e. cs = |(u, v) ∈ E : u ∈ s, v ∉ s| and d(u) is the degree of node u.
We provide summary of quality metrics used and definition of proposed quality metrics below:
1. Modularity: Modularity is a global metric which takes value between −1 and 1. It measures the density of links inside communities compared to links between communities. For a weighted graph, modularity of a network partition is defined as: where ms−E(ms) is the difference between ms, the number of edges between nodes in s and E(ms), the expected number of such edges in a random graph with identical degree sequence. Modularity value ≤ 0 indicate that the corresponding partition behaves worse than a random partition of the network. Modularity score can only be obtained for graph partitions.
2. Conductance: Conductance is a local quality metric which can defined for each individual community in the network and takes values between 0 and 0.5. It is defined as: It measures the fraction of total volume of edges associated with the nodes in a module s ∈ S pointing outside the cluster. Conductance for a partition S can be calculated by taking an average of the conductance values for all modules s ∈ S.
3. Connectivity: Connectivity is a sub-local quality metric which can be defined for each individual node in the network and can be averaged for all the nodes in a module s (considering connectivity to other nodes in the same community) to get local connectivity metric. It can be averaged for all the modules s ∈ S to obtain the global connectivity CN(S) for a partition S. It was used in 1 to evaluate whether genes perturbed by trait-associated variants are more densely interconnected than expected in complex diseases and generate connectivity enrichment curves. The connectivity matrix K is defined as: with where K is p-step random walk kernel used to define pairwise connectivity between the nodes, I is an identity matrix, W is the weighted adjacency matrix and D is the weighted diagonal degree matrix. We set p = 4 for our biological networks as it allows to capture all meaningful interactions for paths of length ≤ 4. The connectivity of a node i is estimated as connectivity of module s is and connectivity of partition S as is Here |⋅| represents the cardinality function.
4. F-score: We need a quality metric which evaluates the quality of a partition using modularity, conductance and connectivity. While modularity captures global information, conductance and connectivity can capture local information. The proposed quality metric is defined as:
Higher value of modularity indicates better quality clusters, lower value of conductance leads to good quality communities and higher value of connectivity indicate better quality of modules. We need to maximize modularity and connectivity while minimizing conductance. Hence, we take harmonic mean of modularity and connectivity in the denominator of F-score metric to give importance to both of the quality metrics. Thus, with conductance in the numerator, the minimum value of F-score corresponds to the partition S with best quality cluster. However, if modularity value is ≤ 0 then we set F-score to a very large value which depicts the poor quality of the partition.
5. Inverse Confidence: We need a metric to rank all the modules generated from the proposed framework. We first considered the average connectivity metric CN(s) for a community s. However, the connectivity criterion prefers smaller size modules which tend to be more cliquish than bigger modules. We also considered using the conductance CC(s) of a community s to rank all the modules in partition S. However, conductance value decreases as size of the community increases due to larger volume of the module (which is denominator of CC(⋅)). We propose an inverse confidence metric to rank all the communities in a partition S as: We utilized the Inverse Confidence metric in conjugation with modularity to remove out less meaningful communities as illustrated in Figure 2 and explained within the proposed framework. We finally convert the inverse confidence value of each module into a confidence score as: where the denominator is used for normalization.
We followed the steps indicated in Figure 1 to build the proposed framework for constrained disease module identification.
1. Given an input network we perform the preprocessing step to create a modified input network where the node IDs are monotonically increasing, edge weights are noramlized, and the network becomes weighted and undirected.
2. Run a state-of-the-art hierarchical community detection technique to generate the hierarchical tree structure.
3. Estimate quality of each level of hierarchy using modularity, conductance and connectivity.
4. Select that level of hierarchy for which the F-score is minimum.
5. For communities of size > 100 go to Step 2 until either the constraint exceeding communities cannot be split further or modularity of resulting cluster memberships becomes very poor.
6. In the merge step, we start with the partition (S) at the best level of hierarchy and traverse the hierarchical tree from that level in a bottom-up fashion. We iteratively merge those communities whose weighted mean connectivity score is less than the connectivity score for a module at next level of hierarchy where the module consists of those previous communities i.e.a. Here p an q are modules at level h − 1 and s is community at level h such that p, q ∈ s. This results in an intermediate partition set or a set of modules.
7. We then consider all the communities s s.t. ns > 100. For each such community s, we consider the sub-graph comprising only the nodes from that community. We transform the corresponding weighted adjacency matrix i.e. Ŵ = W(vi,vj), ∀vi, vj ∈ s into a distance matrix DŴ = 1 − Ŵ. We then build the agglomerative hierarchical tree using the linkage clustering with Ward’s distance.
8. For each community s (ns > 100), once we obtain the agglomerative hierarchical tree, we cut the tree for different values of k i.e. the number of clusters. We evaluate each such partition using the F-score and select that partition which has the minimum positive F-score.
9. Using Steps 6–7 on these bigger modules and the small size communities which satisfy the size constraint, we generate another set of intermediate clusters.
10. We rank this intermediate set of communities using the inverse confidence score i.e. IC(s), ∀s ∈ S. Lower inverse confidence corresponds to higher rank. We now remove all modules whose size exceeds the size constraint i.e. ns ≤ 3 and ns ≥ 100.
11. In this final step, we propose a mechanism to select the best set of modules for evaluation in an automated fashion independent of the network. We can calculate the maximum and minimum value of inverse confidence (IC) from the inverse confidence (IC) scores of all the communities in the intermediate partition S. We iteratively decrease the inverse confidence threshold from maximum to minimum thereby pruning clusters. At each such threshold, we calculate the modularity of the remaining set of partition using the subgraph corresponding to this partition S′ i.e. GS′. We select the threshold where the difference between Q(S′, θ) and Qprev is minimum i.e. argθ min|Q(S′, θ)−Qprev|. Here |⋅| represents the absolute value, Qprev is the modularity of the partition obtained at Step 2 and calculated in Step 3. For the final submission, we consider all the modules in the optimum partition i.e. s ∈ S′ obtained by pruning communities whose IC (s) ≥ θ .
For our final submission, we utilized the method which is the fastest and most suitable for hierarchical graph partitioning i.e. Louvain method5 as we were allowed to make only 1 submission. We formulated a recursive version of Louvain method where communities of size greater than 100 were recursively partitioned. We also designed a constraint satisfying version of MHKSC6,7 and compared its performance with the recursive Louvain method within the proposed generic framework. The evaluation criterion used in the Challenge was the total number of significant modules identified in the 6 benchmark networks on a hold-out set of 104 GWAS datasets at the false discovery rate (FDR) cut-off23 of 0.05 for multiple testing. We compare the results obtained from proposed generic framework using both the Louvain and MHKSC methods with the winners of the DREAM Challenge in Table 3.
Method | FDR Cutoff | N | ns | 1_ppi | 2_ppi | 3_signal | 4_coexpr | 5_cancer | 6_homology |
---|---|---|---|---|---|---|---|---|---|
Double Spectral Clustering* | 0.05 | 2407 | 60 | 16 | 13 | 9 | 12 | 5 | 5 |
Resolution Adjusted Clustering* | 0.05 | 2780 | 60 | 19 | 11 | 5 | 14 | 7 | 4 |
Constrained Louvain | 0.05 | 1965 | 42 | 12 | 3 | 7 | 15 | 5 | 0 |
Constrained MHKSC | 0.05 | 2108 | 37 | 5 | 3 | 4 | 18 | 4 | 3 |
From Table 3, we observe that the winners (Double Spectral Clustering and Resolution Adjusted Clustering) perform far better than Constrained Louvain method on the protein-protein interaction networks (Networks 1 and 2) and homology network (Network 6). However, for the signaling, co-expression and the cancer networks (Networks 3, 4 and 5), proposed Constrained Louvain method has comparable performance with the winners of the challenge. To gain a sense of the robustness of the ranking with respect to the final GWAS data, the challenge organizers sub-sampled the hold-out set by drawing 76 GWASs (same number as during the preliminary phase) out of the 104 GWAS datasets. They created 1, 000 subsamples of the hold-out set. The methods were then scored on each subsample (Sub-sampling was done here without replacement.)
The performance of each competing method t for a given network was compared to the highest scoring method across the sub-samples by the paired Bayes factor Bt i.e. the method with the highest score on this network in the hold-out set (all 104 GWASs) was defined as reference. The score ns(t, b) of method t in subsample b was thus compared with the score ns(ref , b) of the reference method in the same subsample b. The Bayes factor Bt is defined as the number of times the reference method outperforms method t, divided by the number of times method t outperforms or ties the reference method over all subsamples. Methods with Bt < 4 were considered a tie with the reference method (i.e., method t outperforms the reference in more than 1 out of 5 subsamples). For networks 3, 4 and 5, the Bayes factor of proposed Constrained Louvain method was less than 4. This indicates that the proposed generic framework, though not the winner, is useful, generic and robust enough for identification of statistically significant disease modules in biological networks.
With the availability of the de-anonymized version of the networks along with the scoring tools used during the competition, we were able to perform additional experiments for the Constrained Louvain method. After the challenge, we identified an error in labeling the nodes in the significant disease modules that we submitted for the homology network (Network 6) during the competition. After correcting the labeling error, we identified 2 significant disease modules from Network 6.
Moreover, we performed additional analysis using 5 different FDR cut-offs (multiple testing) for each of the 6 benchmark networks to obtain the trends in the number of significant disease modules identified by the proposed generic framework for these cut-offs. This result is depicted in Figure 3. The FDR cut-off used as evaluation criterion during the competition was 0.05.
The DREAM Challenge organizers made the GWAS datasets along with de-anonymized networks available to the challenge participants. This allowed us to further analyze our results. For each benchmark network, we identified the proteins or genes that make up the significant disease modules.
We investigated association of identified disease modules with disease/trait of the provided GWAS datasets. We used the official competition FDR cut-off of 0.05 as significance threshold to identify disease modules for each benchmark network. Table 4–Table 9 provides a detailed analysis of the significant modules and their corresponding associated disease (inferred from 104 hold-out GWAS datasets) for Networks 1,2,3,4,5 and 6 respectively. Each module is found to be associated with at least two GWAS datasets of the corresponding disease/trait. Moreover, many modules were found associated with multiple disease/trait of similar nature. For example, as shown in Table 4, module 19 in 1_ppi network is found to be associated with anthropometric traits. This indicates that the identified modules correspond to preserved biological functions of genes/proteins.
The Challenge datasets for registered participants are available at: https://www.synapse.org/#!Synapse:syn6156761/wiki/400659. Challenge documentation, including the detailed description of the Challenge design, overall results and scoring scripts can be found at: https://www.synapse.org/#!Synapse:syn6156761/wiki/400647.
Source code for the proposed framework is available from: https://github.com/raghvendra5688/DMI/tree/DMI_v1.0
The archived source code for the proposed framework along with a README file can be found at: https://doi.org/10.5281/zenodo.119742424.
We would like to thank Ms. Kanchan Karnani for helping out with preparing the flowchart (Figure 1).
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new method (or application) clearly explained?
Partly
Is the description of the method technically sound?
Partly
Are sufficient details provided to allow replication of the method development and its use by others?
Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Transcription regulation, network biology, cancer biology
Is the rationale for developing the new method (or application) clearly explained?
Partly
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No
Competing Interests: I note that I am co-founder and a board member of Sage Bionetworks, the institution that ran the challenge upon which the results of this paper are based.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 1 26 Mar 18 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)