Method Article

An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity

[version 1; peer review: 2 approved with reservations]
PUBLISHED 26 Mar 2018

This article is included in the Bioinformatics gateway.

Abstract

Disease processes are usually driven by several genes interacting in molecular modules or pathways. The identification of such modules in gene or protein networks is at the core of computational methods in biomedical research. In this context, the Disease Module Identification (DMI) DREAM Challenge was initiated as an effort to systematically assess module identification methods on a panel of 6 diverse genomic networks. In this paper, we propose a generic refinement method, based on merging and splitting the hierarchical tree obtained from any community detection technique, for constrained DMI in biological networks. The only constraint was that the size of each community lies in the range [3, 100]. We propose a novel model evaluation metric, called F-score, computed from several unsupervised quality metrics (modularity, conductance and connectivity) to determine the quality of a graph partition at a given level of hierarchy. We also propose a quality measure, namely Inverse Confidence, which ranks modules and prunes insignificant ones to obtain a curated list of candidate disease modules (DM) for a biological network. The predicted modules are evaluated on the basis of the total number of unique candidate modules that are associated with complex traits and diseases from over 200 genome-wide association study (GWAS) datasets. During the competition, we identified 42 modules, ranking 15th at the official false discovery rate (FDR) cut-off of 0.05 for identifying statistically significant DM in the 6 benchmark networks. However, for the more stringent FDR cut-offs of 0.025 and 0.01, the proposed method identified 31 (rank 9) and 16 DM (rank 10) respectively. In additional analysis, our proposed approach detected a total of 44 DM in the networks in comparison to 60 for the winner of the DREAM Challenge. Interestingly, for several individual benchmark networks, our performance was better than or competitive with the winner.

Keywords

Disease Module Identification, Biological Networks, Community Detection, GWAS, Modularity, Conductance, PPI, Co-expression

Motivation & background

A variety of genomic data has been used to construct biological networks. Biological networks are scale-free by nature [1], and it is well known that scale-free networks exhibit community-like structure [2–5]. Community-like structure in networks is equivalent to the presence of a high degree of modularity [5]. In biological networks, the modules often comprise genes or proteins that are involved in the same biological functions. Network module identification methods, commonly known as community detection [4–9] and graph partitioning methods [10–12], attempt to reveal these functional units [2,13,14], which is key to deriving biological insights from genomic networks [15–18]. However, the performance of different community detection methods, using diverse parameter settings, in uncovering biologically relevant modules in myriad networks remains poorly understood, because there has been no community effort to transparently evaluate module identification methods on common benchmarks and across diverse types of genomic networks. Thus, it is very difficult to objectively compare the strengths and limitations of alternative approaches. Evaluation of module identification methods has typically relied either on random graphs [13], which do not allow for assessment of the biological relevance of modules, or on pre-annotated functional gene sets [18] (e.g., gene ontology or molecular pathway databases such as KEGG), which are still largely incomplete and biased towards well-studied pathways.

To address these issues, an open community DREAM challenge enabling comprehensive and rigorous assessment of module identification methods across a broad range of gene and protein networks was initiated. The task in sub-challenge 1 was to identify functional modules in 6 individual benchmark networks such that the module size satisfied the constraint 3 ≤ module size ≤ 100. The predicted modules were evaluated based on data from disease-relevant genome-wide association studies (GWAS). GWAS have successfully identified thousands of genetic loci associated with a broad range of complex traits and diseases. The variants are mapped to genes, which makes it possible to ask whether specific network modules are enriched in these genes [19]. The DREAM challenge organizers employed a comprehensive collection of over 200 GWAS datasets, thereby covering a broad spectrum of functional units, many of which have not been annotated previously.

In this paper, we focus on sub-challenge 1, where the goal is to predict functional modules for individual anonymized networks across a broad range of gene and protein networks. Our proposed pipeline requires a hierarchical tree from any state-of-the-art hierarchical community detection technique as input. The pipeline first identifies the optimal level of hierarchy using an F-score composed of quality metrics, namely conductance [13], modularity [2] and connectivity [1]. It then traverses the hierarchy bottom-up from the optimal level, merging smaller communities based on the weighted connectivity criterion as long as they fit the size constraint. Further, it splits giant connected components (module size > 100).

For each giant connected component, we re-build the hierarchical tree using a linkage based agglomerative hierarchical technique and identify the optimal cut (number of clusters k) using the proposed F-score criterion. Finally, we propose a metric to indicate the confidence in each module among the final set of detected modules and develop a method to automatically select the right confidence threshold to prune less meaningful modules. Figure 1 depicts the proposed pipeline for the constrained disease module identification problem.


Figure 1. Steps involved in proposed generic constrained disease module identification framework.

Methods

Data

The disease module identification methods were evaluated using 6 benchmark networks. Details of the networks are provided in Table 1.

Table 1. Description of 6 benchmark networks used for evaluation.

| ID | Directed | #Nodes | #Edges | Type | Edge-Weight |
|---|---|---|---|---|---|
| 1_ppi | No | 17,397 | 2,232,405 | Protein-protein interaction network | Confidence score |
| 2_ppi | No | 12,420 | 397,309 | Protein-protein interaction network | Confidence score |
| 3_signal | Yes | 5,254 | 21,826 | Signaling network | Confidence score |
| 4_coexpr | No | 12,588 | 1,000,000 | Co-expression network | Correlation |
| 5_cancer | No | 14,679 | 1,000,000 | Connects genes essential for similar tumor types | Correlation |
| 6_homology | No | 10,405 | 4,223,606 | Connects genes that are evolutionarily related | Confidence score |

Preprocessing

Several preprocessing steps are performed before the input network can be processed by the pipeline. The node IDs are mapped to a contiguous set of integers starting from 1. If this mapping is not performed, the network ends up with several isolated nodes and missing IDs. All the edge weights in each network are normalized between 0 and 1. The input networks are treated as weighted and undirected throughout our pipeline.

We experimented with removal of edges with a weight lower than a threshold t = 0.05 but observed that the corresponding results deteriorated. Hence, we recommend keeping all the edges in the network.
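The preprocessing described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in the Challenge: the function name `preprocess` is hypothetical, weights are assumed non-negative, normalization is assumed to be division by the maximum weight, and reciprocal edges of a directed network are assumed to be collapsed by keeping the larger weight.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]


def preprocess(edges: Iterable[Edge]) -> Tuple[Dict[Hashable, int], List[Tuple[int, int, float]]]:
    """Relabel nodes as 1..n, symmetrize, and scale weights into [0, 1]."""
    ids: Dict[Hashable, int] = {}
    merged: Dict[Tuple[int, int], float] = {}
    for u, v, w in edges:
        # Assign contiguous integer IDs in order of first appearance.
        a = ids.setdefault(u, len(ids) + 1)
        b = ids.setdefault(v, len(ids) + 1)
        # Treat (u, v) and (v, u) as the same undirected edge.
        key = (min(a, b), max(a, b))
        # Assumption: keep the larger weight when both directions exist.
        merged[key] = max(merged.get(key, 0.0), w)
    if not merged:
        return ids, []
    # Assumption: normalize by the maximum weight (weights are non-negative).
    w_max = max(merged.values())
    scaled = [(a, b, w / w_max) for (a, b), w in sorted(merged.items())]
    return ids, scaled


ids, scaled = preprocess([("g7", "g2", 4.0), ("g2", "g7", 2.0), ("g2", "g9", 1.0)])
print(ids)
print(scaled)
```

Because every node that appears in an edge receives an ID, the relabelled network contains no gaps in the ID range and therefore no spurious isolated nodes.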

Preliminary experiments

In the initial submission rounds, we ran several out-of-the-box state-of-the-art community detection techniques including Order Statistics Local Optimization Method (OSLOM) [4], Louvain [5], Multi-level Hierarchical Kernel Spectral Clustering (MHKSC) [6,7], Dynamic Tree Cut [20] and METIS [10]. We also used the results obtained from these methods as input to the consensus clustering based method PCAgglo [21] and the ensemble clustering based method Ensemble-Clue [22]. All of these were evaluated using complex traits and disease modules in 76 European GWAS datasets.

OSLOM is based on the local optimization of a fitness function expressing the statistical significance of communities with respect to random fluctuations, which is estimated with tools of extreme and order statistics. The Louvain method is a greedy optimization method that attempts to optimize the modularity of a network partition. The MHKSC technique uses a kernel spectral clustering formulation of random walks and exploits the structure of the projections in the eigenspace to automatically determine a set of increasing distance thresholds. It then uses these distance thresholds in a test phase to obtain multiple levels of hierarchy using principles of agglomerative hierarchical clustering. The Dynamic Tree Cut method implements a dynamic branch cutting technique for hierarchical clustering in which clusters are detected in a dendrogram depending on their shape. It is capable of identifying nested clusters, can identify clusters of various shapes and is suitable for automation. METIS is a set of serial programs for multilevel recursive partitioning of graphs and for producing fill-reducing orderings for sparse matrices. PCAgglo performs logistic PCA on the concatenated node membership matrix formed from k different methods, and agglomerative hierarchical clustering is then performed on the principal components. For METIS, Dynamic Tree Cut, PCAgglo and Ensemble-Clue, we selected the level of hierarchy for which the average module size was close to the best as per the exploratory data analysis provided by the DREAM Challenge organizers. The results that we obtained by direct application of out-of-the-box state-of-the-art community detection methods are shown in Table 2.

Table 2. Preliminary results using several state-of-the-art hierarchical module identification techniques.

Comparison of several out-of-the-box community detection methods along with one consensus and one ensemble based clustering method for disease module identification on 6 different biological networks. Here N represents the total number of candidate disease modules and ns represents the number of significant/detected disease modules in the 76 genome-wide association study (GWAS) datasets. OSLOM - Order Statistics Local Optimization Method, MHKSC - Multi-level Hierarchical Kernel Spectral Clustering.

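| Method | N | ns | 1_ppi | 2_ppi | 3_signal | 4_coexpr | 5_cancer | 6_homology |
|---|---|---|---|---|---|---|---|---|
| OSLOM | 842 | 28 | 6 | 1 | 6 | 10 | 4 | 1 |
| Louvain | 833 | 29 | 6 | 5 | 7 | 1 | 5 | 5 |
| MHKSC | 707 | 31 | 11 | 3 | 3 | 5 | 6 | 3 |
| METIS | 1209 | 30 | 8 | 8 | 5 | 4 | 3 | 2 |
| Dynamic Tree Cut | 2118 | 24 | 9 | 4 | 6 | 3 | 0 | 2 |
| PCAgglo | 1803 | 24 | 8 | 3 | 2 | 1 | 4 | 6 |
| Ensemble-Clue | 756 | 21 | 9 | 4 | 3 | 3 | 1 | 1 |
| Best of All | - | 48 | 11 | 8 | 7 | 10 | 6 | 6 |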

Insights gained

The Best of All results were not submitted during the preliminary rounds of the Challenge, because Best of All only reports the maximum number of enriched modules that can be identified per network by a simple ‘max’ combination of these techniques at default settings. As per our understanding, the goal of the challenge is instead to develop a method or a generic framework which can optimally identify disease modules from various gene and protein interaction networks at different parameter settings. We gained several insights from these preliminary results, including:

  • Methods like OSLOM, MHKSC and PCAgglo generated a set of clusters whose cluster size distribution is nearly power law.

  • For most of these methods there were several giant connected components which were ignored due to the strict upper bound constraint on the module size.

  • For most of these methods nearly half of the nodes in each network were part of giant connected components that were removed due to size constraint.

  • METIS generated uniformly sized clusters and included most of the nodes in each network, and hence cannot be optimized further.

  • We did not obtain more enriched modules from the consensus (PCAgglo) or ensemble (Ensemble-Clue) based clustering methods.

Notations

Let $G(V, E)$ be an undirected graph with $n = |V|$ nodes and $m = |E|$ edges. Let $S$ be the set of modules (a partition of the network), where $n_s$ is the number of nodes in a module $s \in S$; $m_s$ is the number of edges in $s$, i.e. $m_s = |\{(u, v) \in E : u \in s, v \in s\}|$; $c_s$ is the number of edges on the boundary of $s$, i.e. $c_s = |\{(u, v) \in E : u \in s, v \notin s\}|$; and $d(u)$ is the degree of node $u$.
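As a small illustration of this notation, the following Python sketch counts $n_s$, $m_s$ and $c_s$ for one module from an undirected edge list. The function name `module_edge_counts` and the toy graph are hypothetical and serve only to make the definitions concrete.

```python
from typing import Hashable, Iterable, Tuple


def module_edge_counts(
    edges: Iterable[Tuple[Hashable, Hashable]], module: Iterable[Hashable]
) -> Tuple[int, int, int]:
    """Return (n_s, m_s, c_s) for one module of an undirected graph."""
    members = set(module)
    m_s = 0  # edges with both endpoints inside the module
    c_s = 0  # boundary edges with exactly one endpoint inside
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            m_s += 1
        elif inside == 1:
            c_s += 1
    return len(members), m_s, c_s


# Toy graph: a triangle {1, 2, 3} attached to a tail 3-4-5.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)]
print(module_edge_counts(toy_edges, {1, 2, 3}))  # (3, 3, 1)
```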

Quality metrics

We provide summary of quality metrics used and definition of proposed quality metrics below:

  • 1. Modularity: Modularity is a global metric which takes values between −1 and 1. It measures the density of links inside communities compared to links between communities. For a weighted graph, the modularity of a network partition is defined as $Q(S) = \frac{1}{4} \sum_{s \in S} (m_s - E(m_s))$, where $m_s - E(m_s)$ is the difference between $m_s$, the number of edges between nodes in $s$, and $E(m_s)$, the expected number of such edges in a random graph with an identical degree sequence. A modularity value ≤ 0 indicates that the corresponding partition is no better than a random partition of the network. A modularity score can only be obtained for graph partitions.

  • 2. Conductance: Conductance is a local quality metric which can be defined for each individual community in the network and takes values between 0 and 0.5. It is defined as $CC(s) = \frac{m_s}{2m_s + c_s}$. It measures the fraction of the total volume of edges associated with the nodes in a module $s \in S$ pointing outside the cluster. Conductance for a partition $S$ can be calculated by averaging the conductance values over all modules $s \in S$.

  • 3. Connectivity: Connectivity is a sub-local quality metric which can be defined for each individual node in the network and can be averaged over all the nodes in a module $s$ (considering connectivity to other nodes in the same community) to get a local connectivity metric. It can in turn be averaged over all the modules $s \in S$ to obtain the global connectivity $CN(S)$ for a partition $S$. It was used in [1] to evaluate whether genes perturbed by trait-associated variants are more densely interconnected than expected in complex diseases and to generate connectivity enrichment curves. The connectivity matrix $K$ is defined as $K = (I + \hat{W})^p$ with $\hat{W} = D^{-1/2} W D^{-1/2}$, where $K$ is the $p$-step random walk kernel used to define pairwise connectivity between the nodes, $I$ is the identity matrix, $W$ is the weighted adjacency matrix and $D$ is the weighted diagonal degree matrix. We set $p = 4$ for our biological networks as it captures all meaningful interactions for paths of length ≤ 4. The connectivity of a node $v_i$ is estimated as $CN(v_i) = \sum_j K_{ij}$, the connectivity of a module $s$ as $CN(s) = \frac{\sum_{i,j \in s} K_{ij}}{n_s^2}$, and the connectivity of a partition $S$ as $CN(S) = \frac{\sum_{s \in S} CN(s)}{|S|}$. Here $|\cdot|$ represents the cardinality function.

  • 4. F-score: We need a quality metric which evaluates the quality of a partition using modularity, conductance and connectivity. While modularity captures global information, conductance and connectivity capture local information. The proposed quality metric is defined as $F(S) = CC(S) \cdot \frac{Q(S) + CN(S)}{2\, Q(S)\, CN(S)}$.

    A higher value of modularity indicates better quality clusters, a lower value of conductance indicates good quality communities and a higher value of connectivity indicates better quality modules. We need to maximize modularity and connectivity while minimizing conductance. Hence, we take the harmonic mean of modularity and connectivity in the denominator of the F-score metric to give importance to both of these quality metrics. Thus, with conductance in the numerator, the minimum value of the F-score corresponds to the partition $S$ with the best quality clusters. However, if the modularity value is ≤ 0, we set the F-score to a very large value to reflect the poor quality of the partition.

  • 5. Inverse Confidence: We need a metric to rank all the modules generated from the proposed framework. We first considered the average connectivity metric $CN(s)$ for a community $s$. However, the connectivity criterion prefers smaller modules, which tend to be more cliquish than bigger modules. We also considered using the conductance $CC(s)$ of a community $s$ to rank all the modules in partition $S$. However, the conductance value decreases as the size of the community increases, due to the larger volume of the module (which is the denominator of $CC(\cdot)$). We propose an inverse confidence metric to rank all the communities in a partition $S$ as $IC(s) = \frac{CC(s)}{CN(s) \times n_s^2}$. We utilized the Inverse Confidence metric in conjunction with modularity to remove less meaningful communities, as illustrated in Figure 2 and explained within the proposed framework. We finally convert the inverse confidence value of each module into a confidence score as $Conf(s) = 1 - \frac{IC(s)}{\max_{s' \in S} IC(s')}$, where the denominator is used for normalization.
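The metrics in this list can be made concrete with a short, dependency-free Python sketch. It is an illustration rather than the authors' implementation: all function names are hypothetical, the adjacency matrix is a dense list of lists (only practical for toy graphs), modularity and conductance are taken as precomputed inputs, and the inverse confidence is read as $CC(s) / (CN(s) \times n_s^2)$, which is an assumption about how the fraction is grouped.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def _matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def connectivity_kernel(w: Matrix, p: int = 4) -> Matrix:
    """K = (I + D^{-1/2} W D^{-1/2})^p for a symmetric weighted adjacency W."""
    n = len(w)
    degree = [sum(row) for row in w]
    # Isolated nodes (degree 0) simply keep a zero row in the normalized matrix.
    inv_sqrt = [d ** -0.5 if d > 0 else 0.0 for d in degree]
    base = [
        [(1.0 if i == j else 0.0) + inv_sqrt[i] * w[i][j] * inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    k = base
    for _ in range(p - 1):
        k = _matmul(k, base)
    return k


def module_connectivity(k: Matrix, members: Sequence[int]) -> float:
    """CN(s): sum of K_ij over node pairs in the module, divided by n_s^2."""
    return sum(k[i][j] for i in members for j in members) / len(members) ** 2


def f_score(q: float, cc: float, cn: float, penalty: float = float("inf")) -> float:
    """Conductance divided by the harmonic mean of modularity and connectivity."""
    if q <= 0 or cn <= 0:
        return penalty  # poor partitions get a very large score
    return cc * (q + cn) / (2.0 * q * cn)


def inverse_confidence(cc: float, cn: float, n_s: int) -> float:
    # Assumed grouping: CC(s) / (CN(s) * n_s^2).
    return cc / (cn * n_s ** 2)


def confidence_scores(ic_values: Sequence[float]) -> List[float]:
    """Conf(s) = 1 - IC(s) / max IC, so the worst-ranked module scores 0."""
    top = max(ic_values)
    return [1.0 - v / top if top > 0 else 1.0 for v in ic_values]


k = connectivity_kernel([[0.0, 1.0], [1.0, 0.0]])
print(module_connectivity(k, [0, 1]))
print(f_score(q=0.5, cc=0.2, cn=0.5))
print(confidence_scores([1.0, 2.0, 4.0]))
```

Lower F-scores indicate better partitions, so a level of hierarchy would be chosen by taking the minimum of `f_score` over the candidate levels.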


Figure 2. Modularity values for different partitions obtained at various inverse confidence thresholds for network 3_signal.

The optimal inverse confidence threshold value is also highlighted.

Proposed generic framework

We followed the steps indicated in Figure 1 to build the proposed framework for constrained disease module identification.

  • 1. Given an input network, we perform the preprocessing step to create a modified input network in which the node IDs are monotonically increasing, the edge weights are normalized, and the network is weighted and undirected.

  • 2. Run a state-of-the-art hierarchical community detection technique to generate the hierarchical tree structure.

  • 3. Estimate quality of each level of hierarchy using modularity, conductance and connectivity.

  • 4. Select that level of hierarchy for which the F-score is minimum.

  • 5. For communities of size > 100, return to Step 2 until either the communities exceeding the size constraint cannot be split further or the modularity of the resulting cluster memberships becomes very poor.

  • 6. In the merge step, we start with the partition ($S$) at the best level of hierarchy and traverse the hierarchical tree from that level in a bottom-up fashion. We iteratively merge those communities whose weighted mean connectivity score is less than the connectivity score of the module at the next level of hierarchy that contains them. Here $p$ and $q$ are modules at level $h - 1$ and $s$ is the community at level $h$ such that $p, q \subset s$. This results in an intermediate partition, i.e. a set of modules.

  • 7. We then consider all the communities $s$ such that $n_s > 100$. For each such community $s$, we consider the sub-graph comprising only the nodes from that community. We transform the corresponding weighted adjacency matrix, i.e. $\hat{W} = W(v_i, v_j), \forall v_i, v_j \in s$, into a distance matrix $D_{\hat{W}} = 1 - \hat{W}$. We then build the agglomerative hierarchical tree using linkage clustering with Ward’s distance.

  • 8. For each community $s$ ($n_s > 100$), once we obtain the agglomerative hierarchical tree, we cut the tree for different values of $k$, i.e. the number of clusters. We evaluate each such partition using the F-score and select the partition which has the minimum positive F-score.

  • 9. Using Steps 6–7 on these bigger modules and the small size communities which satisfy the size constraint, we generate another set of intermediate clusters.

  • 10. We rank this intermediate set of communities using the inverse confidence score, i.e. $IC(s)$, $\forall s \in S$. Lower inverse confidence corresponds to higher rank. We then remove all modules that violate the size constraint, i.e. those with $n_s < 3$ or $n_s > 100$.

  • 11. In this final step, we propose a mechanism to select the best set of modules for evaluation in an automated fashion, independent of the network. We calculate the maximum and minimum inverse confidence (IC) scores over all the communities in the intermediate partition $S$. We iteratively decrease the inverse confidence threshold $\theta$ from the maximum to the minimum, thereby pruning clusters. At each threshold, we calculate the modularity of the remaining partition $S'$ using the corresponding subgraph $G_{S'}$. We select the threshold at which the difference between $Q(S', \theta)$ and $Q_{prev}$ is minimum, i.e. $\arg\min_{\theta} |Q(S', \theta) - Q_{prev}|$. Here $|\cdot|$ represents the absolute value and $Q_{prev}$ is the modularity of the partition obtained at Step 2 and calculated in Step 3. For the final submission, we consider all the modules in the optimum partition, i.e. $s \in S'$, obtained by pruning communities with $IC(s) \geq \theta$.
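Steps 10 and 11 can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted code: the names `within_size` and `select_ic_threshold` are hypothetical, the candidate thresholds are assumed to be the distinct IC values themselves (the text does not state a step size), and the modularity of a pruned partition is supplied by a caller-provided function `modularity_of`.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def within_size(sizes: Dict[Hashable, int], lo: int = 3, hi: int = 100) -> List[Hashable]:
    """Step 10: keep only modules whose size lies in [lo, hi]."""
    return [s for s, n_s in sizes.items() if lo <= n_s <= hi]


def select_ic_threshold(
    ic: Dict[Hashable, float],
    q_prev: float,
    modularity_of: Callable[[List[Hashable]], float],
) -> Tuple[float, List[Hashable]]:
    """Step 11: pick the threshold whose pruned partition has modularity closest to q_prev."""
    best = None
    # Assumption: sweep the distinct IC values from the largest to the smallest.
    for theta in sorted(set(ic.values()), reverse=True):
        kept = [s for s, value in ic.items() if value < theta]  # prune IC(s) >= theta
        if not kept:
            continue
        gap = abs(modularity_of(kept) - q_prev)
        if best is None or gap < best[0]:
            best = (gap, theta, kept)
    if best is None:
        raise ValueError("no threshold leaves any module")
    return best[1], best[2]


# Toy example with made-up modularity values for each surviving module set.
toy_ic = {"a": 0.1, "b": 0.4, "c": 0.9}
toy_q = {frozenset("ab"): 0.52, frozenset("a"): 0.30}
theta, kept = select_ic_threshold(toy_ic, 0.5, lambda mods: toy_q[frozenset(mods)])
print(theta, kept)
```

Passing the modularity computation in as a callback keeps the sketch independent of any particular graph library; in practice it would evaluate $Q$ on the subgraph induced by the kept modules.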

Results

For our final submission, we used the Louvain method [5], which is the fastest and most suitable method for hierarchical graph partitioning, as we were allowed to make only 1 submission. We formulated a recursive version of the Louvain method in which communities of size greater than 100 were recursively partitioned. We also designed a constraint-satisfying version of MHKSC [6,7] and compared its performance with the recursive Louvain method within the proposed generic framework. The evaluation criterion used in the Challenge was the total number of significant modules identified in the 6 benchmark networks on a hold-out set of 104 GWAS datasets at the false discovery rate (FDR) cut-off [23] of 0.05 for multiple testing. We compare the results obtained from the proposed generic framework using both the Louvain and MHKSC methods with the winners of the DREAM Challenge in Table 3.

Table 3. Final submission results comparing the winners with proposed generic framework.

Here the proposed generic frameworks are referred to as Constrained Louvain and Constrained Multi-level Hierarchical Kernel Spectral Clustering (MHKSC), and we use * to mark the winners of the competition. Here N represents the total number of candidate disease modules and ns represents the total number of significant disease modules identified in the 104 genome-wide association study (GWAS) datasets. In the final round of the challenge, we submitted the results corresponding to the Constrained Louvain method.

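| Method | FDR Cutoff | N | ns | 1_ppi | 2_ppi | 3_signal | 4_coexpr | 5_cancer | 6_homology |
|---|---|---|---|---|---|---|---|---|---|
| Double Spectral Clustering* | 0.05 | 2407 | 60 | 16 | 13 | 9 | 12 | 5 | 5 |
| Resolution Adjusted Clustering* | 0.05 | 2780 | 60 | 19 | 11 | 5 | 14 | 7 | 4 |
| Constrained Louvain | 0.05 | 1965 | 42 | 12 | 3 | 7 | 15 | 5 | 0 |
| Constrained MHKSC | 0.05 | 2108 | 37 | 5 | 3 | 4 | 18 | 4 | 3 |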

From Table 3, we observe that the winners (Double Spectral Clustering and Resolution Adjusted Clustering) perform far better than the Constrained Louvain method on the protein-protein interaction networks (Networks 1 and 2) and the homology network (Network 6). However, for the signaling, co-expression and cancer networks (Networks 3, 4 and 5), the proposed Constrained Louvain method has comparable performance with the winners of the challenge. To gain a sense of the robustness of the ranking with respect to the final GWAS data, the challenge organizers sub-sampled the hold-out set by drawing 76 GWASs (the same number as during the preliminary phase) out of the 104 GWAS datasets, without replacement. They created 1,000 subsamples of the hold-out set, and the methods were then scored on each subsample.

For a given network, the method with the highest score in the hold-out set (all 104 GWASs) was defined as the reference, and the performance of each competing method $t$ was compared to this reference across the subsamples using the paired Bayes factor $B_t$. The score $n_s(t, b)$ of method $t$ in subsample $b$ was thus compared with the score $n_s(ref, b)$ of the reference method in the same subsample $b$. The Bayes factor $B_t$ is defined as the number of times the reference method outperforms method $t$, divided by the number of times method $t$ outperforms or ties the reference method over all subsamples. Methods with $B_t < 4$ were considered a tie with the reference method (i.e., method $t$ outperforms or ties the reference in more than 1 out of 5 subsamples). For networks 3, 4 and 5, the Bayes factor of the proposed Constrained Louvain method was less than 4. This indicates that the proposed generic framework, though not the winner, is useful, generic and robust enough for identification of statistically significant disease modules in biological networks.
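The sub-sampling and the paired Bayes factor described above can be sketched in Python as follows. The function names and the toy scores are hypothetical, and the organizers' actual scoring code may differ; the sketch only mirrors the definitions given in the text.

```python
import random
from typing import List, Sequence


def subsample_indices(
    n_total: int = 104, n_draw: int = 76, n_subsamples: int = 1000, seed: int = 0
) -> List[List[int]]:
    """Draw GWAS index subsets without replacement within each subsample."""
    rng = random.Random(seed)
    return [rng.sample(range(n_total), n_draw) for _ in range(n_subsamples)]


def paired_bayes_factor(ref_scores: Sequence[int], method_scores: Sequence[int]) -> float:
    """B_t = #(reference beats t) / #(t beats or ties reference) over paired subsamples."""
    ref_wins = sum(r > t for r, t in zip(ref_scores, method_scores))
    t_not_worse = sum(t >= r for r, t in zip(ref_scores, method_scores))
    return float("inf") if t_not_worse == 0 else ref_wins / t_not_worse


# Toy scores over five subsamples: the method ties or wins in 2 of 5.
b_t = paired_bayes_factor([5, 5, 5, 5, 5], [4, 4, 4, 5, 6])
print(b_t, "tie with reference" if b_t < 4 else "reference is better")
```

Since the two counts add up to the number of subsamples, $B_t < 4$ is equivalent to the method matching or beating the reference in more than one fifth of the subsamples.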

With the availability of the de-anonymized version of the networks along with the scoring tools used during the competition, we were able to perform additional experiments for the Constrained Louvain method. After the challenge, we identified an error in labeling the nodes in the significant disease modules that we submitted for the homology network (Network 6) during the competition. After correcting the labeling error, we identified 2 significant disease modules from Network 6.

Moreover, we performed additional analysis using 5 different FDR cut-offs (multiple testing) for each of the 6 benchmark networks to obtain the trends in the number of significant disease modules identified by the proposed generic framework for these cut-offs. This result is depicted in Figure 3. The FDR cut-off used as evaluation criterion during the competition was 0.05.


Figure 3. Number of disease modules identified by Constrained Louvain method for different false discovery rate (FDR) cut-offs for 6 benchmark networks.

Discussion

The DREAM Challenge organizers made the GWAS datasets along with de-anonymized networks available to the challenge participants. This allowed us to further analyze our results. For each benchmark network, we identified the proteins or genes that make up the significant disease modules.

We investigated the association of the identified disease modules with the diseases/traits of the provided GWAS datasets. We used the official competition FDR cut-off of 0.05 as the significance threshold to identify disease modules for each benchmark network. Tables 4–9 provide a detailed analysis of the significant modules and their associated diseases (inferred from the 104 hold-out GWAS datasets) for Networks 1, 2, 3, 4, 5 and 6 respectively. Each module is associated with at least two GWAS datasets of the corresponding disease/trait. Moreover, many modules were found to be associated with multiple diseases/traits of a similar nature. For example, as shown in Table 4, module 19 in the 1_ppi network is associated with anthropometric traits. This indicates that the identified modules correspond to preserved biological functions of genes/proteins.

Table 4. Significant disease/trait modules identified for 1_ppi network by proposed Constrained Louvain method.

| Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins |
|---|---|---|
| 19 | Hip Circumference (3), Human Height (4), Waist Circumference (3) | C5orf24, LOX, FBN1, ADAMTSL2, ECM1, FBLN5, MFAP5, EFEMP2, MFAP3, ELN, LTBP2, FBN2, MFAP4, ADAMTSL5, PXDNL, MFAP2, FBLN1, PRSS35, LOXL1, FBLN2, EFEMP1, PXDN |
| 54 | Ulcerative Colitis (2) | AMIGO2, AMIGO1, AMIGO3 |
| 56 | Coronary Artery Disease (2) | C17orf103, ADAMTS5, CST2, ADAMTS7, DSCR8, POFUT2, B3GALTL, ADAMTSL1, ADAMTSL4, CFP |
| 57 | Lipid Levels (2) | NCAN, HPSE, APOE, B4GALNT4, IDUA, MSR1, CHST15, APOC4, NDST2, B3GAT3, LRPAP1, CHPF, TMCC3, B3GAT2, HS3ST6, CHSY1, B3GALT6, GPC2, CSPG4, CLEC2L, CSGALNACT1, GXYLT1, HS3ST2, VCAN, PLD5, GPC3, B3GAT1, SDC4, CHST3, APOA4, CHPF2, CSGALNACT2, APOC2, SLAMF7, LRP8, PON3, RBP1, LDLRAP1, KAL1, CHST13, GPR144, SLC35D2, B4GALT7, CHST11, CHST7, HS3ST4, HS3ST3B1, APOB, GPR111, NDST1, CC2D1A, LRP1, BCAN, CSPG5, XYLT2, DSE, LACRT, SDC3, NKG7, SDC1, HS6ST2, GLCE, GPC4, SNX17, TSPAN1, MTTP, HS6ST1, GPC5, ITGB1BP1, HS3ST1, HS3ST5, AGRN, IGSF9, HSPG2, SDC2, GPC1, B4GALNT3, EXT1, APOC3, CHSY3, CHST14, A2ML1, UST, MDK, GPR97, HPSE2, GPC6, HS3ST3A1, XYLT1, LRP2, PTN, TMCC2, LDLR, CHST12, EXT2, TLL2 |
| 145 | Coronary Artery Disease (2) | HSD3B1, STS, SOAT1, CYP27A1, CYP11A1, CYP11B2, SULT2B1, CYP3A7, DHCR24, LIPA, TM4SF4, CYP11B1, CH25H, CYP46A1, CYP7B1, SUSD4, SOAT2, SLC27A5, CYP1A2, HSD3B2, ALS2CR12, CYP7A1, CYP17A1, FDX1, FDX1L |
| 154 | Crohn’s Disease (2), Rheumatoid Arthritis (2), Ulcerative Colitis (2) | IL24, IL26, LEPR, IFNAR1, OSMR, IL22RA2, IL23A, RELB, IL28B, IL20RB, CSF2, IFNK, IL7, IL10, CBLC, IL21, IFNGR1, MPO, IL4, IL12RB1, IL19, TBX21, IL15RA, IL5RA, IL9R, IL2RA, SOCS4, SOCS7, CSH2, IL7R, IL3, IL28A, IFNAR2, EPO, EPOR, IL28RA, OSM, IL10RA, IL9, IL3RA, LEP, IL5, IL13, IFNW1, CTF1, IL13RA1, IL11, IFNA13, IL21R, IFNA1, JAK2, GH2, IL15, IL13RA2, IL10RB, CNTFR, IL20, CSF2RB, TSLP, IFNE, CSH1, IL12A, IL12RB2, CNTF, DOK2, IL2RG, CSF3, IL11RA, IFNGR2, CLCF1, LIFR, IL23R, IL6R, IL20RA, CSF2RA, IL2RB, IL29, IL6ST, IL2, CRLF2 |
| 157 | Ulcerative Colitis (2) | PDCD1LG2, CD3D, CLEC2D, HLA-DRB1, CD274, SLA2, TREML2, PDCD1 |
| 174 | Lipid Levels (2) | PVRL4, CD226, PVRL1, CADM3, PVRL2, TIGIT, PARD3, PVR, PVRL3, CD96 |
| 176 | Narcolepsy (2), Rheumatoid Arthritis (2), Ulcerative Colitis (2) | HLA-DQB1, HLA-DQA2, HLA-DMB, HLA-DPA1, HLA-DPB1, HLA-DRA, HLA-DQB2, HLA-DOA, MS4A1, HLA-DMA, HLA-DQA1, HLA-DOB |
| 184 | Lipid Levels (5) | TBL1X, NCOA1, HELZ2, TBL1XR1, NR1I3, GPS2, CARM1, POU1F1, G0S2, HMGCS1, GLIPR1, SMARCD3, NCOA6, MED1, PPARA, NCOA2, TGS1, CTGF, CHD9, HMGCR, PEX11A, SULT2A1, GRHL1, NR1I2, NRBF2, HMGCS2, FADS1, SREBF2, DLX2, PLIN2, CPT2, CPT1A, RGL1, APOA5, SLC27A1 |
| 211 | HbA1C (2) | SP110, RNF112, WDR73, TRIM73, SSRP1, TRIM17, CAPNS2, FN3K, HIST2H4A, ERMAP, PEF1, MLL5, GCA, ZNF618, TRIM69, TRIM60, ATAD2B, KDM5A, TRIM66, TRIM68, SUPT16H, HIST1H4K, DIDO1, TRIM72, SP140, FSD2, RFPL4B, SDR39U1, SLC20A1, TRIM58, HIST1H4I, SH3RF3, H3F3C, HIST2H3D, TRIM44, TRIM31, TRIM49B, TRIM51, WDR76, TRIM39-RPP21, TRIML1, HIST1H4B, TRIM43, KDM5D, TRIM49C, TRIM74, TRIM34, HIST2H3A, SLC20A2, HIST1H3G, TRIM4, HIST1H4G, NHLRC4, HIST1H3F, SP140L, RFPL1, CHD1, RNF39, PYROXD2, TRIM6-TRIM34, HIST1H3H, SRI, TRIM15, C16orf11, TRIM10, HIST2H3C, HIST1H3I, TRIM49, TRIM40, TRIM26, PHF20L1, RNF186, BRD4, TRIM64C, TRIM7, MEFV, TRIM52, HIST1H3B, RNF135, HAT1, SETD7, WDR59, ATAD2, KDM5C, FN3KRP, HIST1H4L, TRIM64B, TRIM48, TRIML2, HIST4H4, TRIM41, RFPL2 |
| 251 | Psychiatric Disorders (2) | DAPK2, MKNK2, CDKL1, MAPKAPK3, CAMK1G, TSSK4, CAMK1, CDK7, STK32C, STK32A, TSSK1B, NEK3, STK16, TSSK6, MKNK1, PKMYT1, NEK7, CAMK1D, SBK1, MOS, PIM2, CDK10, STK33, NUAK2, ITIH3, MAP2K2, CDK6, PTK6, PSKH1, CDKL2, PIM1, OXSR1, PIM3, STK17A, NEK6, NUAK1, PSKH2, ULK3, CDKL4, PDIK1L, PNCK, SBK2, STK17B, PAK2 |
| 252 | Lipid Levels (2) | ANKRD61, ANKRD65, ANKRD39, ASB11, ASB9, ASB12, ASB7, ACBD6, ASB1, RFXANK, ASB13, ASB14, ANKRD7, ANKRD49, ANKRD54, ASB4, ANKRD29, ASB3, CDKN2C, ANKRA2, CDKN2B, ANKRD30BL, ANKDD1A, ASB8, ANKRD16, OSTF1, ASB10, FANK1, ANKRD23, ANKRD44, CDKN2D, ANKRD1, ANKK1, ANKRD46, ANKRD22, ANKRD52, ASB5, GABPB1, BCL3, PPP1R27, NFKBID, ANKDD1B, ANKRD2 |

Table 5. Significant disease/trait modules identified for 2_ppi network by proposed Constrained Louvain method.

| Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins |
|---|---|---|
| 81 | Human Height (3), Waist Circumference (2) | NPPB, NPR1, NPR3, NPR2, NPPC, NPPA |
| 109 | Myocardial Infarction (2) | SMPD1, GALK2, C12orf4, SLC7A5, GDAP2, MRPS33, RAB23, C7orf43, C14orf50, DPEP2, CARS2, TMEM50A, SRFBP1, PLBD2, LANCL2, C4orf29, MPND |
| 144 | Narcolepsy (2), Rheumatoid Arthritis (2) | CD74, TRBV7-9, CD3D, TRDV2, HLA-DMA, HLA-DPB1, HLA-DQB1, HLA-DQA1, HLA-DPA1, TRAV29DV5, CD3E, HLA-DOA, CD3G, TRBV19, HLA-DRB4, HLA-E, TRAV8-4, TRBV12-2, TRAV19, CD247, HLA-DQA2 |

Table 6. Significant disease/trait modules identified for 3_signal network by proposed Constrained Louvain method.

| Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins |
|---|---|---|
| 1 | Coronary Artery Disease (2), Lipid Levels (9), Myocardial Infarction (2) | PCSK9, LDLR, APOB |
| 114 | BMI (2), Weight (2) | UBE3C, TLR2, TLR1, TLR10, SFTPA1, PSMD2, CYBB, NEU1, TLR6, DHX36, TRAP1 |
| 162 | Age-related Macular Degeneration (2) | LTB, LTBR, TNFSF14, TNFRSF14 |
| 258 | Age-related Macular Degeneration (2) | C3, CD46, C3AR1, CFB, CR1, CFI, CFH |
| 284 | BMI (2) | THPO, MPL, ATXN2L |
| 331 | Fasting Glucose (7) | GCK, GCKR, DUSP12 |
| 337 | BMI (2) | BCL2, BCL2L1, CISD2, TMBIM6, TP53AIP1, ITM2B, SPNS1, HRK, BCAP31 |

Table 7. Significant disease/trait modules identified for 4_coexpr network by proposed Constrained Louvain method.

Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins
3 | Hip Circumference(4), Neuroticism(2), Psychiatric Disorders(7), Waist Circumference(2) | HIST1H2BE, HIST1H2BC, HIST1H3G, HIST1H4A, FLJ13224, HIST1H3A, OR2B6, HIST1H2BG,
HIST1H2BH, HIST1H1D, HIST1H2BI, HIST1H2BF, HIST1H4H, HIST1H2AG, HIST1H2BB, HIST1H2BL,
HIST1H2BN, HIST1H3H, HIST1H4E, HIST1H2AJ, HIST1H2BM, HIST1H2BO, HIST1H2BJ, HIST1H4F,
HIST1H2AE, HIST1H2AK, HIST3H2A, HIST1H3J, HIST1H2AI, HIST2H2AA3, HIST1H4D, HIST2H2BE,
HIST1H4B
39 | Lipid Levels(4) | GLB1, C11orf75, VPS37C, CSF1R, TNFSF13, BLVRB, SH2B3, SCPEP1, NEU1, RNPEP, CFD, BLVRA,
MAN2B1, PION, M6PR, KIAA0930, PLEKHB2, NSMAF, FCGRT, LRPAP1, CAPG, SAMHD1, SIDT2,
MGAT1, FKBP15
48 | Psychiatric Disorders(2) | COCH, B4GALT6, TCEAL4, S100A6, CHL1, TSPO, PIK3R3, CXXC4, CAPN2, PIP5K1B, YES1, GAS2,
TRIP6, SLC20A1, TUBB2B, C18orf1, ATP2A3, GLDC, ANXA2P2, HMHB1, ATP8A1, C20orf103,
ACAP1, LRRC59, ENPP4, CTNNAL1, ADAM9, CD200, EMID1, GSTM3, VEGFA, LAPTM4B, LIMA1,
FXYD2, GGA2, S100A4, DAPK1, S100A11, ITGAV, PARM1, SIDT1, CYFIP1, MGAT3, CEBPB, REXO2,
KL, BAG2, IKZF2, WBP5, JAG1, QPRT, VAMP3, PLP2, NCALD, CNIH4, FCGR2B, MXRA7, ASCL1,
GNG7, TAX1BP3, PLS3, ARHGAP6, ANXA1, IPCEF1, REC8, BCL2, DMD, EPHB1, MT2A
53 | Crohn’s Disease(2), Rheumatoid Arthritis(3) | DNAH17, IL2RA, IL3RA, ICOSLG, COL9A2, CIITA, GRAP, IL21R, TEC, TNFRSF4, POU2F2,
TCL6, PAOX, IKZF3, STOM, BTK, LILRB1, LILRB4, ADAM28, KMO, SLC2A5, GPR65, SH2D3C,
ST6GALNAC4, CD86, SLC15A2, PCMT1, UGT2B17, ABCB4, PTPN7, GATM, PPFIBP2, DOK3, KLK2,
VNN2, ADAM29, TLR7, STS, BTN2A2, TNFSF11, HLA-DRB4, SH3D21, LY9, FGD2, GH1, PHACTR1,
HSPA1A, HCG26, ALOX5, LOC100505650, HLA-DOA, SCARF1, LTA, TTN, DNASE1L3, CNR1,
CXorf21, ZNF318, PDE6G, TNFRSF9, P2RY10, OAT, RASGRF1, IL7, AKAP5, IGHV5-78, CXCR5,
SLC9A7, PFN2, IRF5, TRAF3, TNFRSF13B
56 | Rheumatoid Arthritis(3) | FLT3LG, TXK, ACSL6, BACH2, ANKRD55, CDKN2D, CCR9, S1PR1, CCR5, HSPB1, ANGPT4,
LIME1, CD96, CD28, TNFRSF25, UBASH3A, GPR171, GPA33, CCR4, SIRPG, TNFSF8, XCL1, CD8B,
KLRD1, PVRIG, STAT4, TBX21, TRAV8-3, FASLG, TRD@, CD7, RORA, GFI1, CXCR6, SH2D1A,
LOC79015, CAMK4, LAG3, IL23A, LRRN3, SPINK2, TRAT1, KLRG1, IFNG, EMR1, SH2D2A, CD3G,
CHMP7, KCNA3, CD6, MGST3, GZMM, ICOS, CD5, SLAMF1, PTPN4, CCR8, PDCD1, TRBC2,
LOC100507397, RCAN3, TRBV10-2, OCM2, SIT1, PRKCQ
92 | Lipid Levels(5) | SNTA1, DEXI, KLHDC3, PHKB, EEF1A1, MID1IP1, SLC25A11, OGDH, VPS4A, GSTM2, MAP7D1,
SCN1B, CARM1, KEAP1, USO1, GSTM1, POLDIP2, PHKG1, VPS52, FAM89B, GPS2, TRIP10,
SLC2A4RG, FHOD1, CTDNEP1, ARL2, RNF123, UQCC, LRRC20
99 | Lipid Levels(4) | C9, F7, CRP, TMPRSS6, TAT, SLC17A2, HMGN2, LECT2, MASP2, C19orf80, TRMT5, LIPC, ABCG5,
APOF, SPP2, CFHR5, FGF21
104 | Lipid Levels(3) | CYP7A1, GCKR, CLEC4M, PKLR, CRYAA, PRG4, DDO, IGFALS, LPA, FTCD, FN3K, C14orf105,
SEC14L4, F13B, MASP1, CLDN16, CPN2, ART4, ADRA1A, FOLH1B, HGFAC, HAAO, FOLH1, MBL2,
SLC7A9, DNMT3L, MLXIPL, CA5A, ABCG2, FETUB, LPAL2, CYP3A43, CCL16, F11, GPER, SARDH,
HNF4A, GPLD1, CPS1-IT1, NAT8, SLC38A3, APOA4, ONECUT1, EPO, SHBG, HNF1A, SLC26A1,
MBNL3, UPB1, NR1I3, ALDOA, RHBG, PON1, CPN1, CCNI, CYP2C19, PROZ, TTPA
107 | Lipid Levels(5) | C3orf32, SERPINA6, ADH6, SULT2A1, SERPINA4, C4BPA, RGN, C8A, PLG, UGT2B4, SERPINF2,
PGC, SERPINA10, ITIH1, HPR, MTTP, PROC, ANGPTL3, AKR1D1, MAT1A, BHMT
109 | Psychiatric Disorders(2) | ENDOU, IL37, WNT3, DNASE1L2, KCNK7, KRTAP2-4, KRTAP9-9, KRT83, KRTAP1-3, BPY2, KRT35
126 | Leptin(2) | HSPB7, CCDC48, HSPB2, BAALC, CSPG4, SLC16A4, MAP1A, SGCA, CSDC2, DNAJB5, NFASC,
FHL5, PLEKHA4, STK32B, DAAM2, TRO, SPEG, ADAMTSL3, TMEM100, CLIP3, CACNA1C, TBX5,
GPC4, SLC26A10, GREM2, LTBP4, C8orf84, RRAD, EMILIN1, RAB23, HSPB6, HSPA12A, C7orf58,
TACR2, ADAMTS8, ITGA1, CYTL1, SLC2A10, SCN7A, ARHGAP24, GPM6A, PRKG1, RAB40A,
NBLA00301, SCRG1, HSPB3, SNAI1, AGTR2, IL17B, BEX1, SGCD, PER1, PKNOX2, CHRM2, FGF7,
PDE5A, SMAD9, ENOX1, PGM5, NDNF, HPSE2, ARNT2
135 | Waist Circumference(3) | TGFB1I1, FBLN1, TJP1, AXL, CAV2, COL5A1, TIMP3, FBN1, WWTR1, TPM2, UBE4A, LOXL2,
OLFML3, FAP, PCOLCE, NUPR1, CTGF, LTBP2, SEPT10, MFAP2, TNC, FN1, PRSS23, PXDN,
CDKN1A, CALD1, NID1, TMEM47, LOXL1, MRC2, PPAP2B, FBLN5, PPIC, IL1R1, LARGE, MYO1B,
LHFP, MYL9, NID2, LOX, FLRT2, RASL12, C6orf145, OLFML2A, SNAI2, LAMB1, THBS1, PPAP2A,
EFEMP1, DSE, ENAH, MAP1B, IGFBP3, DKK3, F2R, ADAMTS1, FERMT2, ARHGAP29, CDH11,
MYLK, MYOF, COL1A1, NNMT, COL5A2
138 | Narcolepsy(2), Rheumatoid Arthritis(3) | HLA-E, CTSC, KRT19, LAP3, LYZ, HLA-G, HCLS1, LCP1, HLA-DPA1, UCP2, TAPBP, RAC2, HLA-B,
GRB10, LYN, SH3BGRL, EIF2C2, LIPA, GRB14, CD74, CNDP2, HLA-F, LAPTM5, MYD88, DLG5,
HLA-DRB1, HLA-C, TRIM22, HLA-DPB1, CD53, SRGN, HLA-DMA, LGMN, IFI30
184 | Obesity(2) | CINP, NDOR1, FAM158A, ZMAT5, HLCS, SURF2, KCTD2, LIN37, TELO2, C4orf10, ZNF408, CCDC22,
COQ6, BAD, C17orf59, RNF25, LIN7B, TBL3, TUG1, RPS6KB2, C21orf2, PIGH, SART1, BRF1,
TMEM110, AAGAB, AZI1, SSSCA1, ZNHIT2, NUDT2, PGP, TMEM104, ROM1, ARMC7, MKL1, AKIP1,
SUGP1, GTF3C5, E4F1, PPP2R2D, C2CD2L, ETV2, NADSYN1, NUBP2, LOC100129250, C11orf51,
WDR25, GPKOW, KCTD17, TMED1, BCL7C, THAP7, NOC4L, TBCD, EXOC3, GNB1L, FAM3A,
KLHDC4, NKIRAS2, OPHN1, PIN1, FAU, SNAP29, COMMD9, PUM2, C17orf90, FAM3C, C16orf42,
SHARPIN, BNIP1, TXNRD2, PIN1P1, ZNF839, CCDC101, DHRS7B, PANK3, PRMT7, WDR13, DDX49,
TMEM11, ASPSCR1, TSR2, ZFPL1
187 | Crohn’s Disease(2) | ARFGAP1, PCGF3, TAF1C, RTEL1, MUS81, BRD9, CDK10, SH3BP2, INPP5E, C19orf54, ABCC10,
SPG7, MAN1B1, DOM3Z, RAD9A, CEP164, NFRKB, MST1, CLASRP, NELF, TJAP1, ASXL1, SLC35C2,
TXLNA, PLXNA3, SZT2, SFI1, ATG4B, ASAH1, INPPL1, FAM193B, CUL9, APBA3, RHOT2, SKIV2L,
MDC1, RBM14, PAQR6, SLC26A6, FAM193A, PIGO, HTT, MOGS, C9orf86, MFSD10

Table 8. Significant disease/trait modules identified for the 5_cancer network by the proposed Constrained Louvain method.

Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins
107 | Neuroticism(2) | ARSH, ATP2B4, CTSB, MYT1L, MMP21, LIM2, DCSTAMP, KCNT2, ZNF442, T, POC1B-GALNT4,
EXOC4, NFATC2, NOD2, SIM1, MYLIP, PGK2, C1orf109, FAAH2, ZNF436, TCF4, NEK1, CLIC2,
TMEM206, DIAPH1, CYP4F22, MNDA, CLDN17, GALNT4, HELT, GNPAT, CNGA2, ZGPAT, PPP1R16A,
HIVEP1, CSNK2A3, UQCRBP1, RUNX1T1, CYP24A1, ENPP6
181 | Hip Circumference(2) | NT5C1B-RDH14, ALDH8A1, ZNF155, GNA12, B3GAT2, SULT1B1, HHIP, AAGAB, TMEM119, RDH14,
DNAJA1
211 | BMI(2), Obesity(2) | TACC2, BDNF, SLC16A3, AKAP10, PLS3, FAM19A3, PABPC5
656 | Neuroticism(2) | LRRC37A4P, INPP5B, LRRC37A, VAMP3, LRRC37A2, LRRC37A3
666 | BMI(4), Waist Circumference(2) | SULT1A3, SULT1A2, SULT1A1, SLX1A, SLX1A-SULT1A3, SLX1B, SULT1A4, SLX1B-SULT1A4

Table 9. Significant disease/trait modules identified for the 6_homology network by the proposed Constrained Louvain method after the challenge.

Module Id | Disease/Trait (Number of GWAS datasets) | Genes/Proteins
105 | Coronary Artery Disease(2) | RNASE7, KDM2A, SLC24A1, ESRP2, ASNS, MARCH9, BZW1, EDDM3B, VPS13D, APPBP2,
NAA16, THAP4, VWA3B, CYP3A43, FABP2
198 | Lipid Levels(4) | SOCS3, NUDT19, AKR1C1, ZNF714, IGFBP6, CAT, ARHGAP32, PITPNC1, NFKBIL1, MAD2L2,
EIF1B, LPO, ZNF620, TMEM204, DAND5, ARHGAP25, KRCC1, SP1, FEZF1, LMNTD2, OOSP1,
TMED1, HOXA4, SLC36A4, FAM71F2, ASPM, FBXL20, OR5I1, HBG1, SFTPC, APOC4-APOC2,
HEXB, ZNF521, TRIM56, CHPT1, IFT20, MLXIP, AJUBA, IDE, GMIP

Data availability

The Challenge datasets for registered participants are available at: https://www.synapse.org/#!Synapse:syn6156761/wiki/400659. Challenge documentation, including the detailed description of the Challenge design, overall results and scoring scripts can be found at: https://www.synapse.org/#!Synapse:syn6156761/wiki/400647.

Source code for the proposed framework is available from: https://github.com/raghvendra5688/DMI/tree/DMI_v1.0

The archived source code for the proposed framework, along with a README file, can be found at: https://doi.org/10.5281/zenodo.1197424 [24].

How to cite this article
Mall R, Ullah E, Kunji K et al. An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity [version 1; peer review: 2 approved with reservations] F1000Research 2018, 7:378 (https://doi.org/10.12688/f1000research.14258.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to reviewer statuses:
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 08 Jun 2018
Yunpeng Liu, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA 
Approved with Reservations
In this paper, the authors describe a new pipeline for identifying disease modules from large-scale biological networks in the DREAM challenge. The pipeline builds upon off-the-shelf hierarchical community detection methods and first generates an initial partitioning of the network using ...
HOW TO CITE THIS REPORT
Liu Y. Reviewer Report For: An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity [version 1; peer review: 2 approved with reservations]. F1000Research 2018, 7:378 (https://doi.org/10.5256/f1000research.15518.r34227)
Reviewer Report 08 May 2018
Eric E Schadt, Department of Genetics & Genomic Sciences, Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA 
Approved with Reservations
In their paper, “An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity”, Mall et al present an approach to identifying modules of genes from different types of networks, where their ...
HOW TO CITE THIS REPORT
Schadt EE. Reviewer Report For: An unsupervised disease module identification technique in biological networks using novel quality metric based on connectivity, conductance and modularity [version 1; peer review: 2 approved with reservations]. F1000Research 2018, 7:378 (https://doi.org/10.5256/f1000research.15518.r32972)