BTW 151
doi: 10.1093/bioinformatics/btw151
Advance Access Publication Date: 19 March 2016
Original Paper
Abstract
Motivation: Analysis of co-expressed gene sets typically involves testing for enrichment of
different annotations or ‘properties’ such as biological processes, pathways, transcription factor
binding sites, etc., one property at a time. This common approach ignores any known relationships
among the properties or the genes themselves. It is believed that known biological relationships
among genes and their many properties may be exploited to more accurately reveal commonalities of a
gene set. Previous work has sought to achieve this by building biological networks that combine
multiple types of gene–gene or gene–property relationships, and performing network analysis to
identify other genes and properties most relevant to a given gene set. Most existing network-based
approaches for recognizing genes or annotations relevant to a given gene set collapse information
about different properties to simplify (homogenize) the networks.
Results: We present a network-based method for ranking genes or properties related to a given
gene set. Such related genes or properties are identified from among the nodes of a large,
heterogeneous network of biological information. Our method involves a random walk with restarts,
performed on an initial network with multiple node and edge types that preserve more of the
original, specific property information than current methods that operate on homogeneous networks.
In this first stage of our algorithm, we find the properties that are the most relevant to the
given gene set and extract a subnetwork of the original network, comprising only these relevant
properties. We then re-rank genes by their similarity to the given gene set, based on a second
random walk with restarts, performed on the above subnetwork. We demonstrate the effectiveness of
this algorithm for ranking genes related to Drosophila embryonic development and aggressive
responses in the brains of social animals.
Availability and Implementation: DRaWR was implemented as an R package available at
veda.cs.illinois.edu/DRaWR.
Contact: blatti@illinois.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
A common task in bioinformatics is to characterize co-expressed gene sets using enrichment
methods, such as Hypergeometric tests or gene set enrichment analysis (GSEA) (Subramanian et al.,
2005), associating the gene set with other previously annotated sets. These pre-existing gene sets
may be defined from many diverse types of biological knowledge, such as shared protein domains,
evolutionary origins, biological processes, etc. Public databases of curated annotations that
enable this paradigm of gene set characterization are highly diverse and rapidly increasing. This
work addresses the challenge of incorporating heterogeneous data from multiple public resources
into the task of characterizing the shared properties of a given gene set and identifying
additional genes that are important and related.
One broad approach employed to perform gene set analysis with these different public resources is
to represent the data as a biological network. Rather than using each data source one at a time to
analyze a co-expressed gene set, sources may be integrated within a network and simultaneously
leveraged to identify related genes. This idea was tested in the ‘MouseFunc’ challenge
(Pena-Castillo et al.,
still discards the specific details about the gene–gene relationships when constructing each
affinity network. For example, the edges within a Pfam protein domain affinity network indicate
that a pair of genes share a protein domain sequence, but do not preserve which domain(s) are
shared.
We developed the DRaWR (‘Discriminative Random Walk with Restarts’) method to rank genes for their
relatedness to a given gene set, using biological networks that maintain detailed information from
public data sources. Our algorithm is explicitly designed to work on heterogeneous networks with
multiple node types that are able to represent a complete collection of public, genomic know-
Additionally, in our application to Drosophila gene sets, we introduced hundreds of ‘motif’
feature nodes that represent distinct binding specificities of fruitfly transcription factors
(TFs). A motif node was connected to all genes whose 5 kb upstream regulatory region contains the
motif, i.e. if the regulatory region includes one of the top 0.5% of the highest scoring 500 bp
windows genome-wide for that motif, as scored by the Stubb program (Sinha et al., 2006). The
weights on these edges were the z-transform of that window’s empirical P-value (see Supp Methods
SM1, Supplementary Table S3). Also for the D.melanogaster study, we incorporated 75 ‘ChIP’ feature
nodes, representing TF occupancy obtained from separate
where all of the homology edges were contained in the submatrix $M_{GG}$, while $M_{F_i G}$ and
$M_{G F_i}$ were the submatrices that represent (weights of) edges between all feature nodes of
type $i$ and gene nodes in $G$. There were no edges between feature nodes, meaning
$M_{F_i F_j} = 0$ for all $i, j$.

2.2 Functional annotation from two-stage random walk
Given a heterogeneous biological network $M$, a gene set $Q$ referred to as the ‘query’ set, and
the universe $U$ of all genes to rank ($U \subseteq G$), we employed a two-stage algorithm based on
a modified random walk with restart (RWR) approach (Tong et al., 2006) to rank the gene nodes of
$U$. The algorithm additionally ranks the feature nodes in the network $M$ by their relevance to
the query set $Q$. The intuition behind an RWR algorithm is often described in terms of a ‘walker’
that traverses the nodes of a network. With probability $(1 - c)$, where $c$ is the restart
parameter, the walker follows an outgoing edge to a neighboring node, and with probability $c$ the
walker resets the walk by jumping directly to one of the genes in the ‘restart set’, defined as the
query set $Q$ in our algorithm. In properly formed networks, the probability distribution of the
walker over all nodes converges in the long run to a stationary distribution. This distribution
produces a ranking of all nodes that incorporates the connectedness of each node in the network as
well as its proximity to the query set. In the first stage of our DRaWR algorithm,

$$v_{t+1} = (1 - c) A v_t + c\,a \qquad (4)$$

where $c$ is the restart probability and $a$ reflects the probability of jumping to a gene in the
restart set. When the restart set is defined as the set of query genes $Q$, then

$$a^Q_i = \begin{cases} \dfrac{1}{|Q|} & \text{for gene nodes in } Q \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

As the random walk is irreducible and aperiodic, the iterative update of this procedure is
guaranteed to converge to the stationary distribution of the random walk regardless of the initial
probability distribution $v_0$. We ran iterations of the RWR with the query set defining the
restart set ($a = a^Q$) until the vector $v_t$ converged
2170 C.Blatti and S.Sinha
($\lVert v_{t+1} - v_t \rVert < 0.05$). We denote this converged probability distribution as
$\tilde{v}_Q$ (see Fig. 1B). The ranking of all nodes of $M$ by the probabilities of $\tilde{v}_Q$
is referred to as the ‘stage 1 query ranking’.
We wanted the ranking from the first stage to discriminate feature nodes that are related to the
query set $Q$ from those feature nodes that have high ranking in $\tilde{v}_Q$ simply due to their
high connectivity in the network. To do this, we must also produce a ranking of nodes that does not
depend on the query set. Therefore, in the first stage of DRaWR, we repeated the RWR procedure
using the universe set $U$ of all genes as the restart set (in place of set $Q$ above). We thus
arrived at a second converged relevance vector $\tilde{v}_U$ (see Fig. 1A) and refer to the ranking
it induces on all nodes as the ‘stage 1 baseline ranking’. Note that $\tilde{v}_U$ captures the
overall relevance/importance of each node in the network without regard to the query set, whereas
$\tilde{v}_Q$ incorporates overall network structure as well as proximity to the query set.
Therefore, to find the feature nodes most specifically relevant to the query genes, we examined the
difference between these vectors, $\tilde{v}_Q - \tilde{v}_U$.
For the second stage of our two-stage RWR, we selected the $50k$ ($k$ is the number of feature
types) most query-specific feature nodes, defined as having the greatest values in
$\tilde{v}_Q - \tilde{v}_U$, and created a subnetwork $M'$ from the initial matrix $M$ by removing
all other feature nodes and their adjacent edges. Thus,

$$M' = \begin{bmatrix}
M_{GG} & M_{G F'_1} & \cdots & M_{G F'_k} \\
M_{F'_1 G} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
M_{F'_k G} & \cdots & \cdots & M_{F'_k F'_k}
\end{bmatrix} \qquad (6)$$

where $F'_i$ represented only the selected feature nodes of feature type $i$. Using the same
normalization procedure as above, we renormalized $M'$ by type and converted it to the transition
probability matrix $A'$. We repeated the random walk using $A'$ and $a^Q$ (restart set defined from
the query set $Q$) until we converged to the new relevance vector $\tilde{v}'_Q$ (see Fig. 1C). The
ranking of all nodes induced by this new relevance vector was called the ‘stage 2 query ranking’.

2.2.2 Evaluation of two stage RWR algorithm
We employed a cross validation scheme to evaluate the results of our ranking method. For each
given query gene set, we held out 10% of the genes for testing, $Q_{Te}$, and the remaining 90% of
the gene set were supplied to the algorithm as the query set $Q_{Tr}$. With a query set $Q_{Tr}$,
we produced the ‘stage 1 query rankings’, identified the relevant feature nodes, extracted the
query-specific subnetwork, and repeated the RWR to produce the stage 2 query ranking. From the
calculated rankings and the held out test sets $Q_{Te}$, we produced receiver operating
characteristic (ROC) curves and quantified the performance of our algorithm with the area under
these curves (AUROC).

3 Results
3.1 Applications to Drosophila developmental genes
We first applied the DRaWR algorithm to sets of genes defined based on in situ hybridization
images of gene expression in Drosophila embryos from BDGP (Tomancak et al., 2002). For this
analysis, we focused on 92 spatio-temporal expression patterns (or ‘domains’) that contained
between 100 and 1200 genes with the specific expression pattern. We applied the DRaWR algorithm to
genes of each expression domain separately and evaluated gene rankings with the AUROC on the held
out test set. In this application, we tested the feasibility of our algorithm to find additional
genes related to each query set (using the AUROC measures described above). This application is
important in instances where experimental annotation of genes has a non-trivial cost (as with
constructing and imaging in situ hybridizations). Predicting other genes that share the expression
pattern of the query set can provide investigators a manageable number of additional genes to
assay.
We began by creating a Drosophila-specific heterogeneous network that contained gene nodes
connected by ‘homology’ edges as
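
As an informal illustration of the two-stage procedure of Section 2.2 and the hold-out evaluation
of Section 2.2.2, the following Python sketch runs both stages on a toy six-node network. It is
not the DRaWR R package. Several choices are assumptions made only for this example: plain column
normalization stands in for the normalization by type, the convergence check uses an L1 norm with
a much tighter tolerance than the 0.05 threshold, the restart probability is fixed at c = 0.3, and
the function names (column_normalize, rwr, drawr, auroc) and the toy edges are hypothetical.

```python
def column_normalize(M):
    """Scale each column of a non-negative square matrix to sum to 1.
    Assumption: stands in for the paper's normalization by type."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def rwr(A, restart, c=0.3, tol=1e-10, max_iter=10000):
    """Iterate v_{t+1} = (1 - c) A v_t + c a until the L1 change is below tol."""
    n = len(A)
    a = [1.0 / len(restart) if i in restart else 0.0 for i in range(n)]
    v = a[:]
    for _ in range(max_iter):
        nxt = [(1 - c) * sum(A[i][j] * v[j] for j in range(n)) + c * a[i]
               for i in range(n)]
        done = sum(abs(x - y) for x, y in zip(nxt, v)) < tol
        v = nxt
        if done:
            break
    return v


def drawr(M, genes, features, query, universe, n_keep, c=0.3):
    """Two-stage ranking: pick query-specific features, then re-rank genes."""
    A = column_normalize(M)
    v_q = rwr(A, query, c)       # stage 1 query ranking
    v_u = rwr(A, universe, c)    # stage 1 baseline ranking
    kept = sorted(features, key=lambda f: v_q[f] - v_u[f], reverse=True)[:n_keep]
    nodes = list(genes) + kept   # subnetwork: all genes plus selected features
    sub = [[M[i][j] for j in nodes] for i in nodes]
    restart = {nodes.index(q) for q in query}
    v2 = rwr(column_normalize(sub), restart, c)  # stage 2 query ranking
    return kept, {g: v2[nodes.index(g)] for g in genes}


def auroc(scores, positives):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy network (hypothetical): genes 0-3, features 4 and 5.
# Feature 4 annotates genes 0 and 1 only; feature 5 is a hub touching every gene.
genes, features = [0, 1, 2, 3], [4, 5]
M = [[0.0] * 6 for _ in range(6)]
for i, j in [(4, 0), (4, 1), (5, 0), (5, 1), (5, 2), (5, 3)]:
    M[i][j] = M[j][i] = 1.0

query_train, held_out = {0}, {1}   # gene 1 is held out for testing
kept, gene_scores = drawr(M, genes, features, query_train, set(genes), n_keep=1)

# Evaluate only on genes that were not given to the algorithm as queries.
candidates = {g: s for g, s in gene_scores.items() if g not in query_train}
print("kept features:", kept)
print("AUROC on held-out gene:", auroc(candidates, held_out))
```

In this toy case the hub feature 5 receives more probability under the baseline restart than under
the query restart, so the difference between the two stage 1 vectors favors feature 4, and the
second walk on the reduced network then ranks the held-out gene 1 above the unrelated genes 2 and 3.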
Discriminative random walks with restart 2171
network defined from Pfam domain edges as well as genetic and protein interaction edges (SM5 and
Supplementary Table S16). We found that our random walk based approach on type-normalized
heterogeneous networks produces a similar average AUROC (0.7051) to the GeneMANIA label
propagation method on the weighted combination of the corresponding gene–gene affinity networks
(Supplementary Fig. S4). In both analyses, our algorithm was also able to report the most relevant
protein domains, a capability that GeneMANIA lacks.

3.3 Application to multi-species behavioral aggression

Table 1. Ten query-specific features

Rank  Feature node name  Feature node type
1     Striatum           Brain Atlas
2     Retrohippocampal   Brain Atlas
3     Hippocampus        Brain Atlas
4     Pallidum           Brain Atlas
5     MRJP               Prot Domain
6     PMP22_Claudin      Prot Domain
7     JHBP               Prot Domain
8     Globin             Prot Domain
9     Olfactory          Brain Atlas

enabled us to integrate experimental results from different species with knowledge from many
different sources in a single framework.
We also examined the aggression-related DE gene set of each species separately to check if these
gene sets have varying levels of coherence that may make it more or less difficult to identify
related genes. For each species, we tested ranking accuracy on held out DE genes, using either the
5-species network or a single-species network appropriate for that species. In general, we found
that the species-specific DE gene sets whose related genes were the most difficult to rank
correctly using only the appropriate single-species network showed the greatest improvement when
using the multi-species net-
There are several limitations to the random walk based approach. First, we are only able to
represent positive information. Edges are only able to convey how closely related two nodes are,
and nodes are only allowed to be annotated as belonging to the given gene set. However, proper use
of negative information may create a more nuanced network and produce better outcomes. For
example, we may want to add edges that represent mutual exclusivity or strong anti-correlation
between two nodes in the network. We may also have negative examples of a property of interest
that we would like to incorporate to make rankings more accurate. Many of these properties may be
addressed by remapping our random walk
Hofree,M. et al. (2013) Network-based stratification of tumor mutations. Nat. Methods, 10,
1108–1115.
Hou,J.P. and Ma,J. (2014) DawnRank: discovering personalized driver genes in cancer. Genome Med.,
6, 56.
Ivan,G. and Grolmusz,V. (2011) When the Web meets the cell: using personalized PageRank for
analyzing protein interaction networks. Bioinformatics, 27, 405–407.
Jacquemin,T. and Jiang,R. (2013) Walking on a tissue-specific disease–protein-complex
heterogeneous network for the discovery of disease-related protein complexes. Biomed. Res. Int.,
2013, 732650.
Johansson,A.K. and Hansen,S. (2001) Increased novelty seeking and decreased harm avoidance in rats
showing Type 2-like behaviour following basal forebrain neuronal loss. Alcohol. Alcohol., 36,
520–524.
Pena-Castillo,L. et al. (2008) A critical assessment of Mus musculus gene function prediction
using integrated genomic evidence. Genome Biol., 9, S2.
Reimand,J. et al. (2008) GraphWeb: mining heterogeneous biological networks for gene modules with
functional significance. Nucleic Acids Res., 36, W452–W459.
Rittschof,C.C. et al. (2014) Neuromolecular responses to social challenge: common mechanisms
across mouse, stickleback fish, and honey bee. Proc. Natl. Acad. Sci. U. S. A., 111, 17929–17934.
Rozowsky,J. et al. (2009) PeakSeq enables systematic scoring of ChIP-seq experiments relative to
controls. Nat. Biotechnol., 27, 66–75.
Salwinski,L. et al. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res.,
32, D449–D451.
Shen,R. et al. (2012) Mining functional subgraphs from cancer protein-protein