Scalable learning of large networks

Lane, T.; Roy, S.; Werner-Washburne, M.; Plis, S.

Sushmita Roy§, Sergey Plis§, Margaret Werner-Washburne+ and Terran Lane§
§ Department of Computer Science, University of New Mexico
+ Department of Biology, University of New Mexico

Abstract

Cellular networks inferred from microarray data provide important insight into genetic interactions in cells under different conditions. Unfortunately, many network inference algorithms do not scale to whole-genome data with hundreds of variables. We propose a novel divide and conquer approach, Cluster and Infer Networks (CIN), for scalable learning of large networks. CIN learns network structures in two steps: (a) pre-cluster network nodes and learn networks per cluster, (b) revisit the cluster assignment of variables with poor neighborhoods.

Results on networks with known topologies suggest that CIN has substantial speed benefits, without significant performance loss. We applied our approach to a microarray compendium of glucose-starved yeast cells. Analysis of the inferred structure identified several subnetworks involved in different metabolic processes. Our results are consistent with existing literature and provide fine-grained, network-level information about cellular response mechanisms under starvation conditions.

Keywords: Cellular networks, network structure inference

1 Introduction

Cellular adaptations essential for survival under changing environmental conditions are driven by a complex, but coordinated, set of interactions among genes, proteins and metabolites. Identification of these interactions using whole-genome microarray data is crucial for understanding the functional aspects of these networks and, therefore, how cells respond to changing environmental conditions.

Probabilistic graphical models are well-known frameworks for modeling cellular networks [5]. Such models are especially suitable for identifying higher-order dependencies among arbitrarily sized groups of nodes. When the network is not known, network structure inference algorithms are used to infer the structure. As optimal de-novo network reconstruction is known to be NP-hard, polynomial time (O(n^(k−1))) algorithms for learning networks with bounded degree have been developed [1]. Here n is the number of nodes and k is the degree. Unfortunately, when n becomes very large (several hundreds) and k is of any interesting size (2 ≤ k ≤ 10), as is the case for genome-scale networks, these algorithms become expensive in time and memory.

We present a tractable approach to learn the structure of cellular networks represented as undirected graphical models. Our approach, Cluster and Infer Networks (CIN), clusters the nodes into smaller, possibly overlapping groups and learns separate networks per cluster. By partitioning the nodes into smaller groups, we avoid searching over the complete node set, resulting in runtime benefits. Because the initial clustering may not be perfect, we iteratively reassign nodes to clusters to improve the quality of the node neighborhoods, repeating the procedure until convergence. The complete network structure is obtained by combining the networks inferred per cluster.

CIN is a meta-algorithm and can work with any existing algorithm for learning network structure. We used the Markov blanket search (MBS) algorithm [6] for learning the structure of undirected graphs (specifically, Markov random fields). The MBS algorithm uses an information theoretic score to identify the best neighborhood (Markov blanket) for each variable.

We compare CIN, with and without cluster reassignment, against standard MBS that infers networks over the entire, unpartitioned node set. CIN gives significant speed improvements, without significant accuracy loss of the inferred structures, when evaluated on simulated data from networks with known structure. Adding cluster reassignment further improves performance, still keeping CIN faster than MBS with no clustering.

We also applied CIN to a yeast microarray compendium of glucose-starvation induced stationary phase. We were able to identify several subgraphs that were enriched in a large number of metabolic processes (e.g. glycolysis, acetyl-CoA metabolic process), characteristic of yeast cells in starvation conditions. A large proportion of subgraphs enriched in biological functions (p-value < 10^−5) were likely to be true positives (false discovery rate < 0.05). This indicates that our approach accurately and efficiently captures dependencies crucial for understanding cellular stress response mechanisms.

2 Experiments

We evaluated the merit of our CIN approach on data generated from networks of known topology. We used three networks, ECOLI, G75 and G50, with n = 188, 150, and 100 nodes respectively. ECOLI is obtained from the regulatory network of the bacterium E. coli [7], and G75 and G50 are artificial regulatory networks generated from a network simulator. These networks are sufficiently large to require our pre-clustering approach, yet small enough to enable structure learning via an approach without clustering. The k-means algorithm was used to cluster the data.

We evaluated the performance by comparing how well different subgraphs in the true networks matched subgraphs in the inferred networks. Specifically, we matched: (a) edges in the true and inferred networks, (b) edges in the true network to paths in the inferred network and vice versa, (c) shortest paths in the true and inferred networks, (d) subgraphs per vertex generated from its 1-step neighbors (1-N). We also applied our algorithm to real microarray data from the yeast compendium [3]. We generated the 1-step neighborhood subgraphs (1-N) for the inferred network and obtained Gene Ontology process annotation for each subgraph.

3 Results

3.1 CIN has significant speed benefits without substantial performance loss

We analyzed CIN in combination with the Markov blanket search (MBS) algorithm on three networks of known topology. We considered three approaches to learn networks: (a) CIN with MBS with no cluster reassignment (Norevisit), (b) CIN with MBS with cluster reassignment (Revisit), and (c) standard MBS (Nocluster). The running time of CIN with MBS, for both Norevisit and Revisit, is significantly smaller than that of standard MBS (Fig. 1). The quality of the networks inferred using both CIN approaches is comparable to that of MBS on the complete variable set (Fig. 2). We show results only for edge-path match. Results from edge-edge, path-path and 1-N subgraphs are similar. Overall, we found that CIN had significant speed benefits over standard MBS. Revisiting clusters improves results at additional runtime cost, but is still faster than standard MBS.

[Figure 1 plot: "Runtime"; y-axis: Time (sec); x-axis: Datasets (ECOLI-ALL, ECOLI-TF, G75-ALL, G75-TF, G50-ALL, G50-TF); series: Nocluster, Norevisit, Revisit]

Figure 1: Run time for different algorithms; lower runtimes are better. Algorithms were compared using three networks of known structure. Per network, two datasets were generated by either perturbing all genes (ECOLI-ALL, G75-ALL, G50-ALL) or only transcription factor genes (ECOLI-TF, G75-TF, G50-TF).

[Figure 2 plot: "Edge-Path match"; y-axis: Score; x-axis: Datasets (ECOLI-ALL, ECOLI-TF, G75-ALL, G75-TF, G50-ALL, G50-TF); series: Nocluster, Norevisit, Revisit]

Figure 2: Match scores for different algorithms; higher scores are better. Nocluster: MBS algorithm without pre-clustering. Norevisit: CIN with MBS without cluster reassignment. Revisit: CIN with MBS with cluster reassignment.

3.2 CIN identifies subgraphs enriched in meaningful biological processes from yeast stationary phase compendium

We applied CIN with MBS to a recently generated microarray compendium of yeast in stationary phase. As the true network is not known for these data, we used Gene Ontology (GO) for validation of the inferred subgraphs [4]. At a stringent enrichment threshold (p-value < 10^−5), 89% of the subgraphs enriched in a GO term were likely to be true positives (FDR < 0.05). A fine-grained analysis of the subgraphs implicated several metabolic pathways (glycolysis, pyruvate metabolism, acetyl-CoA metabolic process), which is in agreement with known literature [2]. We also found one subgraph involved in aging (SSD1, SNF1, YFL042C, PDR1, SCH9) and another in cell cycle arrest (YDR196C, HRT3, FAR8, FAR3). These cells have been hypothesized as models for aging studies, and we provide specific candidate networks that can provide deeper understanding of aging and other cellular processes involved in proper stress response.

4 Conclusion

We present here a tractable approach, Cluster and Infer Networks (CIN), for learning large networks from genome-scale expression data. Our results on networks with known topology indicate that CIN has significant speed improvements without incurring substantial performance loss. Analysis of networks inferred from glucose-starvation yeast microarray data identified several biologically meaningful dependencies. We are currently applying CIN to microarray data under other environmental conditions, to get a better understanding of the condition-specific rewiring of cellular networks.

References

[1] Pieter Abbeel, Daphne Koller, and Andrew Y. Ng. Learning factor graphs in polynomial time and sample complexity. JMLR, 7:1743–1788, 2006.

[2] C. Allen, S. Büttner, A. D. Aragon, J. A. Thomas, O. Meirelles, J. E. Jaetao, D. Benn, S. W. Ruby, M. Veenhuis, F. Madeo, and M. Werner-Washburne. Isolation of quiescent and nonquiescent cells from yeast stationary-phase cultures. J Cell Biol, 174(1):89–100, July 2006.

[3] Anthony D. Aragon, Angelina L. Rodriguez, Osorio Meirelles, Sushmita Roy, George S. Davidson, Chris Allen, Ray Joe, Phillip Tapia, Don Benn, and Margaret Werner-Washburne. Characterization of differentiated quiescent and non-quiescent cells in yeast stationary-phase cultures. Molecular Biology of the Cell, 2008.

[4] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, May 2000.

[5] Nir Friedman. Inferring cellular networks using probabilistic graphical models. Science, 303:799–805, 2004.

[6] Sushmita Roy, Terran Lane, and Margaret Werner-Washburne. Learning structurally consistent undirected probabilistic graphical models. Submitted to NIPS, 2008.

[7] Heladia Salgado, Socorro Gama-Castro, Martin Peralta-Gil, Edgar Diaz-Peredo, Fabiola Sanchez-Solano, Alberto Santos-Zavaleta, Irma Martinez-Flores, Veronica Jimenez-Jacinto, Cesar Bonavides-Martinez, Juan Segura-Salazar, Agustino Martinez-Antonio, and Julio Collado-Vides. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Research, 34:D394, 2006.
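
As an illustration of the two-step CIN procedure (learn a network per cluster, then revisit the cluster assignment of variables with poor neighborhoods), the following is a minimal sketch under stated assumptions, not the authors' implementation. The base learner `learn_edges` is a stand-in that thresholds absolute Pearson correlation; it is not the MBS algorithm. The names `corr`, `learn_edges`, `cin`, the 0.8 threshold, the rule "a poor neighborhood is an empty one", and the use of disjoint clusters are all hypothetical choices made for the example. The initial clusters are passed in directly, where the paper uses k-means.

```python
from itertools import combinations
from statistics import mean


def corr(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def learn_edges(variables, data, threshold=0.8):
    """Stand-in base learner (NOT MBS): link pairs with |correlation| >= threshold."""
    return {frozenset((u, v))
            for u, v in combinations(sorted(variables), 2)
            if abs(corr(data[u], data[v])) >= threshold}


def cin(data, clusters, learner=learn_edges, max_iter=10):
    """Cluster-and-infer sketch: per-cluster learning plus cluster reassignment."""
    clusters = [set(c) for c in clusters]  # assumption: disjoint clusters

    def learn_all():
        # Step (a): learn one network per cluster and combine them by union.
        return set().union(*(learner(c, data) for c in clusters))

    for _ in range(max_iter):
        edges = learn_all()
        connected = {v for e in edges for v in e}
        moved = False
        # Step (b): revisit variables whose neighborhood is empty (assumed criterion).
        for v in sorted(set(data) - connected):
            home = next(i for i, c in enumerate(clusters) if v in c)
            best = max((u for u in data if u != v),
                       key=lambda u: abs(corr(data[v], data[u])))
            target = next(i for i, c in enumerate(clusters) if best in c)
            if target != home:
                clusters[home].discard(v)
                clusters[target].add(v)
                moved = True
        if not moved:  # converged: no variable changed cluster
            return edges, clusters
    return learn_all(), clusters


# Toy data: {a, b, c} and {d, e, f} are two modules; c and f start misplaced.
s1 = [1, 2, 3, 4, 5, 6]
s2 = [1, -1, 0, 0, -1, 1]
data = {
    "a": s1, "b": [2 * x for x in s1], "c": [-x for x in s1],
    "d": s2, "e": [3 * x for x in s2], "f": [x + 10 for x in s2],
}
edges, clusters = cin(data, [{"a", "b", "f"}, {"c", "d", "e"}])
```

In this toy run the first pass finds only the edges a-b and d-e, the reassignment step moves c and f to the clusters holding their most correlated partners, and the second pass recovers both modules. Any structure learner with the same call signature could be passed as `learner`.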

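
The edge-path match used in the evaluation compares edges in one network to paths in the other, in both directions. The paper does not give the scoring formula, so the sketch below is only one plausible reading: it counts an edge as matched if its endpoints are joined by a path of at most `max_len` edges in the other network, and combines the two directions with a harmonic mean. The function names, the `max_len` bound, and the harmonic-mean combination are assumptions made for illustration.

```python
from collections import deque


def adjacency(edges):
    """Build an undirected adjacency map from an iterable of 2-element edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def has_path(adj, src, dst, max_len):
    """Breadth-first search: is dst reachable from src within max_len edges?"""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return True
        if dist < max_len:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return False


def edge_path_fraction(edges, other_edges, max_len):
    """Fraction of `edges` whose endpoints are joined by a short path in `other_edges`."""
    edges = list(edges)
    if not edges:
        return 0.0
    adj = adjacency(other_edges)
    return sum(has_path(adj, u, v, max_len) for u, v in edges) / len(edges)


def edge_path_match(true_edges, inferred_edges, max_len=3):
    """Harmonic mean of the two directional edge-to-path fractions (assumed scoring)."""
    recall = edge_path_fraction(true_edges, inferred_edges, max_len)
    precision = edge_path_fraction(inferred_edges, true_edges, max_len)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy networks: two of three edges match in each direction.
true_net = [("a", "b"), ("b", "c"), ("c", "d")]
inferred_net = [("a", "c"), ("c", "b"), ("d", "e")]
score = edge_path_match(true_net, inferred_net)
```

Here the true edge a-b is matched by the inferred path a-c-b, while c-d has no counterpart, so both directional fractions are 2/3 and the combined score is 2/3.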