Confounding Factors in HGT Detection: Statistical Error, Coalescent Effects, and Multiple Solutions

Cuong Than

Confounding Factors in HGT Detection: Statistical Error, Coalescent Effects, and Multiple Solutions

Journal of Computational Biology, 2007

518 THAN ET AL. 1. INTRODUCTION W HEREAS EUKARYOTES EVOLVE mainly though lineal descent and mutations, bacteria obtain a large proportion of their genetic diversity through the acquisition of sequences from distantly related organisms via horizontal gene transfer (HGT) (Doolittle et al., 2003; Ochman et al., 2000). There has been a large “ideological and rhetorical” gap between researchers who believe that HGT is so rampant that a prokaryotic phylogenetic tree is useless and those who believe that HGT is mere “background noise” which does not affect the reconstructibility of a phylogenetic tree for bacterial genomes. Supporting arguments for these two views have been published. For example, the heterogeneity of genome composition between closely related strains (e.g., in Escherichia coli only 40% of genes are shared in common by three E. coli strains [Welch et al., 2002]) supports the former view, whereas the well-supported phylogeny reconstructed by Lerat et al. (2003) from about 100 “core” genes in  -Proteobacteria gives evidence in favor of the latter view. Nonetheless, regardless of the views and the accuracy of the various analyses, there is a consensus as to the occurrence of HGT and the evolutionary role it plays in bacterial genome diversiﬁcation. Further, HGT is a main process by which bacteria develop resistance to antibiotics (Paulsen et al., 2003), is considered a primary explanation of incongruence among gene phylogenies, and is a signiﬁcant obstacle to reconstructing the Tree of Life (Daubin et al., 2003). The HGT detection problem concerns the detection of the genes that are horizontally transferred into the genome, the donors and recipients of every horizontally transferred gene, and the number of HGT events that occurred during the evolutionary history of a set of species. When HGT occurs, the evolution- ary history of the gene(s) involved does not necessarily agree with that of the species phylogeny. This observation is the fundamental basis of the phylogeny-based HGT detection approach: trees for individual genes are reconstructed (and sometimes a species tree is reconstructed as well, using other data), and their disagreements are identiﬁed to estimate the number (how many) as well as locations (donors and recipients) of HGT events. Beside the computationally challenging problem of quantifying disagreements among trees for the sake of detecting HGT, major challenges that face this approach include (1) determining whether the disagreements are indeed due to HGT, and (2) whether there is a unique HGT “scenario.” Yet, these two challenges encompass a host of issues of which we address three. First, since trees are at best partially known, they have to be reconstructed using a phylogeny reconstruction method. We investigate the impact that the quality of reconstructed trees has on HGT detection. Second, under the assumption that HGT is actually the source of tree disagreements, we investigate the uniqueness of a solution to the HGT detection problem, and establish bounds on the number of possible minimal HGT scenarios. Finally, among closely related species, lineage sorting due to random genetic drift may also cause tree incongruence, thus mimicking the effects of HGT on phylogenies. In this case, accurate HGT detection requires determining the actual cause of tree incongruities, and making the appropriate reconciliation. We make preliminary progress on incorporating HGT into the coalescent model, so as to produce a stochas- tic framework for classifying population-level events (such as lineage sorting) and species-level events (such as HGT). In addition to these three confounding factors, we consider the problem of enumerating all valid coales- cent scenarios that constitute plausible species/gene tree reconciliations, and develop a polynomial dynamic programming algorithm for solving it. This result bears a great signiﬁcance on reducing the solution space for heuristics that search for reconciliation scenarios. Finally, we show, empirically, that the locality of incongruence between a pair of trees has an impact on the numbers of HGT and coalescent reconciliation scenarios. We draw several conclusions from this work. First, to obtain accurate estimates of HGT-based tree incongruence, poorly supported edges of reconstructed trees should be removed. Though an important task to conduct, removing (or contracting) poorly supported edges is not a straightforward task, since standard methods that are in common use for estimating branch support, such as bootstrapping and posterior probabilities in Bayesian analyses, have been shown to be overly “conservative” or “liberal” under various circumstances (Ruths and Nakhleh, 2006). Second, eliminating statistical error from reconstructed trees leads to non-binary trees, and hence phylogeny-based HGT detection methods should be designed to handle such trees (rather than focus on binary trees, which many existing tools do). Third, more than one maximally parsimonious solution (a solution that has the minimum number of HGT edges, or events, to explain the species and gene tree incongruence) may exist, and hence HGT detection methods should search for all such

JOURNAL OF COMPUTATIONAL BIOLOGY Volume 14, Number 4, 2007 © Mary Ann Liebert, Inc. Pp. 517–535 DOI: 10.1089/cmb.2007.A010 Confounding Factors in HGT Detection: Statistical Error, Coalescent Effects, and Multiple Solutions CUONG THAN,1 DEREK RUTHS,1 HIDEKI INNAN,2 and LUAY NAKHLEH1 ABSTRACT Prokaryotic organisms share genetic material across species boundaries by means of a process known as horizontal gene transfer (HGT). This process has great significance for understanding prokaryotic genome diversification and unraveling their complexities. Phylogeny-based detection of HGT is one of the most commonly used methods for this task, and is based on the fundamental fact that HGT may cause gene trees to disagree with one another, as well as with the species phylogeny. Using these methods, we can compare gene and species trees, and infer a set of HGT events to reconcile the differences among these trees. In this paper, we address three factors that confound the detection of the true HGT events, including the donors and recipients of horizontally transferred genes. First, we study experimentally the effects of error in the estimated gene trees (statistical error) on the accuracy of inferred HGT events. Our results indicate that statistical error leads to overestimation of the number of HGT events, and that HGT detection methods should be designed with unresolved gene trees in mind. Second, we demonstrate, both theoretically and empirically, that based on topological comparison alone, the number of HGT scenarios that reconcile a pair of species/gene trees may be exponential. This number may be reduced when branch lengths in both trees are estimated correctly. This set of results implies that in the absence of additional biological information, and/or a biological model of how HGT occurs, multiple HGT scenarios must be sought, and efficient strategies for how to enumerate such solutions must be developed. Third, we address the issue of lineage sorting, how it confounds HGT detection, and how to incorporate it with HGT into a single stochastic framework that distinguishes between the two events by extending population genetics theories. This result is very important, particularly when analyzing closely related organisms, where coalescent effects may not be ignored when reconciling gene trees. In addition to these three confounding factors, we consider the problem of enumerating all valid coalescent scenarios that constitute plausible species/gene tree reconciliations, and develop a polynomial-time dynamic programming algorithm for solving it. This result bears great significance on reducing the search space for heuristics that seek reconciliation scenarios. Finally, we show, empirically, that the locality of incongruence between a pair of trees has an impact on the numbers of HGT and coalescent reconciliation scenarios. Key words: phylogeny, horizontal gene transfer, coalescent. 1 Department 2 School of Computer Science, Rice University, Houston, Texas. of Advanced Sciences, Graduate University for Advanced Studies, Kanagawa, Japan. 517 518 THAN ET AL. 1. INTRODUCTION W HEREAS EUKARYOTES EVOLVE mainly though lineal descent and mutations, bacteria obtain a large proportion of their genetic diversity through the acquisition of sequences from distantly related organisms via horizontal gene transfer (HGT) (Doolittle et al., 2003; Ochman et al., 2000). There has been a large “ideological and rhetorical” gap between researchers who believe that HGT is so rampant that a prokaryotic phylogenetic tree is useless and those who believe that HGT is mere “background noise” which does not affect the reconstructibility of a phylogenetic tree for bacterial genomes. Supporting arguments for these two views have been published. For example, the heterogeneity of genome composition between closely related strains (e.g., in Escherichia coli only 40% of genes are shared in common by three E. coli strains [Welch et al., 2002]) supports the former view, whereas the well-supported phylogeny reconstructed by Lerat et al. (2003) from about 100 “core” genes in -Proteobacteria gives evidence in favor of the latter view. Nonetheless, regardless of the views and the accuracy of the various analyses, there is a consensus as to the occurrence of HGT and the evolutionary role it plays in bacterial genome diversification. Further, HGT is a main process by which bacteria develop resistance to antibiotics (Paulsen et al., 2003), is considered a primary explanation of incongruence among gene phylogenies, and is a significant obstacle to reconstructing the Tree of Life (Daubin et al., 2003). The HGT detection problem concerns the detection of the genes that are horizontally transferred into the genome, the donors and recipients of every horizontally transferred gene, and the number of HGT events that occurred during the evolutionary history of a set of species. When HGT occurs, the evolutionary history of the gene(s) involved does not necessarily agree with that of the species phylogeny. This observation is the fundamental basis of the phylogeny-based HGT detection approach: trees for individual genes are reconstructed (and sometimes a species tree is reconstructed as well, using other data), and their disagreements are identified to estimate the number (how many) as well as locations (donors and recipients) of HGT events. Beside the computationally challenging problem of quantifying disagreements among trees for the sake of detecting HGT, major challenges that face this approach include (1) determining whether the disagreements are indeed due to HGT, and (2) whether there is a unique HGT “scenario.” Yet, these two challenges encompass a host of issues of which we address three. First, since trees are at best partially known, they have to be reconstructed using a phylogeny reconstruction method. We investigate the impact that the quality of reconstructed trees has on HGT detection. Second, under the assumption that HGT is actually the source of tree disagreements, we investigate the uniqueness of a solution to the HGT detection problem, and establish bounds on the number of possible minimal HGT scenarios. Finally, among closely related species, lineage sorting due to random genetic drift may also cause tree incongruence, thus mimicking the effects of HGT on phylogenies. In this case, accurate HGT detection requires determining the actual cause of tree incongruities, and making the appropriate reconciliation. We make preliminary progress on incorporating HGT into the coalescent model, so as to produce a stochastic framework for classifying population-level events (such as lineage sorting) and species-level events (such as HGT). In addition to these three confounding factors, we consider the problem of enumerating all valid coalescent scenarios that constitute plausible species/gene tree reconciliations, and develop a polynomial dynamic programming algorithm for solving it. This result bears a great significance on reducing the solution space for heuristics that search for reconciliation scenarios. Finally, we show, empirically, that the locality of incongruence between a pair of trees has an impact on the numbers of HGT and coalescent reconciliation scenarios. We draw several conclusions from this work. First, to obtain accurate estimates of HGT-based tree incongruence, poorly supported edges of reconstructed trees should be removed. Though an important task to conduct, removing (or contracting) poorly supported edges is not a straightforward task, since standard methods that are in common use for estimating branch support, such as bootstrapping and posterior probabilities in Bayesian analyses, have been shown to be overly “conservative” or “liberal” under various circumstances (Ruths and Nakhleh, 2006). Second, eliminating statistical error from reconstructed trees leads to non-binary trees, and hence phylogeny-based HGT detection methods should be designed to handle such trees (rather than focus on binary trees, which many existing tools do). Third, more than one maximally parsimonious solution (a solution that has the minimum number of HGT edges, or events, to explain the species and gene tree incongruence) may exist, and hence HGT detection methods should search for all such CONFOUNDING FACTORS IN HGT DETECTION 519 solutions, unless additional biological information is given, or a model that simultaneously incorporates HGT, speciation, and extinction events (Kunin and Ouzounis, 2003). Finally, trees may be incongruent due to processes other than HGT; hence, classifying the sources of incongruence and reconciling them accordingly is imperative. 2. TREE INCONGRUENCE AND HGT DETECTION A gene tree is a model of how a gene evolves. As a gene at a locus in the genome replicates and its copies are passed on to more than one offspring, branching points are generated in the gene tree. Because the gene has a single ancestral copy, barring recombination, the resulting history is a branching tree (Maddison, 1997). Thus, within a species, many tangled gene trees can be found, one for each nonrecombined locus in the genome. Exploring incongruence among gene trees is the basis for phylogeny-based HGT detection and reconstruction. We illustrate some of the scenarios that may lead to gene tree incongruence in Figure 1. The species tree is represented by the “tubes”; it has A and B as sister taxa whose most recent common ancestor (MRCA) is a sister taxon of C . In the case of HGT, shown in Figure 1a, genetic material is transferred from one lineage to another. Sites that are not involved in a horizontal transfer are inherited from the parent while other sites are horizontally transferred from another species. Figure 1b gives an example of a gene tree that disagrees with the species phylogeny because of lineage sorting due to random genetic drift: the genes of B and C coalesced before their MRCA coalesced with the gene of species A. Moreover, sometimes multiple events “cancel out” one another’s effects when co-occurring in the same dataset; for example, in Figure 1c, lineage sorting “hides” the incongruence between the species and gene trees (tree topologies) that would have resulted from the HGT event. Another factor that may lead to gene and species tree disagreements is that trees reconstructed by phylogenetic methods may not be completely accurate (we refer to this as statistical error in the trees); hence, disagreements among trees due to such inaccuracies may trigger HGT “signal,” thus leading to overestimation of the actual HGT events. Notice that in the case of lineage sorting, the species phylogeny is still a tree, and the gene trees should be reconciled within its branches. However, in the case of HGT, the evolutionary history of the species genomes may not be represented by phylogenetic trees; rather, phylogenetic networks are the appropriate model (Moret et al., 2004; Kunin et al., 2005). The phylogeny-based HGT detection problem seeks the phylogenetic network with minimum number of reticulation nodes, e.g., HGT edges, to reconcile the species and gene trees. The minimization simply reflects a maximally parsimonious solution: in the absence of any additional biological knowledge, the simplest solution is sought. In the case, the simplest solution is one that invokes the minimum number of HGT events to explain tree incongruence. There has been a large body of work on this problem (Hallett and Lagergren, 2001; Nakhleh et al., 2004; Bordewich and Semple, 2005; Nakhleh et al., 2005; Makarenkov, 2001). FIG. 1. Gene tree that disagrees with the species tree due to HGT from C to B (a) and lineage sorting due to random genetic drift (b). (c) The effect of the HGT event (from B to C ) is “canceled out” by random genetic drift, resulting in congruent species and gene trees. 520 THAN ET AL. FIG. 2. (a) A phylogenetic network with a single HGT event from X to Y . (b) The underlying organismal (species) tree. (c) The tree of a horizontally transferred gene. 2.1. Terminology and definitions Let T D .V; E/ be a tree, where V and E are the tree nodes and tree edges, respectively, and let L.T / denote its leaf set. Further, let X be a set of taxa (species). Then, T is a phylogenetic tree over X if there is a bijection between X and L.T /. A tree T is said to be rooted if the set of edges E is directed and there is a single distinguished internal vertex r with in-degree 0. A phylogenetic network N D N.T / D .V 0 ; E 0 / over the taxa set X is derived from T D .V; E/ by adding a set „ of edges to T , where each edge h 2 „ is added as follows: (1) split an edge e 2 E by adding new node, ve ; (2) split an edge e 0 2 E by adding new node, ve0 ; (3) finally, add a directed HGT edge from ve to ve0 . In this case, we write N D T C „. Figure 2a shows a phylogenetic network obtained by adding a single HGT edge to the tree in Figure 2b. In a case where horizontal transfer of a single gene is involved, the edges e 0 that are split in step (2) above must be unique. In other words, no more than a single HGT edge may be incident into a single tree edge. However, this is not necessarily true if the phylogenetic network models the evolutionary history of multiple genes. In this case, an edge e 0 may be, for example, split twice, once because of an HGT involving gene g and another because of an HGT involving gene g0 . It is important to note that our definition of a phylogenetic network allows adding an HGT edge from a tree edge e to another tree edge e 0 “below” it. While this seems to violate biological constraints (such as the temporal co-existence of the donor and recipient), this case may arise in practice due, for instance, to incomplete taxon sampling or extinction. For a more thorough discussion of this issue, and modeling reticulate evolution in general, the reader is referred to Moret et al. (2004). Finally, we denote by T .N / the set of all trees contained inside network N . Each such tree is obtained by the following two steps: (1) for each node of in-degree 2, remove one of the incoming edges, and then (2) for every node x of in-degree and out-degree 1, whose parent is u and child is v, remove node x and its two adjacent edges, and add a new edge from u to v. For example, the two trees in Figure 2b,c are the only members of T .N /, where N is the network in Figure 2a. Definition 1. (The HGT Reconstruction Problem) Input: Species tree ST and gene tree GT. Output: Minimum-cardinality set „ of HGT edges, such that N D ST C „ and GT 2 T .N /. 3. THE EFFECT OF STATISTICAL ERROR ON HGT DETECTION In this section we investigate, through simulations, the effect of error in the reconstructed trees on the detection of HGT. In particular, we consider the minimum number of HGT events inferred by HGT detection methods, as well as the number of such maximally parsimonious solutions found by these methods. Experimental settings. We used the r8s tool (Sanderson, n.d.) to generate four random birth-death phylogenetic trees, Ti , i 2 f10; 25; 50; 100g, where i denotes the number of taxa in the tree. The r8s tool generates molecular clock trees; we deviated the trees from this hypothesis by multiplying each edge in the tree by a number randomly drawn from an exponential distribution. The expected evolutionary diameter CONFOUNDING FACTORS IN HGT DETECTION 521 (longest path between any two leaves in the tree) is 0.2. Then, from each model “species” tree Ti , we generated five different “gene” trees, Ti;j , j 2 f1; 2; 3; 4; 5g, where j denotes the number of subtree prune and regraft (SPR) moves applied to Ti to obtain Ti;j .1 The SPR moves were applied to simulated HGT events such that no cycles are created, and such that there is no redundancy (i.e., if k SPR moves were simulated to obtain T2 from T1 , then T2 cannot be obtained from T1 be fewer than k SPR moves). Beside these two requirements, the choice of the “donor” and “recipient” (i.e., the source and target of an HGT edge) were done purely randomly, without considerations to evolutionary distance between the two endpoints, or any other issues. For each Ti and Ti;j , i 2 f10; 25; 50; 100g and j 2 f1; 2; 3; 4; 5g, and for each sequence length ` ` 2 f250; 500; 1000; 2000; 4000; 8000g, we generated 30 DNA sequence alignments Si` Œk and Si;j Œk, 1 k 30, whose evolution was simulated down their corresponding trees under the GTRCCI (gamma distributed rates, with invariable sites) model of evolution, using the Seq-gen tool (Rambaut and Grassly, 1997). We used the parameter settings of Zwickl and Hillis (2002). Then, from each sequence alignment, we reconstructed a tree TNJ using the Neighbor Joining (NJ) method (Saitou and Nei, 1987), and another tree using a maximum parsimony heuristic as implemented in PAUP (Swofford, 1996). Since the maximum parsimony heuristic may return a set of optimal trees, for each alignment we only considered the strict consensus of each such set, and referred to that as the tree TMP. At the end of this process we had 4 trees Ti , 20 trees Ti;j , 720 NJ trees TNJ `i Œk, 3600 NJ trees TNJ `i;j Œk, 720 MP trees TMP`i Œk, and 3600 MP trees TMP`i;j Œk (i 2 f10; 25; 50; 100g, j 2 f1; 2; 3; 4; 5g, 1 k 30, and ` 2 f250; 500; 1000; 2000; 4000; 8000g). To compute minimal HGT scenarios as well as the number of such scenarios, we applied two methods to pairs of species and gene trees: LatTrans (Hallett and Lagergren, 2001; Addario-Berry et al., 2003) and RIATA-HGT (Nakhleh et al., 2005), which has been recently extended and improved to compute multiple minimal solutions (Jin et al., 2007; the original method is described in Nakhleh et al., 2005), computes only a single minimal solutions, since the emphasis of the underlying algorithm was on estimating the minimum number of HGT events). There are several methods for recovering candidate HGT edges based on comparing a pair of trees (Hallett and Lagergren, 2001; Makarenkov, 2001; Addario-Berry et al., 2003; Nakhleh et al., 2005; MacLeod et al., 2005; Beiko and Hamilton, 2006). In this work, we did not intend to study or compare the performance of these methods, but rather to try to quantify and understand the effect of various factors on the estimation of the number of HGT events. For this purpose, we chose two different methods: LatTrans and RIATA-HGT. Since both methods are heuristics, independently developed, and their relative performance is unknown (in terms of accuracy), we used both to ensure that the effects measured reflect general trends, rather than issues specific to a particular heuristic. Indeed, the similar trends observed in the experiments raise certain points that seem to be independent of the heuristic used. Both tools were applied to three different types of pairs of trees. Type I pairs .Ti ; Ti;j /: in this case, the species and gene trees are assumed to be correct. Type II pairs .Ti ; TNJ `i;j Œk/ and .Ti ; TMP`i;j Œk/: in this case, the species tree is correct, and the gene trees are estimated (using NJ and MP, respectively). Type III pairs .TNJ `i Œk; TNJ `i;j Œk/ and .TMP`i Œk; TMP`i;j Œk/: in this case, both the species and gene trees are inferred. The goal of running the methods in these different ways is to estimate the error due to inaccuracy in the different trees. Due to space limitations, we only show results using NJ trees, 25-taxon trees (Since LatTrans cannot handle non-binary trees, it was not run on MP trees, and from RIATA-HGT’s results on MP trees, the trends on MP trees are very similar). In each run of a tool on a pair of trees, we computed two values: the number of inferred HGT events, and the number of such scenarios (or solutions) found by the method. In Type II and Type III pairs, we report the average of all 30 runs for each combination of i , j , and `. 1 An SPR move simulates an HGT event. 522 THAN ET AL. 3.1. The effect of statistical error on estimating the number of HGT events Both LatTrans and RIATA-HGT computed the correct number of SPR moves (i.e., HGT edges) when applied to Type I pairs. In other words, when both the species and gene trees were correct, both methods made an accurate estimation of the number of HGT events. The performance of both methods, in terms of the number of inferred HGT events, on Type II and Type III pairs of trees is shown in Figure 3. Figures 3a and 3b show that, when the species tree is accurate, and the gene tree is inferred, both methods accurately estimate the number of HGT events for the case of five HGT events when the sequences are of length 8000. They overestimate the number for all other cases, at all sequence lengths. As the sequence length increases, the trees inferred by NJ become more accurate, since NJ is statistically consistent (Atteson, 1999), and hence the improvement in the performance of the methods as the sequence length increases. At sequence length 250, the methods have the worst performance. When both the species and gene trees are inferred, the overestimation becomes larger, as shown in Figures 3c and 3d. In this case, even at sequence length 8000 the methods do overestimate the actual number of HGT events. It is worth noting that both methods have almost identical performance in terms of the number of HGT events inferred (RIATA-HGT does slightly better in some cases at sequence length 1000). However, RIATA-HGT is orders of magnitude faster. Figure 4 shows the relative performance (in terms of actual running time) of the two methods on 25-taxon NJ trees. Specially, LatTrans took several days on each pair of 50-taxon trees, and for sequence length 250 it crashed after 4 days without returning results. Given that the two methods accurately estimated the number of HGT events in Type I pairs of trees, i.e., accurate species and gene trees, the results show that error in inferred trees (one or both) leads to overestimation of the number of HGT events. The overestimation is even larger for the larger data FIG. 3. The number of HGT events inferred by LatTrans and RIATA-HGT, as a function of the sequence length. Each curve corresponds to one of the five actual numbers of HGT events: ?, one HGT; 4, two HGTs; C, three HGTs; , four HGTs; and ı, five HGTs. 25-taxon trees inferred using NJ. CONFOUNDING FACTORS IN HGT DETECTION 523 FIG. 4. The performance of LatTrans and RIATA-HGT in terms of actual running times (in seconds), as functions of the sequence length. Each curve corresponds to one of the five actual numbers of HGT events: ?, one HGT; 4, two HGTs; C, three HGTs; , four HGTs; and ı, five HGTs. sets (50- and 100-taxon trees). Therefore, it is important to eliminate statistical error from trees before estimating HGT events. Ruths and Nakhleh (2006) have studied the performance of various methods for eliminating incorrect edges while maintaining accurate ones. This elimination, in the form of contracting poorly supported edges, may lead to non-binary trees in certain cases, which cannot be handled by LatTrans, although they can be handled by RIATA-HGT. 4. THE UNIQUENESS OF HGT SCENARIOS Moret et al. (2004) showed that a phylogenetic network that reconciles two trees need not be unique, by showing two phylogenetic networks, each with a single reticulation event, that reconcile the same pair of trees. Further, they showed how branch lengths could be used to resolve the non-uniqueness question in this simple case. Here we show that the number of possible maximally parsimonious (with minimum number of HGT events) phylogenetic networks that reconcile a pair of trees may actually be exponential. Further, we discuss when branch lengths may not be sufficient to resolve the non-uniqueness issue. The number of maximally parsimonious HGT scenarios that reconcile a pair of trees (species and gene trees, for example) may be exponentially large, as illustrated in Figure 5. The species and gene trees in the figure, ST and GT, respectively, contain 3k leaves and differ in that Xi 2 is closer to Xi1 than to Xi 3 in tree ST, and closer to Xi 3 than to Xi1 in tree GT, for 1 i k. For every triplet hXi1 ; Xi 2; Xi 3 i of taxa, one of three HGT edges is needed to reconcile the difference in topologies of the triplet based on the two trees ST and GT: (1) the edge Hi1 W Xi 3 ! Xi 2 , (2) the edge Hi 2 W Xi 2 ! Xi 3 , or (3) the edge Hi 3 W mi ! Xi1 , where mi is the edge incoming into the most recent common ancestor (node) of the triplet of taxa; these three scenarios are shown in Figure 6. To reconcile the differences among all k FIG. 5. A species tree ST and a gene tree GT with 3k leaves. The two trees differ in k places: the species tree has Xi1 and Xi 2 as siblings, whereas the gene tree has Xi 2 and Xi 3 as siblings (1 i k). There are 3k maximally parsimonious HGT scenarios that reconcile the two trees. 524 THAN ET AL. FIG. 6. The three possible scenarios for reconciling the topologies of the triplet hXi1 ; Xi 2 ; Xi 3 i based on the species and gene trees, ST and GT, respectively, in Figure 5. triplets, there are 3k HGT scenarios, since there are k triplets to reconcile, and for each triplet there are three possible reconciliations. Two observations are in order. First, since the donor and recipient of a gene have to co-exist in time (Moret et al., 2004), and given that the topology of a phylogeny defines a partial order on the set of extant and ancestral taxa (ancestral taxa precede their descendants in this partial order), it follows that edge Hi 3 can be part of an HGT solution only if certain taxa went extinct or were not sampled. This case is illustrated in Figure 6, where the dashed line represents the lineage for taxon Xi which is not present in the set of taxa under consideration but whose existence must be invoked to explain the HGT edge Hi 3 . Let ıST and ıGT be the pairwise distance matrices of the set of taxa based on the species and gene trees ST and GT, respectively, in Figure 5, and let us consider the triplet of taxa in Figure 6. There are three cases. (1) The scenario Hi1 is plausible if and only if ıST .Xi1 ; Xi 3/ ıGT .Xi1 ; Xi 3/ and ıST .Xi1 ; Xi 2/ 6 ıGT .Xi1 ; Xi 2 /. (2) The scenario Hi 2 is plausible if and only if ıST .Xi1 ; Xi 2 / ıGT .Xi1 ; Xi 2 /. (3) The scenario Hi 3 is plausible if and only if If ıST .Xi 2 ; Xi 3 / ıGT .Xi 2 ; Xi 3 /. Since the conditions in the three cases are mutually exclusive, it follows the branch lengths, when estimated accurately, can be used to correctly resolve the non-uniqueness issue in this case. However, estimating branch lengths to a high degree of accuracy such that the above three cases are distinguished accurately is a very challenging task. Further, even if branch lengths are estimated accurately, if the evolutionary distance between the donor and recipient is very small, distinguishing among the cases becomes more challenging. 4.1. On the number of minimal HGT scenarios: theoretical results In this section, we derive an upper bound on the number of minimal HGT scenarios for reconciling two trees. Given two trees ST and GT, we denote by ST;GT the number of HGT edges in any solution to the HGT Reconstruction Problem, and by N ST;GT the set of all solutions. When the context is clear, we omit the tree names from the superscript. Given a species tree ST and a gene tree GT, each with n leaves, there are O.n2 / different HGT edges that can be added to it. If the cardinality of a minimal solution to the HGT Reconstruction Problem on the pair .ST; GT/ is k, then there can be at most O.n2k / solutions of size k. We now provide a tighter upper bound on the number of solutions, and show a pair of trees for which this upper bound is exact. Theorem 1. For any pair of trees, ST and GT, we have jN j 3 : Proof. We prove the theorem by induction on . For the base case, let D 1. Then, there are two subtrees t1 and t2 that are siblings in GT but not siblings in ST. There are two cases: (1) on the undirected path from the root of t1 to t2 there are exactly two nodes (excluding the roots of t1 and t2 ), or (2) on that path, excluding the roots, there are more than two nodes. Notice that there has to be at least one node on the path, which is the least common ancestor of the two subtrees. In the first case, there is exactly one subtree t 0 that lies between t1 and t2 in the species tree (Fig. 7). There are three possible ways, in this case, to make t1 and t2 siblings, and these correspond to the three HGT edges (numbered 1, 2, and 3) in Figure 7. Hence, there are three solutions to the HGT reconstruction problem. In the second case, only one solution is possible, since the location of the node in GT whose two children are the roots of t1 and CONFOUNDING FACTORS IN HGT DETECTION 525 FIG. 7. Illustration of case (1) in the base case of the inductive proof of Theorem 1. The tree is denoted by the solid lines, and its leaves are the leaves of subtrees t1 , t2 , and t3 . Three phylogenetic networks, each of which contains a single HGT edge and induces the same pair of trees, are obtained by adding one of the tree HGT edges, denoted by arrows 1, 2, and 3. While HGT edge 3 may indicate violation of the temporal co-existence of the donor and recipient of a gene, this may be possible in practice due to, for instance, incomplete taxon sampling or extinction (for further details, see Section 2.1). t2 is fixed, given that there is more than one subtree between these two subtrees. Hence, for D 1, we have jN j 3. For the induction step, assume the theorem holds for any pair of trees T1 and T2 , where T1 ;T2 < k, 0 and let ST and GT be two trees such that ST;GT D k. Then, there exists a tree T 0 such that ST;T D k 1 0 0 0 and T ;GT D 1. By the induction hypothesis, we have jN ST;T j 3k 1 and jN T ;GT j 3. Any network 0 0 N 2 N .ST; GT/ can be written as N D ST C .„0 C fhg/, where ST C „0 2 N ST;T and ST C fhg 2 N T ;GT . ST;GT k 1 1 k Hence, jN j3 3 D 3 . Figure 5 shows two trees ST and GT where jN j D 3 ; i.e., the two trees achieve the maximum number of solutions. 4.2. On the number of minimal HGT scenarios: empirical results In our simulation study (using the same experimental setup described in Section 3), we looked at the number of maximally parsimonious solutions that were computed by LatTrans and RIATA-HGT; the results for 25-taxon NJ trees are shown in Figure 8. All fours graphs show that, regardless of whether the actual or inferred species trees are used, both methods estimate a large number of maximally parsimonious solutions. The figures show that the number decreases as the sequences used become longer. When we ran the methods on the actual trees (Type I pairs of trees), both of them returned single solutions. A plausible conclusion is that as the amount of statistical error in the inferred trees increases, so does the number of maximally parsimonious solutions. The reason for this is that for shorter sequence lengths, the accuracy of the trees is poorer, i.e., they have more wrong edges. These wrong edges give an indication of more HGT events. This indication, though false, leads to larger numbers of solutions since more reconciliations become possible. Notice that the trends of the curves in Figure 8 are similar to those of the corresponding curves in Figure 3. This reflects the correlation between the number of HGT events in a minimal solution and the number of such minimal solutions, as stated in Theorem 1. However, at the same time, the proof of Theorem 1 illustrates cases where multiple solutions may exist (e.g., Figure 7 illustrates three “equivalent” HGT edges, that arise under a very specific case of incongruence between the species and gene trees). This indicates, that in some cases, even though the number of HGT events is smaller for case X than for case Y, it may be that the number of solutions for case X is larger than for case Y. This is the reason, for example, why the trend of the curve for the number of solutions in the case of three HGTs in Figure 8d does not match the trend of the number of HGT evens in the same case in Figure 3d. To illustrate this point further, the number of minimal HGT scenarios for the pair of trees in Figure 5 is 3k . On the other hand, it is straightforward to devise an example of a pair of trees, where the minimum number of HGT edges required to reconcile them is k C 1, and the number of solutions is 1. In this scenario, the case of fewer HGT edges gives rise to exponentially more minimal solutions than the case of more HGT edges. 526 THAN ET AL. FIG. 8. The number of minimal HGT scenarios inferred by LatTrans and RIATA-HGT as a function of the sequence length. Each curve corresponds to one of the five actual numbers of HGT events: ?, one HGT; 4, two HGTs; C, three HGTs; , four HGTs; and ı, five HGTs. 25-taxon trees inferred using NJ. An important conclusion is that, in the absence of an evolutionary model of HGT, phylogeny-based HGT detection methods should be designed to compute “all” possible solutions. As illustrated in Figure 5, the number of such solutions may be exponential, though. A measure that assigns support to these solutions is imperative, so that they can be rank ordered. 5. INCORPORATING HGT INTO THE COALESCENT As we described in Section 2, phylogenetic incongruence may occur due to various processes, of which HGT is only one. Another such process is lineage sorting, whose effect and confusing signal to HGT detection is particularly important when analyzing genes of closely related organisms. In this section, we augment the coalescent model by incorporating HGT, thus providing a framework for stochastically distinguishing among these two processes as the actual source of phylogenetic incongruence. Lineage sorting occurs because of random contribution of each individual to the next generation. Some fail to have offsprings while some happen to have multiple offsprings. In population genetics, this process was first modeled by R.A. Fisher and S. Wright, in which each gene of the population at a particular generation is chosen independently from the gene pool of the previous generation, regardless of whether the genes are in the same individual or in different individuals. Under the Wright-Fisher model, “the coalescent” considers the process backward in time (Kingman, 1982; Hudson, 1983b; Tajima, 1983). That is, the ancestral lineages of genes of interest are traced from offsprings to parents. A coalescent event occurs when two (or sometimes more) genes are originated from the same parent, which is called the most recent common ancestor (MRCA) of the two genes. CONFOUNDING FACTORS IN HGT DETECTION 527 The basic process can be treated as follows. Consider a pair of genes at time 1 in a random mating haploid population. The population size at time is denoted by N./. The probability that both genes are from the same parental gene at the previous generation (time 1 C 1) is 1=N.1 C 1/. Therefore, starting at 1 , the probability that the coalescence between the pair occurs at 2 is given by 1 Prob.2 / D N.2 / 2 1 Y 1 1 : N./ D1 C1 (1) When N./ is constant, the probability density distribution (pdf) of the coalescent time (i.e., t D 2 1 ) is given by a geometric distribution, and can be approximated by an exponential distribution for a large N : Prob.t/ D 1 e N t =N : (2) The coalescent process is usually ignored in phylogenetic analysis, but has a significant effect (causing lineage sorting) when closely related species are considered (Hudson, 1983a; Takahata, 1989; Rosenberg, 2002). The situation of Figure 1b is reconsidered under the framework of the coalescent in Figure 9. Here, it is assumed that species A and B split T1 D 5 generations ago, and the ancestral species of A and B and species C split T2 D 19 generation ago. The ancestral lineage of a gene from species A and that from B meet in their ancestral population at time D 6, and they coalesce at D 33, which predates T2 , the speciation time between .A; B/ and C . The ancestral lineage of B enters in the ancestral population of the three species at time D 20, and first coalesces with the lineage of C . Therefore, the gene tree is represented by A.BC / while the species tree is .AB/C . That is, the gene tree and species tree are “incongruent.” Under the model in Figure 9, the probability that the gene tree is congruent with the species tree is 0.863, which is one minus the product of the probability that the ancestral lineages of A and B do not coalesce between D 6 and D 9, and the probability that the first coalescence in the FIG. 9. An illustration of the coalescent process in a three species model with discrete generations. The process is considered backward in time from present, T0 , to past. Circles represent haploid individuals. We are interested in the gene tree of the three genes (haploids) from the three species. Their ancestral lineages are represented by closed circles connected by lines. A coalescent event occurs when a pair of lineages happen to share a single parental gene (haploid). 528 THAN ET AL. ancestral population of the three species occurs between (A and C ) or (B and C ). The former probability 15 14 12 11 10 8 is 16 .1 87 /8 D 0:26 and the latter is 23 . 15 13 12 11 9 Under the three-species model (Fig. 9), there are three possible types of gene tree, .AB/C , .AC /B and A.BC /. Let ProbŒ(AB)C, ProbŒ(AC)B and ProbŒA(BC) be the probabilities of the three types of gene tree. These three probabilities are simply expressed with a continuous time approximation when all populations have equal and constant population sizes, N , where N is large: ProbŒ(AB)C D 1 2 e 3 .T2 T1 /=N ; (3) and ProbŒ(AC)B D ProbŒA(BC) D 1 e 3 .T2 T1 /=N : (4) Figure 10a shows the three probabilities as functions of .T2 T1 /=N . It is important to notice that the estimation of the gene tree from DNA sequence data is based on the nucleotide differences between sequences, and that the gene tree is sometimes unresolved. One of the reasons for that is a lack of nucleotide differences such that DNA sequence data are not informative enough to resolve the gene tree. This possibility strongly depends on the mutation rate. Let be the mutation rate per region per generation, and consider the effect of mutation on the estimation of the gene tree. We consider the simplest model of mutations on DNA sequences, the infinite site model (Kimura, 1969), in which mutation rate per site is so small that no multiple mutations at a single site are allowed. Consider a gene tree, .AB/C , and suppose that we have a reasonable outgroup sequence such that we know the sequence of the MRCA of the three sequences. It is obvious that mutations on the internal branch between the MRCA of the three and the MRCA of A and B are informative. If at least one mutation occurred on this branch, the gene tree can be resolved from the DNA sequence alignment. This effect is investigated by assuming that the number of mutations on a branch with length t follows a Poisson distribution with mean t. Figure 10b shows the probability that the gene tree is resolved; T2 T1 D 0:5N generations is assumed so that the probability that the gene tree is .AB/C is about 0.6. As expected, as the mutation rate increases, the probability that the gene tree is resolved from the sequence alignment increases, and this probability exceeds 90% when N > 1:52. Similar results are obtained for the other two types of trees, .AC /B and A.BC /, that appear with probability 0.2 for each (Fig. 10b). Thus far, we have shown that the gene tree is not always identical to the species tree even considering vertical evolution. With keeping this in mind, let us consider the effect of horizontal gene transfer (HGT) FIG. 10. (a) The probabilities of the three types of gene tree, (AB)C, (AC)B, and A(BC), as functions of .T2 T1 /=N . (b) The probabilities that the gene tree is resolved from DNA sequence data. The probabilities are given as functions of the mutation rate for the three types of tree, (AB)C, (AC)B, and A(BC), when .T2 T1 /=N D 0:5. The white regions represent the probabilities that the gene tree is not resolved. CONFOUNDING FACTORS IN HGT DETECTION 529 on the gene tree under the framework of the coalescent. The application of the coalescent theory to bacteria is straightforward. Rather than the Wright-Fisher model, bacterial evolution may be better described by the Moran model, which handles overlapping generations well. Suppose that each haploid individual in a bacterial population with size N has a lifespan that follows an exponential distribution with mean l . When an individual dies, another individual randomly chosen from the population replaces it to keep the population size constant. In other words, one of the N 1 alive lineages is duplicated to replace the dead one. Under the Moran model, the ancestral lineages of individuals of interest can be traced backward in time, and the coalescent time between a pair of individuals follows an exponential distribution with mean lN=2 (Ewens, 1979; Rosenberg, 2005). This means that one half of the mean lifetime in the Moran model corresponds to one generation in the Wright-Fisher model. It may usually be thought that HGT can be detected when the gene tree and species tree are incongruent (see Section 2). However, the situation is complicated when lineage sorting is also involved. Consider a model with three species, A, B, and C , in which an HGT event occurs from species B to C . Suppose the ancient circular genome has a single copy of a gene as illustrated in Figure 11a. Let a, b and c be the focal orthologous genes in the three species, respectively. At time Th , a gene escaped from species B and was inserted in a genome in species C at Ti , which is denoted by c 0 . Since HGT is assumed to be instantaneous at the scale of evolution, in reality, it is always the case that Ti D Th . However, since these times are estimated in practice, it may be the case that Th < Ti . For example, if a gene duplication occurs in lineage b in Figure 11a, and one of the two in-paralogs is transferred to c, then the estimated time Th would be the duplication time, which is earlier than the actual time of the HGT events, Ti . Following the HGT event, c was physically deleted from the genome, so that each of the three species currently has a single copy of the focal gene. If there is no lineage sorting, the gene tree should be a.bc 0 /. Since this tree is incongruent with the species tree, .AB/C , we could consider it as an evidence for HGT. However, as shown in Section 2, lineage sorting could also produce the incongruence between the gene tree and species tree without HGT. It is also important to note that lineage sorting, coupled with HGT, could produce a congruent gene tree, as illustrated in Figure 11a. Although b and c 0 have a higher chance to coalesce first, the probability that the first coalescence occurs between a and b or between a and c 0 may not be negligible especially when T1 Th is short. The probabilities of the three types of gene tree can be formulated under this tri-species model with HGT as illustrated in Figure 11a. Here, Th could exceed T1 , in such a case it can be considered that HGT occurred before the speciation between A and B. Assuming that all populations have equal (constant) population sizes, N , the three probabilities can be obtained modifying (3) and (4): 8̂ 1 .T1 ˆ <3e ProbŒ(AB)C D ˆ :̂1 2 e 3 ProbŒ(AC)B D 8̂ 1 ˆ <3e ˆ1 :̂ e 3 Th /=N if Th T1 , .Th T1 /=N .T1 Th /=N if Th > T1 if Th T1 , .Th T1 /=N (5) (6) if Th > T1 and ProbŒA(BC) D 8̂ ˆ <1 ˆ1 :̂ e 3 2 e 3 .T1 Th /=N .Th T1 /=N if Th T1 . if Th > T1 Figure 11b shows the three probabilities assuming T1 D 2N and T2 D 3N . (7) 530 THAN ET AL. FIG. 11. (a) A three bacterial species model with an HGT event. A demonstration that a congruent tree could be observed even with HGT. (b) The probabilities of the three types of gene tree, .ab/c 0, .ac 0 /b, and a.bc 0/, as functions of Th =N . T1 D 2N and T2 D 3N are assumed. 5.1. Enumerating valid coalescent scenarios As mentioned above, a coalescent event occurs when two (or sometimes more) genes are originated from the same parent, which is called the most recent common ancestor (MRCA) of the two genes. A valid coalescent scenario for a gene tree GT and a species tree ST is a list of coalescence events in the gene tree together with the edges of the species tree on which they occur (Degnan and Salter, 2005; Rosenberg, 2007). For example, I and F in the gene tree of Figure 12 cannot coalesce at edge 2 in the species tree, and hence any coalescent scenario in which I and F coalesce at that edge is invalid. On the other hand, there are valid coalescent scenarios in which all taxa coalesce at edge 3. Rosenberg (2007) has recently provided a closed-form formula, rather than an algorithm, for computing the number of all valid coalescent scenarios of a gene within the branches of a species tree. We now provide a polynomial-time dynamic programming algorithm for enumerating all valid coalescent scenarios, given a pair of species/gene trees, over the same set of taxa, but not necessarily bifurcating or having the same topology. The divide-and-conquer nature of our algorithm allows for computing sharing among the many (potentially exponential) valid coalescent scenarios, and hence for more efficient implementation to list these scenarios. We first start with some definitions and terminology. Let T be a rooted tree leaf-labeled by a set X of taxa; i.e., there is a bijection f W L.T / ! X , where L.T / denotes the set of the tree leaves. The set E.T / denotes the set of all internal edges (including an edge incoming into the root, which we denote by re). Further, the tree edges are labeled via a post-order numbering; i.e., there is a bijection hT W E.T / ! f1; : : : ; n 1g, where n D jX j, and hT respects a FIG. 12. Illustration of incongruent species/gene trees. The three numbered edges in tree ST are the only elements of the set E, which is the set of all species tree edges on which at least one cluster of taxa from the gene tree can coalesce at. CONFOUNDING FACTORS IN HGT DETECTION 531 post-order labeling. Tree T induces a set CT D fceT W ceT X ; e 2 E.T /g of clusters of taxa, where ceT is the set of all leaves in X which are “under” edge e in T . The topology of the tree T naturally defines a partial order T on C (indeed, the topology of T is the Hasse diagram of this partial order). Definition 2. Given two trees, ST and GT, over the same set X of taxa, a (valid) coalescent history is a mapping ˛ W CGT ! E.ST/, such that: ST 1. For every X 2 CGT , X c˛.X / , and 2. For every two edges e1 ; e2 2 E.GT/, if hGT .e1 / < hGT .e2 / then hST .˛.e1 // hST .˛.e2 //. Condition (1) of Definition 2 states that a cluster X of taxa can coalesce on branches of the species tree that are “above” the most recent common ancestor (MRCA) of X in the tree. Condition (2) of Definition 2 states that the coalescent events of the gene within the branches of the species tree must respect the gene tree topology. Let ST and GT be the species and gene trees, respectively. Let the edges of ST be numbered as above, and let C be the set of all clusters (of size 2) of taxa in the tree GT. We write v D caST .x/, for x 2 C , to denote that v is a common ancestor (node) of the set x of taxa in tree ST. We write InEdge.v/ to denote the edge incident into node v (in v’s tree). The set of all edges in ST on which any cluster in C can coalesce at is E D fInEdge.v/ W v D caST .x/; x 2 C g: For example, E D f1; 2; 3g for the species and gene trees in Figure 12. We say that e 2 E is a “lowest” edges if there does not exist any other edge e 0 2 E such that e lies on the path from the root of the tree to e 0 . We define children of a cluster c 2 C as Children.c/ D fc 0 2 C W c 0 c; and 6 9c 00 2 C s:t: Œc 0 00 c ^ c 00 cg: Notice that Children.c/ induces a partition of c. In other words, for every c1 ; c2 2 C , c1 ¤ ;, c1 \ c2 D ;, and [c 0 2C c 0 D c. With every cluster c 2 C , we associate the set pc , which is pc D fInEdge.v/ W v D caST .c/g: In other words, pc is actually the path of edges in the species tree on which the taxa of cluster c could coalesce at. For example, for cluster c D ABD in Figure 12 we have pc D f1; 3g. For every cluster c and edge e, we define c .e/ to be the total number of all valid coalescence scenarios of leaves in c when the MRCA of all leaves in c lies on edge e. The recursive algorithm, Compute, outlined in Figure 13, computes the values of c .e/ for a cluster c 2 C and edge e 2 E. The recursion can be eliminated in a straightforward manner if the computation of is done in a bottom-up fashion. The number of valid coalescent scenarios, given a species tree ST and a gene tree GT is Compute(ST, GT, L.ST/, re). We illustrate the algorithm on the trees shown in Figure 12. The results are shown in Table 1. The following two theorems establish the correctness and running time of computing the number of valid coalescent scenarios of a pair of species and gene trees. FIG. 13. Algorithm Compute for computing the value c .e/ for a cluster c 2 C and an edge e 2 E, given a species tree ST and a gene tree GT. 532 THAN ET AL. TABLE 1. C LUSTERS Cluster c 1: 2: 3: 4: 5: 6: 7: 8: BD ABD CE IF CEIF ABDCEIF GH ABCDEFGHI IN C, AND THE VALUES OF pc AND c .e/ FOR THE T REES IN F IGURE 12 Set pc c .e/ 8e 2 E 1,3 1,3 1,3 3 3 3 2,3 3 1 .1/ D 1 .3/ D 1; 1 .2/ D 0 2 .1/ D 1; 2 .2/ D 0; 2 .3/ D 1 .1/ C 1 .3/ D 2 3 .1/ D 3 .3/ D 1; 3 .2/ D 0 4 .1/ D 4 .2/ D 0; 4 .3/ D 1 5 .1/ D 5 .2/ D 0; 5 .3/ D .3 .1/ C 3 .3//.4 .3// D 2 1 D 2 6 .1/ D 6 .2/ D 0; 6 .3/ D .2 .1/ C 2 .3//.5 .3// D 3 2 D 6 7 .1/ D 0; 7 .2/ D 7 .3/ D 1; 8 .1/ D 8 .2/ D 0; 8 .3/ D .6 .3//.7 .2/ C 7 .3// D 6 2 D 12 The total number of valid coalescent scenarios is 12, which is 8 .3/. Theorem 2. Compute(ST, GT, L.ST/, re) is the number of valid coalescent scenarios of the species tree ST and the gene tree GT. When the recursion in algorithm Compute is replaced by bottom-up computation, the algorithm takes O.n2 / time, since for each cluster c, the algorithm considers only its children (whose values have already been computed). Given that there are O.n/ clusters in the gene tree (each defined by an internal edge in the tree), and that for each cluster there may be O.n/ children, the computation can be achieved in O.n2 / time. Theorem 3. jL.GT/j. Compute(ST, GT, L.ST/, re) can be computed in O.n2 / time, where n D jL.ST/j D 5.2. Coalescent versus HGT scenarios Experimental settings. In this experiment we compared the number of valid coalescent and minimal HGT scenarios returned by the algorithm described in Section 5.1 and the extended RIATA-HGT heuristic, respectively, for a range of numbers and diameters of simulated incongruence events.2 The diameter of an incongruence event is the number of tree edges separating the two endpoints of the SPR move to which it corresponds. This quantity reflects the “locality” of the incongruities between the two trees. For each number n 2 f1; 2; 3; 4g and diameter d 2 f2; 4; 6; 8g of simulated incongruence events, we generated twenty pairs ht; t 0 i of 20-taxon trees, where t was generated using the r8s tool (Sanderson, n.d.), and t 0 was obtained from t by simulating n random incongruence events, each of diameter d . The coalescent algorithm and extended RIATA-HGT were run on each such pair within the data set and the numbers of solutions computed were averaged to give a single data point for each method and combination of diameter/number of incongruence events. The results of this analysis are shown in Figure 14. Results and discussion. Figure 14 clearly shows that there is a correlation between the diameter of incongruence events and the number of valid coalescent and minimal HGT scenarios. Small diameters reflect that the incongruence occurs between two very close taxa, whereas large diameters reflect incongruence between two very distant taxa. As the diameter gets larger, the number of edges between the MRCA of taxa and the root becomes smaller, and hence we would expect the number of valid coalescent scenarios to become smaller. And this is exactly what we see in the figure. For diameter of 2, the number of such scenarios is over 200 million. However, there is a sharp decrease in that number as the diameter increases. On the other hand, even though we see a similar trend in the decrease of number of HGT scenarios as the diameter increases, the actual number of minimal HGT scenarios is drastically much smaller. Even for the 2 We simulated an incongruence event by a subtree prune and regraft (SPR) move, where these moves were added with certain restrictions, as described in Section 3. CONFOUNDING FACTORS IN HGT DETECTION 533 FIG. 14. The numbers of valid coalescent scenarios (a) and minimal HGT scenarios (b) as a function of the diameter of the incongruence events. Each curve corresponds to datasets with one, two, three, or four incongruence events. smallest diameter of 2, the number of minimal HGT scenarios is about 33, and as the diameter increases, the number of solutions converges to 1. A significant observation from Figure 14b is that we either have small diameter and large number of solutions, or a large diameter and small number of solutions. This observation proves crucial to the improvements achieved by the algorithmic techniques described in the previous section, since small diameters indicate more pairs in the decomposition, and hence more efficiency, and small number of solutions for large diameters indicate very small equivalence classes in large components of the decomposition. 6. CONCLUSION In this paper, we showed that error in inferred trees has a negative impact on the estimates made by phylogeny-based HGT detection methods. These results provide a set of conclusions. First, to obtain accurate estimates of HGT based on tree incongruence, poorly supported edges of reconstructed trees should be removed. Though an important task to conduct, removing (or contracting) poorly supported edges is not a straightforward task, since standard methods that are in common use for estimating branch support, such as bootstrapping and posterior probabilities in Bayesian analyses, have been shown to be overly “conservative” or “liberal” under various circumstances (Ruths and Nakhleh, 2006). Second, more than one maximally parsimonious solution (a solution that has the minimum number of HGT edges, or events, to explain the species and gene tree incongruence) may exist, and hence HGT detection methods should search for all such solutions. In this preliminary work, we have studied the effect of error in inferred trees on the accuracy of HGT detection methods, both in terms of the minimum number of events computed as well as the number of such minimal solutions. One of our immediate goals is to study the performance of these methods in terms of the locations (donors and recipients) of inferred HGT; for this task, we will use the distance measures proposed in Moret et al. (2004). Further, we will study the effects of the aforementioned factors, using both simulated and biological data, on the performance of several currently available tools for HGT detection (Hallett and Lagergren, 2001; Makarenkov, 2001; Addario-Berry et al., 2003; Nakhleh et al., 2005; MacLeod et al., 2005; Beiko and Hamilton, 2006). Further, lineage sorting due to the coalescent process acts as a noise for detecting and reconstructing HGT based on tree incongruence, sometimes mimicking the evidence for HGT and sometimes concealing evidence of HGT. Therefore, to distinguish HGT and lineage sorting, a stochastic framework based on the theory introduced in Section 5 is needed. We only considered very simple cases with three species here, and we will extend the theory to more general cases. Finally, we designed a polynomial-time dynamic programming algorithm for enumerating all valid coalescent scenarios that reconcile a pair of species and gene trees. This algorithm may be used as a core component in statistical methods for reconciling species and gene trees. 534 THAN ET AL. ACKNOWLEDGMENTS We are grateful to Noah Rosenberg for comments and discussion of the problem of enumerating valid coalescent scenarios. We would like to thank the three anonymous reviewers for their valuable comments. This work is supported in part by the Department of Energy grant DE-FG02-06ER25734 (Luay Nakhleh), the National Science Foundation grant CCF-0622037 (Hideki Innan and Luay Nakhleh), and the George R. Brown School of Engineering Roy E. Campbell Faculty Development Award (Luay Nakhleh). REFERENCES Addario-Berry, L., Hallett, M.T., and Lagergren, J. 2003. Towards identifying lateral gene transfer events. Proc. 8th Pacific Symp. Biocomput., 279–290. Atteson, K. 1999. The performance of the neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25, 251–278. Beiko, R.G., and Hamilton, N. 2006. Phylogenetic identification of lateral genetic transfer events. BMC Evol. Biol. 6, 15–31. Bordewich, M., and Semple, C. 2005. On the computational complexity of the rooted subtree prune and regraft distance. Ann. Combin. 1–15. Daubin, V., Moran, N.A., and Ochman, H. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301, 829–832. Degnan, J.H., and Salter, L.A. 2005. Gene tree distributions under the coalescent process. Evolution 59, 24–37. Doolittle, W.F., Boucher, Y., Nesbo, C.L., et al. 2003. How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Phil. Trans. R. Soc. Lond. B. Biol. Sci. 358, 39–57. Ewens, W.J. 1979. Mathematical Population Genetics. Springer-Verlag, Berlin. Hallett, M.T., and Lagergren, J. 2001. Efficient algorithms for lateral gene transfer problems. Proc. RECOMB 2001, 149–156. Hudson, R.R. 1983a. Testing the constant-rate neutral allele model with protein sequence data. Evolution 37, 203–217. Hudson, R.R. 1983b. Properties of the neutral allele model with intergenic recombination. Theor. Popul. Biol. 23, 183–201. Jin, G., Nakhleh, L., and Than, C. 2007. Integrating sequence and topology for efficient and accurate detection of horizontal gene transfer (submitted). Kimura, M. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61, 893–903. Kingman, J.F.C. 1982. The coalescent. Stochast. Proc. Appl. 13, 235–248. Kunin, V., and Ouzounis, C.A. 2003. The balance of driving forces during genome evolution in prokaryotes. Genome Res. 13, 1589–1594. Kunin, V., Goldovsky, L., Darzentas, N., et al. 2005. The net of life: reconstructing the microbial phylogenetic network. Genome Res. 15, 954–959. Lerat, E., Daubin, V., and Moran, N.A. 2003. From gene trees to organismal phylogeny in prokaryotes: the case of the -proteobacteria. PLoS Biol. 1, 1–9. MacLeod, D., Charlebois, R.L., Doolittle, F., et al. 2005. Deduction of probable events of lateral gene transfer through comparison of phylogenetic trees by recursive consolidation and rearrangement. BMC Evol. Biol. 5, 27–37. Maddison, W.P. 1997. Gene trees in species trees. Syst. Biol. 46, 523–536. Makarenkov, V. 2001. T-REX: reconstructing and visualizing phylogenetic trees and reticulation networks. Bioinformatics 17, 664–668. Moret, B.M.E., Nakhleh, L., Warnow, T., et al. 2004. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Trans. Comput. Biol. Bioinform. 1, 13–23. Nakhleh, L., Warnow, T., and Linder, C.R. 2004. Reconstructing reticulate evolution in species—theory and practice. RECOMB 2004, 337–346. Nakhleh, L., Ruths, D., and Wang, L.S. 2005. RIATA-HGT: a fast and accurate heuristic for reconstrucing horizontal gene transfer. Lect. Notes Comput. Sci. 3595, 84–93. Ochman, H., Lawrence, J.G., and Groisman, E.A. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304. Paulsen, I.T., Banerjei, L., Myers, G.S., et al. 2003. Role of mobile DNA in the evolution of Vancomycin-resistant Enterococcus faecalis. Science 299, 2071–2074. Rambaut, A., and Grassly, N.C. 1997. Seq-gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13, 235–238. CONFOUNDING FACTORS IN HGT DETECTION 535 Rosenberg, N. 2002. The probability of topological concordance of gene trees and species tree. Theoret. Popul. Biol. 61, 225–247. Rosenberg, N.A. 2005. Gene genealogies. In: Fox, C.W., and Wolf, J.B., eds., Evolutionary Genetics: Concepts and Case Studies. Chapter 15. Oxford University Press, New York. Rosenberg, N.A. 2007. A recursion for the number of valid coalescent histories. J. Comput. Biol. (in press). Ruths, D., and Nakhleh, L. 2006. Techniques for assessing phylogenetic branch support: a performance study. Proc. Fourth Asia-Pacific Bioinform. Conf. (APBC 06), 187–196. Saitou, N., and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4, 406–425. Sanderson, M. n.d. r8s software package. Available at: http://loco.ucdavis.edu/r8s/r8s.html. Accessed April 1, 2007. Swofford, D.L. 1996. PAUP : Phylogenetic Analysis Using Parsimony (and Other Methods), Version 4.0. Sinauer Associates, Sunderland, MA. Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105, 437–460. Takahata, N. 1989. Gene genealogy in three related populations: consistency probability between gene and population trees. Genetics 122, 957–966. Welch, R.A., Burland, V., Plunkett, G., et al. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl. Acad. Sci. USA 99, 17020–17024. Zwickl, D., and Hillis, D. 2002. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51, 588–598. Address reprint requests to: Dr. Luay Nakhleh Department of Computer Science Rice University 6100 Main St. Houston, TX 77005 E-mail: nakhleh@cs.rice.edu

Log In

Confounding Factors in HGT Detection: Statistical Error, Coalescent Effects, and Multiple Solutions