A Decomposition Theory for Phylogenetic Networks and Incompatible Characters

vikas bansal

A Decomposition Theory for Phylogenetic Networks and Incompatible Characters

Journal of Computational Biology, 2007

A Decomposition Theory for Phylogenetic Networks and Incompatible Characters Dan Gusﬁeld 1,∗ , Vikas Bansal 2 , Vineet Bafna 2 , and Yun S. Song 1,3 1 Department of Computer Science, University of California, Davis 2 Department of Computer Science and Engineering, University of California, San Diego 3 Section of Evolution and Ecology, University of California, Davis E-mail: gusﬁeld@cs.ucdavis.edu (Dan Gusﬁeld), vibansal@cs.ucsd.edu (Vikas Bansal), vbafna@cs.ucsd.edu (Vineet Bafna), and yssong@cs.ucdavis.edu (Yun S. Song) Keywords: Molecular Evolution, Phylogenetic Networks, Perfect Phylogeny, Ancestral Recombi- nation Graph, Recombination, Gene-Conversion, SNP Running Head: A Decomposition Theory for Phylogenetic Networks ∗ Corresponding Author: Dan Gusﬁeld Department of Computer Science University of California, Davis 2063 Kemper Hall Davis, CA 95616 U.S.A. Tel: +1 530 752 7131 Fax: +1 530 752 4767 gusﬁeld@cs.ucdavis.edu 1

Abstract Phylogenetic networks are models of evolution that go beyond trees, incorporating non-tree-like biological events such as recombination (or more generally reticulation), which occur either in a single species (meiotic recombination) or between species (reticulation due to lateral gene transfer and hybrid speciation). The central algorithmic problems are to reconstruct a plausible history of mutations and non-tree-like events, or to determine the minimum number of such events needed to derive a given set of binary sequences, allowing one mutation per site. Meiotic recombination, reticulation and recurrent mutation can cause conﬂict or incompatibility between pairs of sites (or characters) of the input. Previously, we used “conﬂict graphs” and “incompatibility graphs” to compute lower bounds on the minimum number of recombination nodes needed, and to eﬃciently solve constrained cases of the minimization problem. Those results exposed the structural and algorithmic importance of the non-trivial connected components of those two graphs. In this paper, we more fully develop the structural importance of non-trivial connected com- ponents of the incompatibility and conﬂict graphs, proving a general decomposition theorem (ﬁrst presented in Gusﬁeld and Bansal 2005) for phylogenetic networks. The decomposition theorem depends only on the incompatibilities in the input sequences, and hence applies to phylogenetic networks of all types, and to any phenomena that causes pairwise incompatibilities. More generally, the proof of the decomposition theorem exposes a maximal embedded tree structure that exists in the network when the sequences cannot be derived on a perfect phylogenetic tree. This extends the theory of perfect phylogeny in a natural and important way. The proof is constructive and leads to a polynomial-time algorithm to ﬁnd the unique underlying maximal tree structure. We next ex- amine and fully solve the major open question from Gusﬁeld and Bansal (2005): Is it true that for every input there must be a fully decomposed phylogenetic network that minimizes the number of recombination nodes used, over all phylogenetic networks for the input. We previously conjectured that the answer is yes. In this paper we show that the answer in is no, both for the case that only single-crossover recombination is allowed, and also for the case that unbounded multiple-crossover recombination is allowed. The latter case also resolves a conjecture recently stated in Huson and Klopper (2007) in the context of general reticulation networks. Although the conjecture from Gus- ﬁeld and Bansal (2005) is disproved in general, we show that the answer to the conjecture is yes in several natural special cases, and establish necessary combinatorial structure that counterexamples to the conjecture must posses. We also show that counterexamples to the conjecture are rare (for the case of single-crossover recombination) in simulated data. 1 Introduction to Phylogenetic Networks and Problems With the growth of genomic data, much of which does not ﬁt ideal evolutionary-tree models, and the increasing appreciation of the genomic role of such phenomena as recombination, recurrent 2

A Decomposition Theory for Phylogenetic Networks and Incompatible Characters Dan Gusfield1,∗ , Vikas Bansal2 , Vineet Bafna2 , and Yun S. Song1,3 1 Department 2 Department of Computer Science, University of California, Davis of Computer Science and Engineering, University of California, San Diego 3 Section of Evolution and Ecology, University of California, Davis E-mail: gusfield@cs.ucdavis.edu (Dan Gusfield), vibansal@cs.ucsd.edu (Vikas Bansal), vbafna@cs.ucsd.edu (Vineet Bafna), and yssong@cs.ucdavis.edu (Yun S. Song) Keywords: Molecular Evolution, Phylogenetic Networks, Perfect Phylogeny, Ancestral Recombination Graph, Recombination, Gene-Conversion, SNP Running Head: A Decomposition Theory for Phylogenetic Networks ∗ Corresponding Author: Dan Gusfield Department of Computer Science University of California, Davis 2063 Kemper Hall Davis, CA 95616 U.S.A. Tel: +1 530 752 7131 Fax: +1 530 752 4767 gusfield@cs.ucdavis.edu 1 Abstract Phylogenetic networks are models of evolution that go beyond trees, incorporating non-tree-like biological events such as recombination (or more generally reticulation), which occur either in a single species (meiotic recombination) or between species (reticulation due to lateral gene transfer and hybrid speciation). The central algorithmic problems are to reconstruct a plausible history of mutations and non-tree-like events, or to determine the minimum number of such events needed to derive a given set of binary sequences, allowing one mutation per site. Meiotic recombination, reticulation and recurrent mutation can cause conflict or incompatibility between pairs of sites (or characters) of the input. Previously, we used “conflict graphs” and “incompatibility graphs” to compute lower bounds on the minimum number of recombination nodes needed, and to efficiently solve constrained cases of the minimization problem. Those results exposed the structural and algorithmic importance of the non-trivial connected components of those two graphs. In this paper, we more fully develop the structural importance of non-trivial connected components of the incompatibility and conflict graphs, proving a general decomposition theorem (first presented in Gusfield and Bansal 2005) for phylogenetic networks. The decomposition theorem depends only on the incompatibilities in the input sequences, and hence applies to phylogenetic networks of all types, and to any phenomena that causes pairwise incompatibilities. More generally, the proof of the decomposition theorem exposes a maximal embedded tree structure that exists in the network when the sequences cannot be derived on a perfect phylogenetic tree. This extends the theory of perfect phylogeny in a natural and important way. The proof is constructive and leads to a polynomial-time algorithm to find the unique underlying maximal tree structure. We next examine and fully solve the major open question from Gusfield and Bansal (2005): Is it true that for every input there must be a fully decomposed phylogenetic network that minimizes the number of recombination nodes used, over all phylogenetic networks for the input. We previously conjectured that the answer is yes. In this paper we show that the answer in is no, both for the case that only single-crossover recombination is allowed, and also for the case that unbounded multiple-crossover recombination is allowed. The latter case also resolves a conjecture recently stated in Huson and Klopper (2007) in the context of general reticulation networks. Although the conjecture from Gusfield and Bansal (2005) is disproved in general, we show that the answer to the conjecture is yes in several natural special cases, and establish necessary combinatorial structure that counterexamples to the conjecture must posses. We also show that counterexamples to the conjecture are rare (for the case of single-crossover recombination) in simulated data. 1 Introduction to Phylogenetic Networks and Problems With the growth of genomic data, much of which does not fit ideal evolutionary-tree models, and the increasing appreciation of the genomic role of such phenomena as recombination, recurrent 2 and back mutation, horizontal gene transfer, species hybridization, gene conversion, and mobile genetic elements, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks on which extant sequences were derived (Posada and Crandall, 2001; Morrison, 2005). Meiotic recombination between homologous chromosomes is particularly important in deriving chimeric sequences in a population of individuals of the same species, and understanding recombination in populations is a fundamental scientific goal. Moreover, recombination in populations is the key element underlying the logic of “association mapping”, an approach that is widely hoped to be able to efficiently locate genes influencing genetic diseases (Clark, 2003; Zollner and Pritchard, 2005; Minichiello and Durbin, 2006; Wu, 2007). On a much longer time scale, recombination between different species can occur (due to horizontal gene transfer), resulting in a hybrid species. This is called “reticulation” in Huson et al. (2005); Huson and Klopper (2007); Moret et al. (2004). Although meiotic recombination and species reticulation are very different biological events, the evolution of sequences in either biological context can be represented by similar phylogenetic networks and certain mathematical and algorithmic properties of these networks are identical or can be translated from one context to the other. In this paper, we study phylogenetic networks that derive binary sequences. Many of the results apply both to networks representing meiotic recombination and to networks representing species reticulation, while some of the results apply only to the case of meiotic recombination. The assumption that sequences are binary is justified in the case of meiotic recombination by the infinite sites assumption in population genetics (Hein et al., 2005), which is strongly justified by the current importance of SNP (single nucleotide polymorphism) data. In SNP data, each site can generally take on at most two states (alleles) (Chakravarti, 1998; Hinds et al., 2005) in a population. In the context of reticulation at the species level, the assumption that sequences are binary comes from that fact that complex evolutionary characters are usually considered to be binary (being either present or absent) (Felsenstein, 2004). Moreover, as we will detail later, reticulation problems or often defined by a set of splits (or cuts), each of which must be in some tree in the network, and these splits are represented by binary sequences. We begin with a formal definition of a phylogenetic network, primarily in the context of meiotic recombination; later when appropriate, we discuss how specific definitions and results can be applied more broadly to general reticulation. Formal definition of a phylogenetic network: There are four components needed to specify a phylogenetic network that allows recombination (see Figure 1). A phylogenetic network N is built on a directed acyclic graph containing exactly one node (the root) with no incoming edges, a set of internal nodes that have both incoming and outgoing edges, and exactly n nodes (the leaves) with no outgoing edges. Each node other than the root has either one or two incoming edges. A node x with two incoming edges is called a recombination node. 3 00000 M a: 00010 b: 10010 c: 00100 d: 10100 e: 01100 f: 01101 g: 00101 4 00010 3 a: 00010 1 00100 10010 5 2 00101 P 01100 S b:10010 P 3 c: 00100 S 4 01101 g: 00101 10100 d: 10100 f: 01101 e: 01100 Figure 1: A phylogenetic network that derives the set of sequences M . The two recombinations shown are single-crossover recombinations, and the crossover point is written above the recombination node. In general the recombinant sequence exiting a recombination node may be on a path that reaches another recombination node, rather than going directly to a leaf. Also, in general, not every sequence labeling a node also labels a leaf. Each integer (site) from 1 to m is assigned to exactly one edge in N , but for simplicity of exposition, none are assigned to any edge entering a recombination node. There may be additional edges that are assigned no integers. We use the terms “column” and “site” interchangeably. Each node in N is labeled by an m-length binary sequence, starting with the root node which is labeled with some sequence R, called the “root” or the “ancestral” sequence. Since N is acyclic, the nodes in N can be topologically sorted into a list, where every node occurs in the list only after its parent(s). Using that list, we can constructively label the non-root nodes with well-defined sequences in order of their appearance in the list, as follows: a) For a non-recombination node v, let e be the single edge coming into v. The sequence labeling v is obtained from the sequence labeling v’s parent by changing the state (from 0 to 1, or from 1 to 0) of site i, for every integer i assigned to edge e. This corresponds to a mutation at site i occurring on edge e (i.e., during the interval of time represented by edge e). b) For a recombination node x, let Z and Z ′ denote the two m-length sequences labeling the two parent nodes of x. Then the “recombinant sequence” X labeling node x can be any m-length sequence provided that at every site i in X, the state (0 or 1) is equal to the state at site i in (at least) one of the sequences Z or Z ′ . The creation of sequence X from Z and Z ′ at a recombination node is called a “recombination event”. To fully specify the recombination event, we must specify for every site i in X whether the binary state in X “comes from” Z or Z ′ . This specification is forced when the states in Z and Z ′ at site i are different. When they are the same, a choice must be specified. For a given recombination 4 event, we say that a crossover or breakpoint occurs at site i if the states in X at sites i − 1 and i come from different parents. It is easy to determine the minimum number of crossovers needed to create X by a recombination of specific sequences Z and Z ′ . The sequences labeling the leaves of N are the extant sequences, i.e., the sequences that can be observed. We say that an phylogenetic network N derives (or explains) a set of n sequences M (each of length m) if and only if each sequence in M labels one of the leaves of N . Without loss of generality, we assume that there is no site i where all the sequences have the same state at site i. We also assume throughout that M does not contain any duplicate rows. With these definitions, the classic “perfect phylogeny” tree (Gusfield, 1991) is a phylogenetic network with no recombination nodes. That is, each site mutates exactly once in the evolutionary history, and there is no recombination between sequences. Note that in the definition above, there is no bound on the number of crossovers that are allowed at a recombination event (other than the number of sites minus one). Thus, this form of recombination is called “unbounded multiple-crossover” recombination. The main decomposition theorem in this paper applies even when unbounded multiple-crossovers are allowed, and as we will show, multiple crossovers allows us to model a wide variety of biological phenomena. However, in meiotic recombination the number of crossovers is typically small, and the algorithmic/mathematical literature motivated by meiotic recombination has mostly assumed that only one crossover is allowed in a recombination event. This is called “single-crossover recombination”, and when it occurs the recombinant sequence X is formed from a prefix of one of its parent sequences (Z or Z ′ ) followed by a suffix of the other parent sequence. That is the definition of recombination used in our early publications on phylogenetic networks, for example Gusfield et al. (2004a,b). A few papers have studied two-crossover recombination events motivated by “homologous gene-conversion” in Song et al. (2006). For brevity, we will use the phrase “multiple-crossover recombination” to mean “unbounded multiple-crossover recombination”. Some mathematical and algorithmic results depend on whether each recombination event is restricted to a single-crossover, or whether multiple-crossovers are allowed, and we will be very careful about that distinction. What we have defined here as a phylogenetic network with single-crossover recombination is the digraph part of the stochastic process called an “ancestral recombination graph (ARG)” in the population genetics literature (Griffiths and Marjoram, 1996). (See also Norborg and Tavare 2002 or Hein et al. 2005 for an introduction to ARGs). Rooted and Root-Unknown problems: Problems of reconstructing phylogenetic networks, given an input set of binary sequences M , can be addressed either in the rooted case, or the rootunknown case. In the rooted phylogenetic network problem, a required root or ancestral sequence R for the network is specified in advance. In the root-unknown phylogenetic network problem, no ancestral sequence is specified in advance, and the algorithm must select an ancestral sequence at 5 the root. The algorithmic problem of reconstructing a history of recombination events (with mutations), or determining the minimum number of recombination nodes needed in a phylogenetic network (for both rooted and unrooted problems), has been studied in a number of papers (Gusfield et al., 2007; Hein, 1990, 1993; Song and Hein, 2003, 2004; Wang et al., 2001; Myers and Griffiths, 2003; Hudson and Kaplan, 1985; Kececioglu and Gusfield, 1998; Gusfield, 2005a; Gusfield et al., 2004b,a; Nakhleh et al., 2003, 2004; Moret et al., 2004; Bafna and Bansal, 2004, 2006a; Song et al., 2005; Lyngso et al., 2005; Song et al., 2006; Myers, 2003). 2 A Fundamental Decomposition Theory In this section we define and state the main result of this paper, the decomposition theorem. It will be proved in the next section. We believe the decomposition theorem and insights obtained from its proof are fundamental and extend the theory of perfect phylogeny from trees to general phylogenetic networks. We now begin the needed definitions and facts that lead to the statement of the main result. 2.1 Recombination Cycles and Blobs In a phylogenetic network N , let w be a node that has two paths out of it that meet at a recombination node x. Those two paths together define a “recombination cycle” Q. Node w is called the “coalescent node” of Q, and x is the recombination node of Q. In Figure 1, the nodes labeled 00000 and 00100 are coalescent nodes of two different recombination cycles. A recombination cycle that is node-disjoint from any other recombination cycle has been defined as a “gall” (Gusfield et al., 2004b; Gusfield, 2005a; Song, 2006b). If a recombination cycle is edgedisjoint from any other recombination cycle in the network, then the network can be modified so that the cycle is node-disjoint from any other cycle. The modification requires adding one new edge and one new node, with the same sequence labeling both ends of the new edge. Repeating this as needed, we can assume that if a recombination cycle does not share an edge with any other recombination cycle, then it also does not share a node with any other recombination cycle. In contrast, consider a recombination cycle that shares at least one edge with some other recombination cycle. We can add another cycle to those two if the new cycle shares an edge with one of the two cycles. Continuing in this way, adding more cycles, we ultimately get a well-defined maximal set of recombination cycles in N that form a single connected subgraph of N , and each cycle shares at least one edge with some other cycle in the set. We call such a maximal set of recombination cycles a “blob”. Clearly, because of maximality, the blobs in a phylogenetic network N are well-defined, i.e., each blob can be found as above, starting from any recombination cycle in the blob. Moreover, as 6 above, we assume that no blob shares a node with any other blob. Therefore, if we contract each blob in a network N to a single point, the resulting network is a directed tree T ′ . This follows because if the resulting graph had a cycle (in the underlying undirected graph) that cycle would correspond to a recombination cycle in N which should have been contracted. We call T ′ a “tree of blobs” or a “blobbed tree”. So every phylogenetic network N can be viewed as a blobbed tree. The edges in T ′ are called “tree edges” of N . 2.2 Incompatibility and Perfect Phylogeny The main tools that we used in Gusfield et al. (2004b, 2007); Gusfield (2005a); Bafna and Bansal (2004); Song (2006b) and other papers were two graphs representing “incompatibilities” and “conflicts” between sites. We introduce these graphs here. Given a set of binary sequences M , two columns i and j in M are said to be incompatible if and only if there are four rows in M where columns i and j contain all four of the ordered pairs 0,1; 1,0; 1,1; and 0,0. For example, in Figure 1 columns 1 and 3 of M are incompatible because of rows a, b, c, d. The test for the existence of all four pairs is called the “four-gamete test” in the population genetics literature. A site that is not involved in any incompatibility is called a “compatible site”. Given a sequence S, two columns i and j in M are said to conflict (relative to S) if and only if columns i and j contain all three of the above four pairs that differ from the i, j pair in S. It is easy to see that two columns conflict relative to S if and only if they are incompatible in M + S, the set resulting from adding sequence S to M . The classic Perfect Phylogeny Theorem (in the terminology of this paper) is: There is a rooted phylogenetic network N without recombination cycles that derives a set of binary sequences M , if and only if there is no incompatible pair of columns in M . Network N is a directed tree T , with the root labeled by a sequence that need not be in M . Moreover, the undirected tree created by ignoring the directions of the edges of T is invariant over all perfect phylogenies for M . Similarly, there is a unique phylogenetic network that derives M , with ancestral sequence S and without recombination cycles (and hence is a directed tree), if and only if there is no pair of columns that conflict relative to S. For one exposition of these classic results, see Gusfield (1997). For different expositions, see Theorem 3.1.4 in Semple and Steel (2003) or the Pairwise Compatibility Theorem in Felsenstein (2004). Incompatibility and Conflict Graphs: We define the “incompatibility graph” G(M ) for M as a graph containing one node for each column (site) in M , and an edge connecting two nodes i and j if and only if columns i and j are incompatible. Similarly, given a sequence S, we define the “conflict graph” GS (M ) for M (relative to S) as a graph containing one node for each column in M , and an edge connecting two nodes i and j if and only if columns i and j conflict relative to S. 7 00000 4 00010 3 a: 00010 1 00100 10010 P b:10010 M a: 00010 b: 10010 c: 00100 d: 10100 Incompatibility Graph e: 01100 for M. f: 01101 g: 00101 1 2 3 4 5 S 00100 c: 00100 3 5 2 10100 00101 01100 P d: 10100 e: 01100 S 4 g:00101 01101 f: 01101 Figure 2: The incompatibility graph G(M ) for the sequences M from Figure 1, and a fullydecomposed phylogenetic network that derives M . This network is also a galled-tree for M . It is easy to see that GS (M ) = G(M ) if S ∈ M , and so GS (M ) = G(M + S). Figure 2 shows the incompatibility graph G(M ) for M from Figure 1. A “connected component” (or “component” for short), C, of a graph is a maximal subgraph such that for any pair of nodes in C there is at least one path between those nodes in the subgraph. A “trivial” component has only one node, and no edges. The incompatibility graph in Figure 2 has two components. Previously (Gusfield et al., 2004b; Gusfield, 2005a; Gusfield et al., 2007, 2004a; Bafna and Bansal, 2004), the non-trivial connected components of the conflict and incompatibility graphs were shown to be very informative, used both to derive efficient algorithms and to expose combinatorial structure in phylogenetic networks. The structural importance of the non-trivial connected components is further developed in the main result, presented next. 2.3 The Decomposition Theorem Theorem 1. Let G(M ) be the incompatibility graph for M . Then, there is a phylogenetic network N that derives M where every blob in N contains all and only the sites of a single non-trivial connected component of G(M ), and every compatible site is on a tree edge of N . The result holds no matter what constraints, if any, are placed on the number of crossovers allowed at a recombination node. Stated another way, for any input M , there is a blobbed-tree that derives M , where the blobs are in one-one correspondence with the non-trivial connected components of G(M ), and if bC is the blob corresponding to component C, then bC contains all and only the sites in C. We call a network “fully-decomposed” if it has the structure specified in Theorem 1. Figure 2 shows a fully-decomposed network for the sequences M from Figure 1. Theorem 1 is an extension of the stronger theorem proved in Gusfield et al. (2004b) about 8 galled-trees. In the case of galled-trees, every reduced galled-tree for M must be fully-decomposed. A galled-tree is “reduced” if every recombination cycle contains some incompatible sites. When there is a galled-tree for M , there is a reduced galled-tree for M , and it can be found in polynomial time (Gusfield et al., 2004b; Gusfield, 2005a). This strong decomposition property of galled-trees was one of the main motivations for investigating decomposition in general phylogenetic networks and the general role of connected components of G(M ) in decomposition. It is easy to prove (Gusfield et al., 2004b) a statement converse to Theorem 1: In any phylogenetic network N that derives M , all sites from the same non-trivial connected component of G(M ) must appear on the same blob in N , and this does not depend on the number of crossovers used at recombination node. That is, it is not possible to split up the sites from a connected component of G(M ) into two or more blobs in N . Therefore the network that is guaranteed by the Decomposition Theorem is as highly decomposed into distinct blobs as possible, justifying the term “fully-decomposed”. There is an analogous theorem to Theorem 1 in the case that the ancestral sequence S is known in advance. In that case, there is a phylogenetic network N that derives M , with ancestral sequence S, where the blobs in N are in one-one correspondence with the non-trivial connected components of GS (M ), and any non-conflicting site is on a tree edge of N ; again, no further decomposition is possible. This follows from Theorem 1, and the fact that two sites conflict relative to S if and only if they are incompatible in M + S. We will prove the Decomposition Theorem in the next section and show that the tree parts of any fully-decomposed phylogenetic network are invariant. 3 Proof of The Decomposition Theorem In this section we prove Theorem 1. 3.1 The structure of M Let C and C ′ be two distinct connected components in the incompatibility graph G(M ), and assume first that C is non-trivial. C ′ might be a trivial connected component, i.e., consist of only a single node. For any pair of sites i ∈ C, i′ ∈ C ′ let (X, X) and (Y, Y ) be the respective bipartitions (of the rows of M ), associated with sites i and i′ . The two bipartitions cannot be identical, for otherwise sites i and i′ would have exactly the same incompatibilities and so be in the same connected component. Each of the four subsets X, X, Y, Y is called a “class” of the bipartition it is part of. Sites i and i′ are not incompatible, so one class of the i bipartition must strictly contain one class of the i′ bipartition, and the other class of the i′ bipartition must strictly contain the other class of the i bipartition. Without loss of generality, suppose X ⊃ Y and Y ⊃ X. We say that X is the “dominant” class of i, and X is the “dominated” class, with respect to the pair i, i′ . Similarly, Y is 9 the dominant class of i′ , and Y is the dominated class, with respect to the pair i, i′ . For example, in Figure 2, C = {1, 3, 4} and C ′ = {2, 5}, and with respect to the pair 1, 2, the set {a, c, e, f, g} is the dominant class, and {b, d} is the dominated class. Now consider the case that sites i and i′ are the sites of two trivial components C and C ′ . If i and i′ are not identical, then one class of i, say X, must strictly contain a class, say Y , of i′ , and so X and Y are again the well-defined dominant classes of i and i′ , with respect to i, i′ . If sites i and i′ are identical, and say X = Y , then we arbitrarily choose the pair X, Y or X, Y as the two dominant classes of i and i′ respectively. Lemma 1. Let i, i′ , C, C ′ , X, and Y be as above. Let j ′ be any site in C ′ , and let (Z, Z) be the bipartition associated with j ′ . Then, the dominant class of i with respect to the pair i, j ′ is the dominant class of i with respect to the pair i, i′ . Proof. The Lemma is vacuously true if C ′ is a trivial connected component, so assume C ′ is nontrivial. We need to show that either X ⊃ Z or X ⊃ Z. Consider a site k′ ∈ C ′ that is incompatible with i′ . Such a site k′ must exist since C ′ is connected. Let (W, W ) be the bipartition defined by site k′ . If X is not dominant with respect to i, k′ , then X is dominant with respect to i, k′ , and so either X ⊃ W or X ⊃ W . Suppose that X ⊃ W , so W ⊃ X. But then W ⊃ Y since X ⊃ Y , and so Y ∩ W = ∅, and i and k′ can’t be incompatible, which is a contradiction. Similarly, if X ⊃ W , then W ⊃ X, so W ⊃ Y , and W ∩ Y = ∅, a contradiction. So the dominant class, X, with respect to i, i′ is the dominant class with respect to i, k′ , where k′ is any site that is incompatible with i′ . The Lemma now follows by transitivity, because C ′ is a connected component, so from i′ it is possible to reach any j ′ ∈ C ′ by a series of incompatibility relations. Lemma 1 establishes that for any i ∈ C, one class of i is dominant with respect to all sites in C ′ , and symmetrically, for any i′ ∈ C ′ one class of i′ is dominant with respect to all sites in C. So, with respect to the (C, C ′ ) pair of connected components, each site in C ∪ C ′ has a well-defined dominant class, and a well-defined dominated class. Now return focus to the sequences in M and the sites in C and C ′ . For a site i ∈ C, the bipartition (X, X) is encoded with 0’s and 1’s, where all the rows in X have one character at site i and all the rows in X have the other character at site i. So, with respect to the (C, C ′ ) pair of connected components, and a specific set of sequences M , each site in C has a well-defined dominant character (either 0 or 1). For example, in Figure 3, the dominant character is 0 in all sites except 3, where the dominant character is 1. Let D[C, C ′ ] be the union of the rows in the dominated classes of C, with respect to (C, C ′ ). Similarly, let D[C ′ , C] be union of the rows in the dominated classes of C ′ , with respect to (C, C ′ ). For example, D[C, C ′ ] is {a, b, d} and D[C ′ , C] is {e, f, g} in Figure 3. Let M (C) and M (C ′ ) be the sequences in M , restricted to the sites in C and C ′ respectively. Then Lemma 1 implies 10 1 3 4 2 5 a 0 0 1 0 0 b 1 0 1 0 0 d 1 1 0 0 0 c 0 1 0 0 0 e 0 1 0 1 0 f 0 1 0 1 1 g 0 1 0 0 1 Figure 3: The sites in the two connected components from Figure 2. We denote the component with sites {1, 3, 4} as C, and the component with sites {2, 5} as C ′ . The dominant sequence for C is 010, and the dominant sequence for C ′ is 00. The rows in D[C, C ′ ] are {a, b, d}, and the rows in D[C ′ , C] are {e, f, g}. The rows and columns have been permuted from their natural order to collect together the sites in the two components, and the rows in D[C, C ′ ] and D[C ′ , C]. Note that row c is in neither D[C, C ′ ] nor D[C ′ , C], since row c has the dominant sequence in both its C and C ′ sides. Theorem 2. Every row in D[C, C ′ ] has the same sequence in M (C ′ ). In particular, in each row of D[C, C ′ ], every site i′ ∈ C ′ has the dominant character with respect to (C, C ′ ). Similarly, every row in D[C ′ , C] has the same sequence in M (C). In particular, in each row of D[C ′ , C], every site i ∈ C has the dominant character with respect to (C, C ′ ). Given Theorem 2, we can define the dominant sequence in M (C) with respect to (C, C ′ ) as the sequence in M (C) where each site has the dominant character with respect to (C, C ′ ). Similarly, we can define the dominant sequence in M (C ′ ) with respect to (C, C ′ ). Corollary 1. Let C and C ′ be two connected components of G(M ). There is no row in M which (w.r.t. (C, C ′ )) contains both a non-dominant sequence in M (C) and a non-dominant sequence in M (C ′ ). Figure 3 illustrates Lemma 1, Theorem 2 and Corollary 1. Note that a row can have both the dominant sequence in M (C) and the dominant sequence in M (C ′ ). Row c in Figure 3 is an example of this. Corollary 1 is a more refined version, and with a simpler proof, of a result (Theorem 3) first proved in Bafna and Bansal (2004). We develop here an observation that will be needed in Section 5.2. Consider two sites i ∈ C and i′ ∈ C ′ , where C and C ′ are two distinct non-trivial connected components of G(M ). By assumption, every column contains both a 0 and a 1. Also, any two columns in different non-trivial components must not be identical (or exact compliments of each other) or else they would both be 11 incompatible with the same sites in C and C ′ , and hence would be in the same component. So, there must be at least three distinct pairs of binary characters in columns i, i′ . But, since i and i′ are not incompatible, there cannot be four distinct pairs binary characters in those columns, so there must be exactly three distinct pairs of binary characters in columns i, i′ . Hence, if we add new sequences to M , creating the set M , and i and i′ are not incompatible in M , then any pair of binary characters in M in columns i, i′ , must already have been a pair of binary characters in columns i, i′ in M . Therefore, with respect to (C, C ′ ), the dominant characters in C and C ′ created from M are the same as the dominant characters of C and C ′ created from M . Extending this observation to all sites in C and C ′ we have: Lemma 2. Suppose C and C ′ are non-trivial components of G(M ). If we add new sequences to M which do not create any incompatibilities between sites in different components of G(M ), then with respect to (C, C ′ ), the dominant sequences of C and C ′ remain unchanged. 3.2 The super-characters of M and the new matrix B Lemma 1, Theorem 2 and Corollary 1 establish a structure that exists in M , imposed by the partition of the columns of M by the connected components of G(M ). We begin now to exploit that structure to prove the Decomposition Theorem. We define a new set of binary sequences B created from M and G(M ), and represent the set B as a matrix, as follows. Let C be a connected component of G(M ) and let M (C) be the sequences in M restricted to the sites in C. We call each distinct sequence in M (C) a super-character of M (associated with C). For every C, we create one column, s in B for each super-character S of M , where S ∈ M (C). We say that s originates from S and from C. Each such column in B encodes a bipartition of the rows of M where one side of the bipartition contains all the sequences in M that contain subsequence S in M (C), and the other side of the bipartition contains the remaining sequences. More specifically, and without loss of generality, in the new column we assign value 1 to each sequence in M which contains subsequence S in M (C), and assign value 0 to each sequence that does not. The new column defines a binary character derived from M and G(M ). Note that if C is a trivial connected-component, so it only contains one site, then B will have two columns derived from that one site, but those columns define the same bipartition. That will cause no problems, and one column can be removed for simplicity. As an example, Figure 4 shows the matrix B that is derived from M and G(M ) from Figure 2. We will use the characters of B to build a tree in order to prove Theorem 1. Lemma 3. No pair of characters of B are incompatible. Proof. Let p and q be distinct characters in B. If p and q originate from the same connected component C in G(M ), then by construction, no row can have a 1 in both columns p and q and therefore, characters p and q are not incompatible. 12 1 2 3 4 5 6 7 8 a 1 0 0 0 1 0 0 0 b 0 1 0 0 1 0 0 0 c 0 0 1 0 1 0 0 0 d 0 0 0 1 1 0 0 0 e 0 0 1 0 0 1 0 0 f 0 0 1 0 0 0 1 0 g 0 0 1 0 0 0 0 1 Figure 4: Matrix B derived from M and G(M ) from Figure 2. The super-characters of M associated with C are 001, 101, 010, 110, and the super-characters of M associated with C ′ are 00, 10, 11, 01. The columns (characters) of B are ordered to correspond to those ordered lists of super-characters of M . Now suppose p and q originate from two different connected components C and C ′ in G(M ). If p and q both originate from the non-dominant sequences (w.r.t. (C, C ′ )) of C and C ′ , then Corollary 1 guarantees that there is no row with 1, 1 in columns p and q, and so p and q cannot be incompatible. Symmetrically, if p and q both originate from dominant sequences (w.r.t. (C, C ′ )) of C and C ′ , then there is no row with 0, 0 in columns p and q. If p originates from the dominant sequence (w.r.t. C, C ′ ) of C and q originates from a non-dominant sequence (w.r.t. C, C ′ ) of C ′ , then there can be no 0, 1 in columns p and q. The remaining case is symmetric. Applying the Perfect Phylogeny Theorem, Lemma 3 establishes that there is a perfect phylogeny T where each character of B labels one edge in T , and each edge is labeled by one or more characters of B. Since each character of B originates from a super-character of M , it will sometimes also be useful to think of edge labels as being super-characters of M . 3.3 Inflating T The next step is to inflate nodes of T to blobs in order to create a fully-decomposed phylogenetic network N for M , proving Theorem 1. The removal of any edge e in T creates two connected subtrees, and we define a “split of edge e” as the bipartition of the leaves resulting from the removal of edge e from T . From the facts that all columns in B are distinct, and that every edge in T is labeled, it follows that all the splits defined by the edges in T are distinct. If e is labeled by character c of B, we define the “1-side” of e as the subtree of T − e that contains the leaves for rows in B that have value 1 for character c. The other side is called the “0-side” of the split. When no root sequence has been specified in advance, the root could be on either side of the split. 13 Lemma 4. Let C be an arbitrary component of G(M ). In T , there is a node vC such that all the edges labeled by characters of B that originate from connected component C, are incident with vC . That is, these edges form a star around a single central node vC . Further, vC is on the 0-side of each split defined by an edge labeled by a character of B that originates from C. Note however, that Lemma 4 does not assert that vC is only incident with edges labeled by characters that originate from C. Proof. First, the Lemma is trivially true if C is a trivial component since only one character originates from C and it labels only one edge. For any non-trivial connected component C, the submatrix M (C) contains at least four distinct sequences since there are must be at least one (incompatible) pair of sites in C where all four binary pairs appear in M (C). Therefore, there are at least four characters in B that originate from C. Let B(C) denote the columns of B restricted to the characters that originate from C. Consider a non-trivial connected component C and any three of characters of B that originate from C, and let e1 , e2 , e3 be the three edges in T labeled with those characters. Note that every row in B has value 1 in exactly one column of B(C), so every leaf of T is on the 1-side of exactly one edge labeled by a character from C. Hence, no leaf in T can be on the 1-side of two of the edges e1 , e2 , e3 . Now, consider the undirected tree created by ignoring the directions of the edges in T , but we will still refer to this undirected tree as T . If e1 and e2 are incident with each other, sharing a node v, then there must be another edge incident with node v, and hence there must be a leaf lv that is reachable from v without going through e1 or e2 . If this were not true, then e1 and e2 would define the same splits in T , which is not possible. If e1 and e2 are not incident with each other, then there is a unique shortest path P from an endpoint of e1 to an endpoint of e2 . Clearly, path P does not contain edge e1 or e2 . There must be a node v on P and a leaf lv that is reachable from v via a path that does not go through e1 or e2 . If this were not true, then again there would be two adjacent edges that define the same splits in T . Now we claim that node lv must be on the 0-side of both e1 and e2 . We have already established that it cannot be on the 1-side of both. However, suppose without loss of generality, that lv is on the 1-side of e1 and the 0-side of e2 . Then consider the endpoint u of e2 that is on the 1-side of e2 , and consider a leaf lu that is reachable from u without going through e2 . Leaf lu would be on the 1-side of both e1 and e2 , which is not possible. Hence the 1-sides of both e1 and e2 point “away” from each other. It also follows that path P cannot go through edge e3 . If it did, then some leaf on the 1-side of e3 would also be on the 1-side of e1 or e2 . So edges e1 and e2 are either incident with each other, or there is an edge e which is incident with e1 on path P , where e is not labeled by a character of B that originates from C. We will show that such an edge e cannot exist. Every internal edge in T is labeled by some character of 14 B, so suppose e exists and is labeled by a character that originates from a connected component C ′ . Let v be the common endpoint of e1 and e. As above, there must be a leaf lv that is reachable from v without going through either edge e or e1 , for otherwise e and e1 define the same split, which is not possible. Recall that each character of B and each split in T that originates from C or C ′ , corresponds to a sequence (super-character) in M (C) or M (C ′ ), and with respect to the pair (C, C ′ ), there is a dominant sequence S in M (C) and a dominant sequence S ′ in M (C ′ ). Let e(S) be the edge in T labeled by the character that originates from S, and let e(S ′ ) be the edge in T labeled by the character that originates from S ′ . Now e1 is either e(S) or not, and e is either e(S ′ ) or not, so we have four cases to consider. Case 1: Suppose e1 is e(S) and e is e(S ′ ). We know that l(v) is on the 0-side of e1 , so it must be on the 1-side of e by Corollary 1. But then, all leaves on the 0-side of e will be on the 0-side of both e and e1 , which contradicts Corollary 1. So e cannot exist in this case. The three other cases are similar and omitted, and the result is that e cannot exist and hence e1 and e2 are incident with each other. Since e1 and e2 were arbitrary edges labeled by characters that originated from C, every pair of edges labeled by characters that originate from C must be incident with each other. But in a tree, that is only possible if all those edges share exactly one endpoint, and so form a star around a single center. That endpoint is the claimed node vC . We also established that if there are two distinct edges labeled with characters that originate from C, then the 1-sides of these edges point away from each other. This holds for any pair of edges labeled with characters that originate from C, so vC is on the 0-side of every such edge. 3.4 Completion of the proof of the Decomposition Theorem To finish the proof of the Decomposition Theorem, we first ignore the direction of the edges of T and ignore which node is its root. Instead, we arbitrarily select a node vr of T to be the root and direct all the edges in T away from that chosen root. Let S be the label written at node vr (recall that a perfect phylogeny is a phylogenetic network and so each node is labeled). The label S will define the ancestral sequence for the phylogenetic network we will construct. Next, we need to inflate each node vC in T that is the central node of the star associated with the characters obtained from a non-trivial component C of M (G). Note that vr might also be a central-star node. We can identify the central-star node vC by the fact that for the non-trivial connected component C, all of the edges labeled by the characters that originate from C are incident with vC . Each such edge may also be labeled with characters that originate from another connected component or with a compatible character. Now, each central-star node vC , other than the root node, has exactly one edge directed into it; the character on it that originates from C must be the super-character S(C), defined as sequence S restricted to the sites in C. We call S(C) the “ancestral sequence” of vC . Similarly, if the root node is associated with the non-trivial component C, then S(C) is the 15 ancestral sequence of vr = vC . Now, any super-character in M that is associated with C can be derived from S(C) using at most one mutation per site, if an unlimited number of recombination nodes are allowed. This is true even if only single-crossovers are allowed, or if multiple-crossovers are permitted1 . So, each central-star node v can be inflated into a blob bv containing one node labeled by each super-character in M (C), and other nodes if needed. Then for each super-character associated with C, we connect the node in bv labeled with the character c in B which originates from that super-character, to the edge incident with vC that is labeled by character c. After inflating each central-star node in T , the end result is a phylogenetic network N where each blob contains all and only the sites from one connected component of G(M ). Every compatible site labels a tree edge of N . The full ancestral sequence for N is specified by the ancestral sequences defined above, since each site in M is in the ancestral sequence for exactly one central-star node. This completes the proof of Theorem 1. Note that the existence and the topology of T depends only on the partition of the nodes of G(M ) into connected components, and hence does not depend on the biological causes of the incompatibilities in M . In particular, it does not depend on whether or not multiple-crossovers are allowed at recombination nodes. However, the networks inside each blob of N do depend on which biological events (such as single versus multiple-crossover recombination, or recurrent mutation, etc.) occur there. 3.5 Component-wise optimal decomposition We can now state a fact that follows easily from Theorem 1 and will be needed in Section 5.2. Recall that S(C) is the sequence S restricted to the sites in component C. Let RS(C) (M (C)) be the minimum number of recombination nodes (events) needed to generate the sequences M (C) in a phylogenetic network with ancestral sequence S(C), when multiple-crossover recombination is 1 allowed. Similarly, let RS(C) (M (C)) be the minimum number of recombination nodes needed when only single-crossover recombination is allowed. The proof of Theorem 1, when applied to the set of sequences M + S, where S is a required ancestral sequence (which might not be in M ), and the root of T is selected to be the node labeled S, establishes the following: Theorem 3. For any sequence S, there is a fully-decomposed phylogenetic network for M with anP cestral sequence S, containing exactly C∈GS (M ) RS(C) (M (C)) recombination nodes when multipleP 1 crossover recombination is allowed, and containing exactly C∈GS (M ) RS(C) (M (C)) recombination nodes when only single-crossover recombination is allowed. 1 It is also true that the sequences can be derived from S(C) without recombination if an unlimited number of recurrent or back mutation events are allowed. Recurrent-mutation occurs when the state of a site mutates from its ancestral state more than once in an evolutionary history. Back-mutation occurs when the state mutates from the derived state back to the ancestral state. 16 3.6 Programs The proof of the existence of T can be converted into an efficient, constructive method2 for finding T from any input M . The program galledtree.pl, available at wwwcsif.cs.ucdavis.edu/~gusfield/ takes in a set of sequences M and tries to build a galled-tree for M using single-crossover recombination. If it succeeds, then it has produced a complete phylogenetic network for M where each blob is a single cycle, and the cycles are node disjoint. Hence, the program produces a fully-decomposed phylogenetic network for M . If the program determines that there is no galled-tree for M , then it outputs the tree T for M . The running time for the program is O(nm2 + m3 ), but the time used to build T is just O(nm2 ). 3.7 Uniqueness of T In a network N , we say that a node v in blob b is an external node if there is an edge from v to some node outside of b. We say a phylogenetic network N is efficient if N does not contain two external nodes with the same node labels, and for each external node v (in a blob b), there is exactly one edge from v to a node off of b. Clearly, if N is not efficient, it can be easily modified to become efficient without increasing the number of recombinations used. Also, if N is fully-decomposed, then the efficient network derived from it is also fully-decomposed. So for any M , there is a efficient fully-decomposed phylogenetic network for M . Theorem 4. If N is any efficient fully-decomposed network for M , and T ′ is created by contracting each blob of N to a single node, then after the directed edges in T ′ are made undirected, the resulting tree is necessarily the tree T defined in the proof of Theorem 1. Proof. Consider a blob bC in N associated with the component C in G(M ); let v be an external node in bC incident with the edge e = (v, v ′ ), directed from v to a node v ′ off of bC . In the subnetwork reached from v ′ , all of the sites in C have the same state they have at v ′ . This is because no site in C mutates outside of bC , and (by induction on the maximum path length from from v ′ to a recombination node x) at any recombination node x in the subtree, the state of any site in C is the same at x as it is in both of the parent sequences of x. Now every super-character in M (C) is a subsequence of some sequence of M , and the super-characters include all the distinct sequences in M (C), therefore (when restricted to the sites in C) each external node on bC is labeled with a distinct super-character derived from C, and each super-character derived from C labels exactly one external node of bC . Consider again node v, labeled with super-character Sv (C) from M (C), and consider the edge e = (v, v ′ ). If we remove edge e from N , two disconnected subnetworks 2 It may seem that T can be obtained by simply building a perfect phylogeny T using one site from each connected component of G(M ). However, this is wrong, because the edge structure of T may be very different from that of T . For example, in the tree T created from sites 1 and 2 in Figure 1, the two edges labeled with those sites are adjacent, while they are not adjacent in T . 17 are created. In one subnetwork, every leaf label contains Sv (C), and in the other subnetwork, no leaf label contains Sv (C). The leaf labels are the sequences of M , and so the removal of e from N creates a bi-partition of the sequences in M . Clearly then, edge e defines exactly the same bi-partition of the sequences in M that is defined by the split K in T created by removing from T the edge labeled by the character that derives from super-character Sv (C). Now edge e is also contained in tree T ′ , and its removal from T ′ creates the same bi-partition that it does in N , and so the removal of e from T ′ creates the split K. Therefore, the splits in T ′ are exactly the same as the splits in T . By Theorem 3.1.4 (the Splits Equivalence Theorem) in Semple and Steel (2003), the splits of an undirected tree uniquely define the tree, and so the undirected trees, T ′ and T are identical. In other words, T is the invariant underlying structure of any efficient fully-decomposed phylogenetic network for M , and this is true regardless of the biological causes of the incompatibilities in M . We will need this fact in Section 5.1. 3.8 Alternate Proofs of Theorem 1 Theorem 1 was first stated and proved in Gusfield and Bansal (2005). It has been pointed out (Steel, 2005) that Theorem 1 can also be proven by using Buneman graphs (Semple and Steel, 2003), and the details of this approach have been verified (Wu, 2005). However, the proof here is more direct and establishes a polynomial time algorithm to construct T . In contrast, it takes exponential time in worst case to build a Buneman graph from M , and so that is not an efficient constructive approach to building T from M . Another direct proof of Theorem 1 that is shorter than the one presented here, but does not establish or emphasize the role of super-characters, appears in a 2006 preprint by Bafna and Bansal (2006b). Subsequent to the development of Theorem 1, a related decomposition theorem was developed (Huson et al., 2005) where the input to the problem is not a set of sequences, but a set of trees that must be subtrees in a constructed phylogenetic network. 4 4.1 Broader Applications Alternate causes of incompatibility In the proof of the Decomposition Theorem, there was no mention of recombination until close to the end of the proof, when discussing the inflation of the central-star nodes. Therefore, all the results proven to that point hold for any incompatible characters of M , independent of the biological cause of the incompatibilities. Also, the proof of uniqueness did not depend on recombination. Hence, the existence, structure and uniqueness of T holds for any M and any biological cause of incompatible characters. In this way, we have established that the super-characters of M , 18 defined by the connected components of G(M ), generalize standard evolutionary characters (used in phylogenetic trees), and play a role in the theory of phylogenetic networks that tree characters play in the theory of phylogenetic trees. Moreover, if the biological operations that caused the incompatibilities in M allow any set of sequences M (C) to be derived from an arbitrary ancestral sequence, then the Decomposition Theorem holds in that biological context. Another way to see the generality of the results proven here is to note that multiple crossover recombination can be considered as a mathematical operation on binary sequences rather than a biological event, and can be used to model biological events that don’t explicitly involve recombination. For example, an occurrence of back-mutation or recurrent-mutation of a site i in a sequence S can be modeled as a two-crossover recombination between S and some appropriate sequence, in the intervals i − 1, i and i, i + 1. Modeling back and recurrent mutations in this way explicitly creates recombination cycles and blobs, and shows explicitly how Theorem 1 applies when back-mutation and/or recurrent mutation cause incompatibilities. Generally, when back or recurrent mutation is the cause of incompatibility, we seek an evolutionary tree that derives a given set of sequences using as few back or recurrent mutations as possible. Such a tree is called a “maximum parsimony tree” and it is a solution to the maximum parsimony problem (Felsenstein, 2004; Semple and Steel, 2003). 4.2 What is the “most tree-like” phylogenetic network? When a set of sequences M fails the four-gametes test and hence cannot be generated on a perfect phylogeny, one would still like to derive the sequences on a phylogenetic network that is the “most tree-like”. There is no accepted definition of “treeness”, and under many natural definitions, the problem of finding the most tree-like network would likely be computationally difficult. In this section, we introduce a measure of treeness and relate it to Theorem 1. Recall that it is assumed that in network N no two blobs share a node. We can also assume that N has no node with in and out-degrees that are both one. Then if each blob is contracted to a single node, the number of edges in the resulting directed tree measures the “treeness” of N . In other words, the “treeness” of N is measured by the size of the tree in the underlying tree structure of N . For example, if all the sites in M are in a single blob in N , then N is less tree-like than a network where the sites are distributed between several blobs, connected by several edges in a tree structure. With the above definition of “treeness”, it is clear that a phylogenetic network N is “the most tree-like” if and only if T is the resulting undirected tree, after the blobs of N are contracted, and all the edges are made undirected. This follows from Theorem 1 and the fact (Gusfield et al., 2004b) that all the sites in a single non-trivial connected component of G(M ) must be together in a single blob in any phylogenetic network. 19 This definition of “most tree-like” is somewhat crude because it does not consider any details inside of a blob, but it has the advantage of being easy to compute and allowing a clear identification of the most tree-like networks. Further, it seems reasonable that any other natural definition of “most tree-like” would identify a subset of the networks identified by the definition considered here. 5 The Full-Decomposition Optimality Conjectures Clearly, the task of constructing networks is simplified (both algorithmically and conceptually) if we restrict ourselves to fully-decomposed networks, and we have seen above that there always is a fully-decomposed network for any M . But if our goal is to minimize the number of recombination nodes (events) is it effective to restrict ourselves in this way? The answer is yes when M can be derived on a galled-tree, even if the galled-tree must only use single-crossover recombinations, but any competing phylogenetic network is allowed to use multiple-crossovers Gusfield et al. (2004b); Gusfield (2005a). Now, we address this issue in general. 5.1 Introduction to the conjectures For a set of sequences M , recall that R(M ) is the minimum number of recombination nodes in any phylogenetic network for M when multiple-crossover recombination is allowed; R1 (M ) is the minimum number when only single-crossover recombination is allowed. We similarly define RS (M ) and RS1 (M ) as the minimum number of recombination nodes in any phylogenetic network for M with ancestral sequence S when, respectively, multiple-crossovers are allowed and when only singlecrossovers are allowed. A network that has the minimum possible number of recombination nodes (and conforms to the chosen crossover model and uses the required ancestral sequence, if any) is called “optimal”. We say that a set of binary sequences M are “min decomposable” or “min-1 decomposable” respectively, if there is a fully-decomposed phylogenetic network for M , allowing multiple-crossovers or only allowing single-crossovers respectively, using exactly R(M ) or R1 (M ) recombination nodes respectively. Similarly, a set of binary sequences M are “S-min decomposable” or “S-min-1 decomposable” respectively, if there is a fully-decomposed phylogenetic network for M with ancestral sequence S, allowing multiple-crossovers or only allowing single-crossovers respectively, using exactly RS (M ) or RS1 (M ) recombination nodes respectively. In Gusfield and Bansal (2005) we stated the Full-Decomposition Optimality Conjecture, which we now state more precisely as: Unrooted Full-Decomposition Optimality Conjectures: Every set of binary sequences is min decomposable and min-1 decomposable. 20 11110000 11000000 11100000 00110000 01110000 00001111 00001100 00001110 00000011 00000111 Figure 5: Example showing that R(M ) > P C∈G(M ) R(M (C)). Rooted Full-Decomposition Optimality Conjectures: Every set of binary sequences is S-min decomposable and S-min-1 decomposable for any ancestral sequence S. A related version of the unrooted conjecture was also recently stated (as Conjecture 1) in (Huson and Klopper, 2007). These (four) conjectures say that there is no loss of optimality in restricting attention to fully-decomposed phylogenetic networks. Note that the rooted conjectures imply the original unrooted conjectures, by letting S be the ancestral sequence that appears in the optimal phylogenetic network using R(M ) (respectively R1 (M )) recombination nodes. Hence, we will later restrict attention to the rooted versions of the conjecture when proving positive results, and restrict attention to the unrooted versions of the conjectures when proving negative results. P The rooted conjectures are equivalent to the conjectures that RS (M ) = C∈GS (M ) RS(C) (M (C)) P 1 and RS1 (M ) = C∈GS (M ) RS(C) (M (C)). However the unrooted conjectures are not equivalent P P to the stronger statements that R(M ) = C∈G(M ) R(M (C)) and R1 (M ) = C∈G(M ) R1 (M (C)). The reason is that the optimal solutions for the separate components may choose different ancestral sequences that cannot be combined into a single network. For example, consider the set of sequences M shown in Figure 5. The incompatible pairs are (1, 3), (1, 4), (2, 3), (2, 4), (5, 7), (5, 8), (6, 7), (6, 8), so there are two connected components in G(M ): component C1 containing the first four sites of M , and component C2 containing the last four sites of M . We can establish, using the program beagle (Lyngso et al., 2005), that R1 (M (C1 )) = R1 (M (C2 )) = 1, but R1 (M ) = 3. In Gusfield and Bansal (2005), we also stated that when the recombination events only model recurrent and back mutations (as detailed in Section 4.1), then every M is min decomposable. We will establish a specialization of this fact with a direct proof below; the general statement follows from results on the maximum parsimony problem using median networks, as established by Bandelt 21 et al. (1995) (see also Semple and Steel 2003). Because of this result, when incompatibilities are caused by recurrent and/or back mutation, one can solve the maximum parsimony problem exactly, for each set M (C) separately, and then connect the trees as specified by T . Since the maximum parsimony problem is NP-hard and the only known methods to solve it take exponential time in worst-case, decomposing the problem into several smaller problems may allow larger problems to be solved in practice. In general, whenever M is min decomposable (or S-min, min-1, or S-min-1 decomposable) we can follow a similar approach to finding phylogenetic networks that minimize the number of recombination nodes. This is easiest to explain when M is S-min or S-min-1 decomposable. Then, T is constructed from M + S, S is chosen as the root sequence of T , and for each C ∈ GS (M ), S(C) is the ancestral sequence for the blob associated with component C. Hence we can solve a single (rooted) problem for each set M (C) and then connect the trees as specified by T . More generally, even if M is not S-min or S-min-1 decomposable, this method will find the minimum number of recombination nodes, denoted FS (M ) and FS1 (M ), used in any fully decomposed network for M with ancestral sequence S, respectively allowing multiple-crossover or allowing only single-crossovers. The approach is a bit more involved when no ancestral sequence is known in advance. Suppose we choose an interior node v in T to be the root of T . That selection determines the ancestral sequence for each blob in the phylogenetic network constructed from T , and induces a rooted problem for each connected component, except for the blob bv and component Cv , associated with node v. However, unlike the situation in the proof of Theorem 1 where the label on v can be used for the ancestral sequence of bv , we now have to be more careful in selecting the ancestral sequence for bv . To determine the ancestral sequence for bv , and hence for the entire phylogenetic network (when v is chosen as root of T ), we determine the minimum number of recombination nodes needed to derive M (Cv ) (obeying the chosen crossover model) allowing any possible ancestral sequence. Then, the best ancestral sequence for the phylogenetic network is found by repeating the above computation for each interior node v in T , choosing as the root the node yielding the minimum number of recombinations. The procedure can obviously be sped up by taking advantage of repeated computations. Even if M is not min-decomposable or min-1 decomposable, the procedure above will correctly compute F (M ) or F 1 (M ), which are defined as the minimum number of recombination nodes used in any fully decomposed network for M , allowing multiple crossover or only single-crossover recombination, respectively. The correctness of the procedure follows from Theorem 4 and the fact that placing the root of T at a leaf or on the interior of an edge will not reduce the number of required recombinations in the resulting network. When the appropriate conjecture holds for an input M , we can also compute lower bounds on R(M ), R1 (M ), RS (M ) and RS1 (M ) by computing bounds separately for each M (C), adding these bounds together for a valid overall lower bound. This is correct no matter what lower bound 22 method is used. 5.2 Natural Sufficient Conditions for M to be S-min or S-min-1 decomposable Recall that when M is S-min (respectively S-min-1) decomposable for all S, then M is mindecomposable (respectively min-1 decomposable), and hence we focus here on the rooted versions of the Full-Decomposition Optimality Conjecture. We will prove several sufficient conditions that guarantee that rooted conjectures hold, and therefore also establish combinatorial properties that counterexamples to the conjectures must possess. In a phylogenetic network N for M , LN denotes the set of node labels used in N . By definition, M ⊆ LN . We say that a node is “visible” if it is labeled with a sequence in the input M . So, when LN = M (in the unrooted case) or LN = M + S (in the rooted case), every node in N is visible. The main results we will prove are: Theorem 5. Let N be an optimal phylogenetic network for M with ancestral sequence S, allowing multiple-crossover recombination (respectively allowing only single-crossover recombination), using RS (M ) (respectively RS1 (M )) recombination nodes. Let GS (LN ) be the conflict graph for sequences LN with respect to S. Then M is S-min (respectively S-min-1) decomposable if GS (M ) and GS (LN ) have the same number of connected components. Theorem 5 says that M will be S-min (respectively S-min-1) decomposable unless every optimal phylogenetic network for M , with ancestral sequence S, allowing multiple-crossovers (respectively only allowing single-crossovers) has nodes whose labels, when added to M , causes significant changes to GS (M ). “Significant” here means a change in the number of connected components. Corollary 2. Let N be an optimal phylogenetic network for M with ancestral sequence S, allowing multiple-crossover recombination (respectively allowing only single-crossover recombination). Then M is S-min (respectively S-min-1) decomposable, if every node in N is visible. Since network N in Corollary 2 must be an optimal network, and in practice it may be hard to know that a network is optimal, Corollary 2 may be hard to apply in practice. However, we can prove the following more applicable, but somewhat weaker result: Theorem 6. Let N be a phylogenetic network (which might not be optimal) for M with ancestral sequence S. Then M is S-min or S-min-1 decomposable (depending on the type of crossovers used in N ), if every node in N is visible, and every edge in N is labeled by at most one site. We now begin to prove Theorem 5, Corollary 2, and Theorem 6. Recall that for any sequence Z and connected component C, Z(C) is Z restricted to the sites in C. Let h be a recombinant sequence created by the recombination of parent sequences h1 and h2 . A recombination event (or node) in a network is defined to be a 0-component recombination if for every component C ∈ GS (M ), h(C) = h1 (C) or h(C) = h2 (C). Similarly, a recombination 23 event (or node) is defined to be a 1-component recombination if there is exactly one component C ∈ GS (M ) such that h(C) 6= h1 (C) and h(C) 6= h2 (C). In other words, in a 0-component recombination, the recombinant sequence h, restricted to the sites of any single connected component of GS (M ), is identical to at least one of its parent sequences. Note however that in a 0-component recombination, the full sequence h can differ (and will in an optimal network) from both of its parent sequences. In a 1-component recombination, the above identity property fails for exactly one connected component of GS (M ). Note that the definitions of 0-component and 1-component recombination are independent of any constraints on the number of crossovers allowed. One could define a general notion of k-component recombination, but the above two definitions are sufficient for the following central lemma. Lemma 5. Let N be an arbitrary phylogenetic network for M with ancestral sequence S, and suppose that every recombination in N is either a 0 or 1-component recombination. Let R(N ) denote the number of recombination nodes in N . If multiple-crossover recombinations are allowed in N , then R(N ) ≥ X RS(C) (M (C)), C∈GS (M ) and if only single-crossover recombinations are allowed, then X 1 R(N ) ≥ RS(C) (M (C)). C∈GS (M ) Proof. We first define a map f from the 1-component recombination nodes in N to the connected components of GS (M ). Let v be any 1-component recombination node in N , and let h, h1 , h2 be the recombinant and parent sequences respectively at v. Since there is exactly one connected component C ∗ ∈ GS (M ) for which h(C ∗ ) 6= h1 (C ∗ ) and h(C ∗ ) 6= h2 (C ∗ ), we define the map f (v) of node v to component C ∗ . Now, consider any recombination node v ′ in N that is not mapped to component C ∗ . Restricted to the sites in C ∗ , the recombinant sequence at node v ′ must be identical to the sequence of one of its parents, because either v ′ is a 0-component recombination or a 1-component recombination node which is mapped to a component other than C ∗ . Therefore, if we remove all such recombination nodes from N , and also remove any site that is not in C ∗ , the resulting network is a phylogenetic network NC ∗ that derives the set of sequences M (C ∗ ). Network NC ∗ has ancestral sequence S(C ∗ ) and every recombination node in NC ∗ is a 1-component recombination that maps to C ∗ . By definition of optimality, the number of recombination nodes in NC ∗ , denoted R(NC ∗ ), is at least RS(C ∗ ) (M (C ∗ )). Now, since each recombination node in N is mapped to at most one component, R(N ) ≥ X R(NC ) ≥ C∈GS (M) X RS(C) (M(C)). C∈GS (M) 24 Theorem 7. Let N be a phylogenetic network for M with ancestral sequence S such that every recombination node in N is either a 0 or 1-component recombination node, relative to GS (M ). Then there is fully-decomposed phylogenetic network N ′ for M , with ancestral sequence S, using at most R(N ) recombination nodes, where both N and N ′ allow the same type of recombinations (single-crossover only or multiple crossover). Proof. This follows directly from Lemma 5 and Theorem 3. Corollary 3. When recombinations only model back or recurrent mutations (as detailed in Section 4.1), then M is S-min decomposable for any sequence S, and the maximum parsimony problem for M can be solved by separately solving a maximum parsimony problem for the sites from each connected component of GS (M ). Proof. Consider the tree T (M ) with ancestral sequence S that solves the maximum parsimony problem when S is the required ancestral sequence and back and recurrent mutations are allowed. Let N be the phylogenetic network that implements the same derivation from S, using recombinations (as detailed in Section 4.1) to model the back or recurrent mutations. Each recombination event is a two-crossover recombination where the two crossovers occur just before and just after a single site i in a connected component C ∈ GS (M ). Therefore, for every connected component C ′ 6= C, the recombinant sequence is identical to one of its parent sequences at the sites in C ′ , and so the recombination event is either a 0-component or a 1-component recombination, and Theorem 7 applies. 5.2.1 Spatial disjointness is a sufficient condition Let {1, 2 . . . m} denote the given ordering of the sites of M , and let S by a given sequence. M is said to be “spatially disjoint” if for every connected component C of GS (M ), the sites in C form a contiguous interval in the ordered set of sites. Theorem 8. Given S, if M is spatially disjoint, then M is S-min-1 decomposable. Proof. Let N be phylogenetic network N for M , with ancestral sequence S, using R1 (M + S) single-crossover recombinations. Since M is spatially disjoint, a single-crossover falls either within the interval of sites for a single connected component of GS (M ), or between two such intervals. In the former case, the recombination must be either a 0 or a 1-component recombination, and in the latter case, the recombination must be a 0-component recombination. The theorem then follows from Theorem 7. Note that Theorem 8 is proven only for the case of single-crossover recombination. 25 5.2.2 Component respect is a sufficient condition Recall that for a phylogenetic network N , LN is the set of sequences that label the nodes of N . If the addition of the sequences LN − M to M + S does not create any incompatibilities between sites in different components of GS (M ), then we say that LN respects (the component structure of) GS (M ). Since M + S ⊆ LN , any incompatible pair in M + S is incompatible in LN , so LN respects GS (M ) if and only if GS (LN ) and GS (M ) have the same number of components, although they need not be identical graphs. Also, if all nodes in N are visible, then LN = M + S, and so LN trivially respects GS (M ). Theorem 9. Let N be a phylogenetic network for M with ancestral sequence S, and suppose that LN respects GS (M ). Then there is fully-decomposed phylogenetic network N ′ for M , with ancestral sequence S, using at most R(N ) recombination nodes, where N and N ′ allow the same type of recombinations (single-crossover only or multiple crossover). Proof. Consider a recombination in N between sequences h1 and h2 resulting in recombinant sequence h. We will show that this recombination is a 0-component or a 1-component recombination with respect to the components of GS (M ). If it is not a 0-component or a 1-component recombination, then there must be two connected components Ca and Cb in GS (M ) such that h(Ca ) 6= h1 (Ca ), h(Ca ) 6= h2 (Ca ), h(Cb ) 6= h1 (Cb ), and h(Cb ) 6= h2 (Cb ). We will show that no such pair of connected components exists. Consider a trivial connected component C in GS (M ). Since C consists of just one site, if h1 (C) 6= h2 (C) then h(C) = h1 (C) or h(C) = h2 (C), and if h1 (C) = h2 (C) then h(C) = h1 (C). In either case, C can be neither Ca nor Cb . So if Ca and Cb exist, they must both be non-trivial connected components in GS (M ). Consider two non-trivial connected components C and C ′ in GS (M ), and let dC be the dominant sequence in M (C) with respect to (C, C ′ ), and let dC ′ be the dominant sequence in M (C ′ ) with respect to (C ′ , C). Now LN respects GS (M ), so by Lemma 2, Corollary 1 applies to the sequences LN , and hence either h(C) = dC , or h(C ′ ) = dC ′ , or both. We will examine the case that h(C) = dC (the other case is symmetric and omitted). If either h1 (C) = dC or h2 (C) = dC , then C is neither Ca nor Cb . Conversely, if neither h1 (C) nor h2 (C) is dC , then by Corollary 1, h1 (C ′ ) = h2 (C ′ ) = dC ′ , and h(C ′ ) = dC ′ no matter where any crossovers occur in the recombination of h1 and h2 . In that case, C ′ is neither Ca nor Cb . Hence from the assumptions that LN respects GS (M ) and that C and C ′ are non-trivial components of GS (M ), we have established that the pair C, C ′ cannot be the (unordered) pair Ca , Cb . Therefore, the pair Ca , Cb cannot exist, and so the recombination must be a 0-component or a 1-component recombination with respect to GS (M ). The theorem then follows by applying Theorem 7. Theorem 5 and Corollary 2 follow immediately. Corollary 2 was proven with a more complex proof in Gusfield (2005b). One application of Theorem 5 is to the recently introduced non26 degenerate galled-networks (Huson and Klopper, 2007). It is easy to prove that for any nondegenerate galled-network N for M , a pair of sites is incompatible in LN if and only if it is incompatible in M , and therefore G(LN ) and G(M ) are identical. 5.2.3 The tight haplotype bound is a sufficient condition Recall that M is assumed to have no identical rows. The following lower bound on R1 (M ), called the haplotype bound and denoted H(M ), was developed by Myers and Griffiths (2003): H(M ) equals the number of rows of M , minus the number of distinct columns of M , minus one. Although originally proved for single-crossover recombination, the same proof establishes that H(M ) is also a lower bound on R(M ). The haplotype bound is computed on M + S when an ancestral sequence S is specified. Let N be an optimal phylogenetic network for M , with ancestral sequence S. From the proof given in Myers and Griffiths (2003) (or see Song et al. 2005) that H(M + S) is a lower bound on R1 (M + S), it follows that H(M + S) = R(N ) = R∞ (M + S) (or R(M + S)) only when every node in N is labeled by a sequence in M + S. Hence, when the haplotype bound is tight for the appropriate crossover model, every node in N must be visible and we can apply Corollary 2 to obtain: Theorem 10. For a set of sequences M and a specified ancestral sequence S, if H(M + S) = R1 (M + S) then M is S-min-1 decomposable, and if H(M + S) = R(M + S) then M is S-min decomposable. We can also now provide the Proof of Theorem 6: First, we can modify N , while maintaining the conditions of the theorem, so that N does not contain two recombination nodes labeled by the same sequence, and for any non-recombination node v with a directed, labeled (by a site) edge into v, the sequence labeling v does not label any other node. Then, since every node in N is visible and each edge is labeled by at most one site, R(N ) must be exactly the number of sequences in M + S, minus the number of labeled edges in N , minus 1. But, the number of labeled edges in N is greater than or equal to the number of distinct sites in M , so R(N ) ≤ H(M + S), and R(N ) = H(M + S). Therefore, H(M + S) = R1 (M + S) if N only uses single-crossover recombination, and H(M + S) = R(M + S) if N uses multiple-crossover recombination. Theorem 6 then follows from Theorem 10. 5.3 The Full-Decomposition Optimality Conjectures are False In this section we show that the Unrooted Full-Decomposition Optimality Conjectures are false, which also implies that the rooted versions are false. This also establishes that Conjecture 1, recently stated in Huson and Klopper (2007), is false. We also show that there is no bound on the deviation between R1 (M ) and F 1 (M ) (the minimum number of single-crossover recombination nodes needed 27 2 66 M1 = 666 4 001000 011010 010010 000100 100101 100001 3 77 77 75 2 3 C1 5 1 4 6 C2 G(M1 ) Figure 6: A 6 by 6 binary matrix M and its incompatibility graph G(M ), consisting of two connected components C1 and C2 . 3 000000 4 5 1 S P 4 2 P 001000 6 3 S P 011010 010010 5 S 100001 100101 000100 Figure 7: A phylogenetic network N for the sequences in Figure 6. Closed and open circles denote mutation and recombination events, respectively. in a fully-decomposed network for M ); but in the simulations we conducted, counterexamples are rare, and deviations are small. 5.3.1 The single-crossover case Consider the 6 by 6 matrix M shown in Figure 6. Its incompatibility graph G(M ) consists of two connected components C1 and C2 , shown on the right hand side of Figure 6. It is easy to see that R1 (M (C1 )) ≥ 2 and R1 (M (C2 )) ≥ 2. For example, the HK lower bound (Hudson and Kaplan, 1985) on R1 (M (C1 ) is 2 as is the HK lower bound on R1 (M (C2 )). (In fact, R1 (M (C1 )) = R1 (M (C2 )) = 2, but this is not needed.) Certainly then, F 1 (M ) ≥ 4. However, R1 (M ) ≤ 3 (as shown in Figure 7), so F 1 (M ) > R1 (M ), showing that the single-crossover case of the Unrooted Full-Decomposition Optimality Conjecture is false. Note that in the example in Figure 6, every pair of columns already contains the binary pair 0,0, so if we add the ancestral sequence S = 000000 to M , no additional incompatible pairs would be created and so GS (M ) = G(M ). Then, since RS1 (M (Ck )) ≥ R1 (M (Ck )) for k = 1, 2, the example also explicitly shows that the single-crossover case of the rooted version of the Full-Decomposition 28 A B C D E F G H I J K L M = = = = = = = = = = = = = 00000 01000 11000 11100 01010 01011 10111 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 01000 11000 11100 01010 01011 10111 Figure 8: Counterexample to the unrooted multiple-crossover version of the Full-Decomposition Optimality Conjecture. The incompatibility graph contains two connected components. One component contains sites 1 through 5, and the other contains sites 6 through 10. Optimality Conjecture is false. Note also, that the first recombination creates the sequence 100010; when it is added to M , sites 1 and 5 become incompatible, changing the number of connected components in GS (M ) as required by Theorem 9. This counterexample is also consistent with Theorems 8 and 10, since the sites of the two components are not spatially disjoint, and also the haplotype of one is not tight. 5.3.2 The multiple-crossover case Figure 8 shows a set of sequences M whose incompatibility graph (not shown) contains two connected components C1 and C2 . Any phylogenetic network for M (C1 ) that uses multiple-crossover recombinations, requires at least two recombination nodes. This was verified by running program multicross.pl, available at http://wwwcsif.cs.ucdavis.edu/ gusfield/ and described in Gusfield (2005a). Program multicross.pl determines whether or not a set of sequences can be generated on a galled-tree using multiple-crossover recombination; the program determined that M (C1 ) cannot be so generated. Since a network with one recombination node is a galled-tree, the fact that M (C1 ) cannot be generated on a galled tree using multiple-crossover recombination (and it can’t be generated with zero recombinations) proves that at least two multiple-crossover recombinations are required. Similarly, we used multicross.pl to verify that any network that generates M (C2 ) requires at least two multiple-crossover recombinations. It follows that for the sequences M given in Figure 8, F (M ) ≥ 4. Figure 9 shows a phylogenetic network that generates M with only three recombination nodes, using multiple-crossover recombination. Hence, the unrooted, multiple-crossover version of the Full-Decomposition Optimality Conjecture is false. 29 2 7 1 3 x 6 x: 1110000000 ⊗ 0000011100 → 1010010100 8 y: 0101100000 ⊗ 1010010100 → 1011100000 z: 0000001011 ⊗ 1010010100 → 0000010111 9 4 5 y z 10 A BE F G C D J I M L KH Figure 9: A phylogenetic network, with all-zero ancestral sequence, generating the sequences in Figure 8, allowing multiple-crossover recombination. Filled circles denote mutations and open circles denote the recombination nodes x, y, z. The parental and recombinant sequences at those three nodes are shown in detail. The symbol ⊗ denotes recombination. Note the recombination event at node y uses only a single-crossover, while multiple-crossovers are used at nodes x and z. 5.3.3 Unbounded Deviation Recall that F 1 (M ) denotes the minimum number of single-crossover recombination nodes used in any fully decomposed phylogenetic network for M . Theorem 11. For any positive integer d, there exists a set of sequences M such that F 1 (M ) − R1 (M ) ≥ d. P 1 Proof. Clearly, F 1 (M ) ≥ C∈G(M ) R (M (C)), so it suffices to show that there exists a set of P sequences M such that [ C∈G(M ) R1 (M (C))] − R1 (M ) ≥ d. Let Md be the 6 by 6d matrix obtained by concatenating d copies of M from Figure 6 as follows: Md = In other words, Md contains d repetitions    0   1        1    x =  , y =    0       0  0 M M ··· M of uxyvxu, where    0 1    1   0     0   0  , u =  ,  0  0      1  0  1 0 .      v=    0 0 0 1 1 0      .    There are two connected components, C1 and C2 , in G(Md ), the first only containing columns of type x or y, and the second only containing columns of type u or v. It is straightforward to show 30 1 3 2 5 Tux 6 4 1 2 Ty 53 6 4 1 3 62 Tv 5 4 Figure 10: Three leaf-labeled time-ordered trees with 6 leaves. that R1 (Md (C1 )) = R1 (Md (C2 )) = 2d. Therefore, R1 (Md (C1 )) + R1 (Md (C1 )) = 4d. We next show that R1 (Md ) ≤ 3d, which will prove the theorem. A rooted tree T is said to be compatible with a column i in M , if there is some edge in T whose removal creates one subtree with leaves labeled by all the rows with state 0 in column i, and one subtree with leaves labeled by all the rows with state 1 in column i. For example, tree Ty in Figure 10 is compatible with every column of type y in Md . A rooted tree T is said to be time-ordered if each internal node v in T can be assigned a numerical “time” tv , so that the times strictly increase along any directed path from the root. In Figure 10, the times of the nodes are represented by the heights of the nodes, with higher nodes representing smaller (earlier) times. Given a time-ordered tree T , a subtree-prune-and-regraft (SPR) operation removes (prunes) one edge e in T , and reconnects (regrafts) the subtree τ below e to some point in T , using a new edge e′ to connect the root of τ to T . In an ordered-SPR operation, the times assigned to all the nodes in τ must remain the same, and the resulting tree must also be time-ordered. It follows that the point where edge e′ connects to T must be earlier in time than the time assigned to the root of τ . See Song (2006a) for a more detailed definition of an ordered-SPR operation. Given two time-ordered trees T and T ′ with the same number of leaves, doSPR (T, T ′ ) denotes the minimum number of ordered-SPR operations needed to transform T into T ′ . For example, in Figure 10, tree Ty is transformed to Tv by using one ordered-SPR operation which cuts the subtree consisting of leaves 3 and 6, and then grafts it into the edge leading to leaf 2. Given a length-ℓ sequence P = T1 , T2 . . . , Tℓ of time-ordered leaf-labeled trees with n leaves, we define Pℓ−1 L(P ) = i=1 doSPR (Ti , Ti+1 ). The following fact was established in Song and Hein (2005): For any n by m matrix M , if P = T1 , T2 , . . . , Tm is a sequence of time-ordered leaf-labeled trees in which Ti is compatible with site i in M , for all i = 1, . . . , m, then R1 (M ) ≤ L(P ). We can now show that R(Md ) ≤ 3d. Consider the three time-ordered leaf-labeled trees shown in Figure 10. Note that Tux is compatible with both columns u and x, Ty with column y, and Tv with column v. Let Pd be the length-6d sequence of trees containing d repetitions of P1 = 31 Table 1: The frequency of data sets (out of 10, 000) where a counterexample to the FullDecomposition Optimality Conjecture was observed. n m ρ #datasets with k ≥ 2 #counterexamples 15 20 10 389 0 15 30 10 465 2 20 20 10 527 0 20 30 10 546 3 15 20 15 372 1 15 30 15 388 1 20 20 15 482 1 20 30 15 396 0 The variable k denotes the number of non-trivial connected components in the incompatibility graph. We used Hudson’s MS (Hudson, 2002) to simulate n by m data sets for given recombination rate ρ. Tux Tux Ty Tv Tux Tux . Then, since doSPR (Tux , Ty ) = doSPR (Ty , Tv ) = doSPR (Tv , Tux ) = 1, we obtain L(Pd ) = 3d ≥ R1 (Md ). 5.4 Empirical Study To obtain some indication of how often the single-crossover Unrooted Full-Decomposition Optimality Conjecture is violated in practice, we carried out an small simulation study using the program beagle (Lyngso et al., 2005) which implements a branch and bound strategy to exactly compute R1 (M ). We determine if a given set of sequences M is a counterexample to the conjecture as follows. We first find the tree T from M , and then compute F 1 (M ) as detailed in Section 5.1. We use beagle to solve each of the rooted and unrooted problems that arise in that computation, and to compute R1 (M ). As detailed in Section 5.1, the procedure for computing F 1 (M ) is correct, and so M is a counterexample to the single-crossover, Unrooted Full-Decomposition Optimality Conjecture if and only if the computed value F 1 (M ) is strictly greater than R1 (M ). Let ρ := 4Ne c, where Ne denotes the effective population size and c the recombination rate per generation per sequence. For fixed n, m, and ρ, we used Hudson’s program MS (Hudson, 2002) to generate 10, 000 n by m simulated data sets generated under the coalescent model with recombination. Table 1 gives a summary of the number of times (out of 10, 000) that a counterexample was identified. (Note that we could not consider large data sets since beagle cannot compute the 32 minimum number of recombinations for large data sets.) As the table shows, counterexamples occur infrequently in the data that we examined. In these eight counterexamples, R1 (M ) ranges from three to nine; F 1 (M ) is always R(M ) + 1; and G(M ) always has two non-trivial connected components. Of course, we do not know how robust these empirical results are in other ranges of data, or to multiple-crossover recombination, or when there is a fixed ancestral sequence. Acknowledgments This research is supported in part by NSF grants CCF-0515278 and IIS-0513910 (DG and YSS), and by NIH grant 1K99-GM080099 (YSS). References Bafna, V. and Bansal, V. 2006a. Inference about recombination from haplotype data: Lower bounds and recombination hotspots. J. of Comp. Biology, 13, 501–521. Bafna, V. and Bansal, V. 2004. The number of recombination events in a sample history: conflict graph and lower bounds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1, 78–90. Bafna, V. and Bansal, V., 2006b. preprint. Bandelt, H. J., Foster, P., Sykes, B., and Richards, M. 1995. Mitochondrial portaits of human populations using median networks. Genetics, 141, 743–753. Chakravarti, A. 1998. It’s raining SNP’s, hallelujah? Nature Genetics, 19, 216–217. Clark, A. G. 2003. Finding genes underlying risk of complex disease by linkage disequilibrium mapping. Current Opinion in Genetics & Development, 13, 296–302. Felsenstein, J. Inferring Phylogenies. Sinauer, Sunderland, MA., 2004. Griffiths, R. C. and Marjoram, P. 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol., 3, 479–502. Gusfield, D. 1991. Efficient algorithms for inferring evolutionary history. Networks, 21, 19–28. Gusfield, D. 2005a. Optimal, efficient reconstruction of Root-Unknown phylogenetic networks with constrained and structured recombination. JCSS, 70, 381–398. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997. 33 Gusfield, D. On the decomposition optimality conjecture for phylogenetic networks. Technical report, UC Davis, Department of Computer Science, 2005b. Gusfield, D. and Bansal, V. A fundamental decomposition theory for phylogenetic networks and incompatible characters. In Proc. of RECOMB 2005: The 9th Ann. International Conference Research in Computational Molecular Biology, pages 217–232. Springer, LNBI 3500, 2005. Gusfield, D., Eddhu, S., and Langley, C. 2004a. The fine structure of galls in phylogenetic networks. INFORMS J. on Computing, special issue on Computational Biology, 16, 459–469. Gusfield, D., Eddhu, S., and Langley, C. 2004b. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinformatics and Computational Biology, 2,(1) 173–213. Gusfield, D., Hickerson, D., and Eddhu, S. 2007. An efficiently-computed lower bound on the number of recombinations in phylogenetic networks: Theory and empirical study. Discrete Applied Math, Special issue on Computational Biology, 2007, 155, 806–830. Hein, J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci, 98, 185–200. Hein, J. 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol., 36, 396–405. Hein, J., Schierup, M., and Wiuf, C. Gene Genealogies, Variation and Evolution: A primer in coalescent theory. Oxford University Press, UK, 2005. Hinds, D., Stuve, L., Nilsen, G., Halperin, E., Eskin, E., Gallinger, D., Frazer, K., and Cox, D. 2005. Whole-genome patterns of common DNA variation in three human populations. Science, 307, 1072–1079. Hudson, R. 2002. Generating samples under the Wright-Fisher neutral model of genetic variation. Bioinformatics, 18,(2) 337–338. Hudson, R. and Kaplan, N. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 111, 147–164. Huson, D. and Klopper, T. Beyond galled trees - decomposition and computation of galled networks. In Speed, T. and Huang, H., editors, Proc. of RECOMB 2007: The 11th Ann. International Conference Research in Computational Molecular Biology, pages 211–225. Springer, LNBI 4453, 2007. 34 Huson, D., Klopper, T., Lockhart, P., and Steel, M. Reconstruction of reticulate networks from gene trees. In Proc. of RECOMB 2005: The 9th Ann. International Conference Research in Computational Molecular Biology, pages 233–249. Springer, LNBI 3500, 2005. Kececioglu, J. D. and Gusfield, D. 1998. Reconstructing a history of recombinations from a set of sequences. Discrete Applied Math., 88, 239–260. Lyngso, R., Song, Y., and Hein, J. Minimum recombination histories by branch and bound. In Proceedings of Workshop on Algorithm of Bioinformatics (WABI) 2005, volume 3692, pages 239–250, Berlin, Germany, 2005. Springer-Verlag LNCS. Minichiello, M. and Durbin, R. 2006. Mapping trait loci using inferred ancestral recombination graphs. Am. J. Hum. Genet., 79, 910–922. Moret, B., Nakhleh, L., Warnow, T., Linder, C., Tholse, A., Padolina, A., Sun, J., and Timme, R. 2004. Phylogenetic networks: Modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pages 13–23. Morrison, D. 2005. Networks in phylogenetic analysis: new tools for population biology. International J. for Parasitology, 35, 567–582. Myers, S. The detection of recombination events using DNA sequence data. PhD thesis, University of Oxford, Oxford England, Department of Statistics, 2003. Myers, S. R. and Griffiths, R. C. 2003. Bounds on the minimum number of recombination events in a sample history. Genetics, 163, 375–394. Nakhleh, L., Sun, J., Warnow, T., Linder, C., Moret, B., and Tholse, A. Towards the development of computational tools for evaluating phylogenetic network reconstruction methods. In Proc. of 8’th Pacific Symposium on Biocomputing (PSB 03), pages 315-326, 2003. Nakhleh, L., Warnow, T., and Linder, C. Reconstructing reticulate evolution in species - theory and practice. In Proc. of 8’th Annual International Conference on Computational Molecular Biology, pages 337–346, 2004. Norborg, M. and Tavare, S. 2002. Linkage disequilibrium: what history has to tell us. Trends in Genetics, 18, 83–90. Posada, D. and Crandall, K. 2001. Intraspecific gene genealogies: trees grafting into networks. Trends in Ecology and Evolution, 16, 37–45. Semple, C. and Steel, M. Phylogenetics. Oxford University Press, UK, 2003. 35 Song, Y. S. 2006a. Properties of subtree-prune-and-regraft operations on totally-ordered phylogenetic trees. Annals of Combinatorics, 10, 129–146. Song, Y. 2006b. A concise necessary and sufficient condition for the existence of a galled-tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3, 186–191. Song, Y. S. and Hein, J. 2005. Constructing minimal ancestral recombination graphs. J. Comput. Biol, 12, 147–169. Song, Y. S. and Hein, J. Parsimonious reconstruction of sequence evolution and haplotype blocks: Finding the minimum number of recombination events. In Proc. of 2003 Workshop on Algorithms in Bioinformatics, Berlin, Germany, 2003. Springer-Verlag LNCS. Song, Y. S. and Hein, J. 2004. On the minimum number of recombination events in the evolutionary history of DNA sequences. Journal of Mathematical Biology, 48, 160–186. Song, Y. S., Wu, Y., and Gusfield, D. 2005. Efficient computation of close lower and upper bounds on the minimum number of needed recombinations in the evolution of biological sequences. Bioinformatics, 21, i413–i422. Bioinformatics Suppl. 1, Proceedings of ISMB 2005. Song, Y. S., Ding, Z., Gusfield, D., Langley, C. H., and Wu, Y. Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivation of SNP sequences in populations. In Proceedings of the 10’th Annual International Conference on Research in Computational Molecular Biology (RECOMB), volume 3909, pages 231–245. Springer LNBI, 2006. Steel, M., 2005. personal communication. Wang, L., Zhang, K., and Zhang, L. 2001. Perfect phylogenetic networks with recombination. Journal of Computational Biology, 8, 69–78. Wu, Y., 2005. Personal Communication. Wu, Y. Association mapping of complex diseases with ancestral recombination graphs: Models and efficient algorithms. In Speed, T. and Huang, H., editors, Proc. of RECOMB 2007: The 11th Ann. International Conference Research in Computational Molecular Biology, pages 488–502. Springer, LNBI 4453, 2007. Zollner, S. and Pritchard, J. 2005. Coalescent-based association mapping and fine mapping of complex trait loci. Genetics, 169, 1071–1092. 36

Log In

A Decomposition Theory for Phylogenetic Networks and Incompatible Characters