Counting and enumerating galled networks

Jeyaram Rathin

Discrete Applied Mathematics

arXiv:1812.08569v1 [q-bio.PE] 20 Dec 2018

Andreas DM Gunawan∗, Jeyaram Rathin†, Louxin Zhang‡

December 21, 2018

∗ Department of Mathematics, National University of Singapore, Singapore 119076.
† Department of Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore - 641004, India. This work was done when he visited the National University of Singapore as an exchange student.
‡ Department of Mathematics, National University of Singapore, Singapore 119076. Email: matzlx@nus.edu.sg

Abstract

Galled trees are widely studied as a recombination model in population genetics. This class of phylogenetic networks is generalized into galled networks by relaxing a structural condition. In this work, a linear recurrence formula is given for counting 1-galled networks, which are galled networks satisfying the condition that each reticulate node has only one leaf descendant. Since every galled network consists of a set of 1-galled networks stacked one on top of the other, a method is also presented to count and enumerate galled networks.

1 Introduction

Phylogenetic networks have been used increasingly often in evolutionary genomics and population genetics over the past two decades [12, 18]. A rooted phylogenetic network (RPN) is a rooted acyclic digraph in which all the sink nodes are of indegree 1 and there is a unique source node called the root. The sink nodes represent a set of taxa (e.g., species, genes, or individuals in a population) and the root represents the least common ancestor of the taxa. Moreover, every node of an RPN other than the leaves and the root is of either indegree 1 or outdegree 1; these nodes are called tree nodes and reticulate nodes, respectively.

Imposing topological conditions on the network allows us to define different classes of RPNs such as galled trees [13, 22], galled networks [15], tree-child networks [4], reticulation-visible networks [17] and tree-based networks [7, 24] (see also [21, 25]). A galled tree is a binary RPN such that (i) for each reticulate node u, with the parents being denoted as p′(u) and p′′(u), there are two edge-disjoint paths from the least common ancestor of p′(u) and p′′(u) to u that contain only tree nodes except for u, and (ii) for any two reticulate nodes, the paths in (i) do not overlap [22]. Later, galled networks were defined by Huson and Klöpper to be RPNs that satisfy only Property (i) [15]. Reconstruction of galled networks has also been studied in [16]. These network classes are of particular interest because they have nice combinatorial properties. Moreover, some important NP-complete problems related to phylogenetic trees and clusters can be solved in polynomial time when restricted to these classes [1, 9, 10, 23].

In this paper, we investigate how many galled networks exist over a set of taxa. Phylogenetic trees are RPNs without any reticulate node. It is well known that (2n − 3)!! binary phylogenetic trees exist over n taxa. However, counting becomes much harder for general RPNs. For example, even counting RPNs with a couple of reticulate nodes is challenging [8]. Recent advances in counting have been made for tree-child networks [8, 19] and galled trees [2, 20].

Here, we provide a linear recurrence formula for finding the number of 1-galled networks, which are galled networks such that each reticulate node has only one leaf descendant. The formula is obtained through a connection between galled networks and leaf-multi-labeled (LML) trees. Since a galled network is essentially a set of 1-galled networks stacked one on top of the other in a tree-like structure, we also present a general method for counting and enumerating galled networks. Although counting LML trees was investigated by Czabarka et al. [5], our results are not derived from their study.

The rest of this paper is divided into five sections.
Section 2 introduces some basic notation that is necessary for our study. Section 3 establishes the fact that 1-galled networks have a one-to-one correspondence with the so-called dup-trees. Section 4 presents a linear recurrence formula for counting 1-galled networks. Section 5 examines how to count and enumerate general galled networks. Section 6 concludes the study with a few remarks.

2 Basic notation

2.1 Phylogenetic networks

A binary RPN over a finite set of taxa X is an acyclic digraph such that:

• there is a unique node of indegree 0 and outdegree 2, called its root;
• there are exactly |X| nodes of indegree 1 and outdegree 0, called the leaves of the RPN, each labeled with a unique taxon in X; and
• each node that is neither a leaf nor the root is either a reticulate node of indegree 2 and outdegree 1, or a tree node of indegree 1 and outdegree 2.

Three RPNs are illustrated in Figure 1, where each edge is directed away from the root and edge orientation is therefore omitted.

For an RPN N, we use V(N) and A(N) to denote its node set and directed edge set, respectively. Let u and v be two nodes of N. The node u is said to be a parent (resp. a child) of v if (u, v) ∈ A(N) (resp. (v, u) ∈ A(N)). Each reticulate node r has a unique child and we denote this as c(r). Each tree node t has a unique parent and we denote this as p(t). In general, u is an ancestor of v (or equivalently, v is below u) if there is a directed path from the root of N to v that contains u.

A binary phylogenetic tree over a set of taxa X is simply a binary RPN containing no reticulate nodes.

Figure 1: RPNs and trees over {1, 2, 3}, where reticulate and tree nodes are drawn as filled triangles and open circles, respectively. (a) A binary RPN. (b) A binary galled network. (c) A binary phylogenetic tree. (d) A rooted binary dup-tree, where the labels ‘1’ and ‘3’ are duplicated labels.
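The degree conditions in the definition of a binary RPN can be checked mechanically. The following Python sketch is an illustration only: the function name `classify_rpn`, the edge-list input format, and the small example network are assumptions made here and do not come from the paper (in particular, the example is not claimed to be one of the networks of Figure 1).

```python
from collections import defaultdict, deque

# (indegree, outdegree) -> node kind, following the definition in Section 2.1.
SHAPES = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "reticulate"}

def classify_rpn(edges, taxa):
    """Classify the nodes of a binary RPN given as (parent, child) pairs.

    Leaves are identified with their taxon labels. Raises ValueError if a
    degree condition fails, the leaves do not match `taxa`, or the digraph
    is not acyclic and reachable from the root.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    children = defaultdict(list)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
        children[u].append(v)
    nodes = set(indeg) | set(outdeg)

    kinds = {kind: set() for kind in SHAPES.values()}
    for x in nodes:
        kind = SHAPES.get((indeg[x], outdeg[x]))
        if kind is None:
            raise ValueError(f"node {x!r}: indegree {indeg[x]}, outdegree {outdeg[x]}")
        kinds[kind].add(x)
    if len(kinds["root"]) != 1:
        raise ValueError("a binary RPN has exactly one root")
    if kinds["leaf"] != set(taxa):
        raise ValueError("the leaves must be labeled one-to-one by the taxa")

    # Kahn's algorithm: every node must be removable in topological order.
    remaining = {x: indeg[x] for x in nodes}
    queue, seen = deque(kinds["root"]), 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in children[u]:
            remaining[v] -= 1
            if remaining[v] == 0:
                queue.append(v)
    if seen != len(nodes):
        raise ValueError("the digraph contains a directed cycle")
    return kinds

# Hypothetical network over {1, 2, 3} with a single reticulate node "h".
EXAMPLE_EDGES = [("r", "a"), ("r", "b"), ("a", "1"), ("a", "h"),
                 ("b", "h"), ("b", "3"), ("h", "2")]
example_kinds = classify_rpn(EXAMPLE_EDGES, {"1", "2", "3"})
```

In this example `example_kinds["tree"]` contains the two tree nodes "a" and "b" and `example_kinds["reticulate"]` contains only "h".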
An RPN is said to be galled if every reticulate node r has an ancestor $a_r$ such that there are two edge-disjoint paths from $a_r$ to r that do not contain any reticulate nodes other than r. The RPN in Figure 1a is not galled but the one in Figure 1b is. By definition, every RPN with only one reticulate node or without reticulate nodes is galled.

2.2 Dup-trees

In this work, we will count galled networks through the connection between galled networks and the so-called LML trees. A rooted (resp. unrooted) LML tree is a binary rooted (resp. unrooted) tree with leaves that are labeled in such a way that several leaves may have an identical label. It is a dup-tree if at most two leaves have the same label. A rooted dup-tree is given in Figure 1d. Here, a phylogenetic tree is considered to be a trivial rooted dup-tree. The child–parent and ancestor–descendant relationships can be defined for nodes in a rooted dup-tree in the same way as in an RPN.

Let M be a dup-tree over X. A taxon x ∈ X is said to be a duplicated label for M if two distinct leaves labeled with x exist and a 1-label otherwise. $L_1(M)$ and $L_2(M)$ are used to denote the subsets of the 1-labels and duplicated labels in M, respectively.

A cherry in a dup-tree is a pair of leaves that are adjacent to a common non-leaf node. A cherry is said to be a twin-cherry if the two leaves belonging to it are labeled with a common taxon. A dup-tree is said to be twin-cherry-free if it does not contain any twin-cherries.

Let M be an unrooted LML tree over X, x ∈ X and e = (u, v) ∈ A(M). Grafting a new leaf x to e involves replacing e by a path consisting of the two edges (u, p) and (p, v), and attaching the leaf as the child of p, where p is a new node that is not in M. Conversely, for a leaf ℓ in M, its parent p(ℓ) is adjacent to two nodes x and y other than ℓ. Pruning ℓ from M means removing ℓ and p(ℓ) and any incident edges and then adding (x, y) as an edge to M.
In this work, we use M ⊕ (e, x) to denote the tree obtained by grafting x to e in M, or simply M ⊕ x when omitting e causes no confusion. Similarly, for a leaf ℓ in M, M ⊖ ℓ is used to denote the tree obtained from M by pruning ℓ.

2.3 Decomposition of galled networks into tree-components

Consider an RPN N. Let R(N) and L(N) denote the sets of reticulate nodes and leaves in N, respectively. The subnetwork N − (R(N) ∪ L(N)) is a forest in which each connected component consists of tree nodes. Each connected component is called a tree-component of N [11, 25]. Note that a tree-component does not contain any leaves. This is different from the definition of tree-components given in [10]. A reticulate node is inner if both its parents are in a common tree-component. It is a cross reticulate node otherwise. Galled networks have the following recursive characterization.

Theorem 1 Let G be a galled network.
(1) Each reticulate node is inner in G.
(2) For any r ∈ R(G), G − {r} consists of two connected components, and the component that contains all the descendants of r forms a galled subnetwork rooted at the child c(r) of r.

Proof. (1) It has been proven [10, Theorem 2] that a binary RPN is galled if and only if every reticulate node is inner. (2) The statement follows from Part (1).

An RPN is a 1-galled network if it is a galled network with only one tree-component. The RPN in Figure 1b is 1-galled. It is easy to derive the following facts from Theorem 1.

Corollary 1 Let N be an RPN.
(1) If there is only one tree-component in N, then N is galled.
(2) If every reticulate node has only one leaf descendant in N, then N is 1-galled.

3 Dup-trees and 1-galled networks

Let M be a dup-tree over X. Recall that $L_2(M)$ denotes the subset of duplicated labels. For each $x \in L_2(M)$, we use ℓ′(x) and ℓ′′(x) to denote the two leaves that are labeled with x. Let us assume that M is twin-cherry-free.
We derive an RPN N(M) by performing the following steps for each duplicated label x: (i) removing ℓ′(x) and ℓ′′(x), (ii) introducing a reticulate node $r_x$, (iii) connecting the parents p(ℓ′(x)) and p(ℓ′′(x)) of the two removed leaves to $r_x$, and (iv) attaching a leaf $\ell_x$ with the label x below $r_x$. Formally, $N(M) = (\bar{V}, \bar{A})$, where:

\[
\bar{V} = \left[ V(M) - \{\ell'(x), \ell''(x) \mid x \in L_2(M)\} \right] \cup \{ r_x, \ell_x \mid x \in L_2(M) \}, \tag{1}
\]

\[
\bar{A} = \left[ A(M) - \{ (p(\ell'(x)), \ell'(x)),\ (p(\ell''(x)), \ell''(x)) \mid x \in L_2(M) \} \right] \cup \{ (p(\ell'(x)), r_x),\ (p(\ell''(x)), r_x),\ (r_x, \ell_x) \mid x \in L_2(M) \}. \tag{2}
\]

If M is a phylogenetic tree, N(M) is just M. If M is a dup-tree containing at least one duplicated label, N(M) is then a 1-galled network containing as many reticulate nodes as there are duplicated labels in M. This transformation from an LML tree to a network is called the “folding” operation in [14].

Conversely, it is not hard to see that splitting each reticulate node in a 1-galled network N results in a dup-tree M such that N(M) = N. This proves the following statement:

Theorem 2 Let X be a finite set and r ≥ 1. There is a one-to-one correspondence between

• the binary twin-cherry-free dup-trees with r duplicated labels over X, and
• the binary 1-galled networks with r reticulate nodes over X.

Note that the 1-galled network in Figure 1b corresponds to the dup-tree in Figure 1d.

4 Counting 1-galled networks

Without loss of generality, we set [k] = {1, 2, ..., k}. We adopt the following notation:

• $\mathcal{T}_k$ is the set of phylogenetic trees over [k].
• $\mathcal{UT}_k$ is the set of binary unrooted trees over [k].
• $\mathcal{D}_{i,k}$ is the set of rooted dup-trees M over [k] such that M is twin-cherry-free and $L_2(M) = [i]$, where 1 ≤ i ≤ k.
• $\mathcal{UD}_{i,k}$ is the set of unrooted dup-trees M over [k] such that M is twin-cherry-free and $L_2(M) = [i]$, where 1 ≤ i < k.
• $\mathcal{G}_{i,k}$ is the set of 1-galled networks over [k] that have exactly i reticulate nodes, the child of each reticulate node being a leaf labeled with a unique element of [i].

Lemma 1 For any k ≥ 1,

\[
|\mathcal{T}_k| = |\mathcal{UT}_{k+1}| = (2k-3)!! = (2k-3) \times (2k-5) \times \cdots \times 3 \times 1, \tag{3}
\]

\[
|\mathcal{G}_{i,k}| = |\mathcal{D}_{i,k}| = |\mathcal{UD}_{i,k+1}|, \quad i \le k. \tag{4}
\]

Proof. The first equation is well known (see [21, page 16]). Similarly, the second equation follows from Theorem 2, together with the correspondence between rooted trees over [k] and unrooted trees over [k + 1] that underlies the first equation.

4.1 A recursive formula for $|\mathcal{UD}_{i,k}|$ and $|\mathcal{G}_{i,k}|$

It is well known that every binary unrooted tree over [k + 1] can be obtained from a unique binary unrooted tree over [k] by inserting the leaf labeled with k + 1 on an edge of the latter. In this section, we generalize this fact to give a recurrence formula for $|\mathcal{UD}_{i,k}|$ and $|\mathcal{G}_{i,k}|$.

As a warm-up, we first count the dup-trees in $\mathcal{UD}_{1,k}$ for k ≥ 2. For simplicity, we set $\mathcal{UD}_{0,k} = \mathcal{UT}_k$. Let $M \in \mathcal{UD}_{1,k}$. Since M is twin-cherry-free, the leaves labeled with 1 are not siblings and thus M can be partitioned into three parts $(M_1, M_2, M_3)$ as illustrated in Figure 2. Pruning different leaves labeled with 1 from M results in different trees in $\mathcal{UD}_{0,k}$ if the middle subtree $M_3$ is not empty, and the same tree otherwise. This suggests that grafting an extra leaf labeled with 1 into every edge in each tree in $\mathcal{UD}_{0,k}$ generates every tree in $\mathcal{UD}_{1,k}$ twice. Note that if we graft the extra leaf into the edge incident to the original leaf labeled with 1, we get a dup-tree in which the two leaves of label 1 form a twin-cherry, which is not in $\mathcal{UD}_{1,k}$.

Figure 2: A dup-tree M in $\mathcal{UD}_{1,k}$ can be partitioned into the three subtrees $M_1$, $M_2$, $M_3$, where only the middle subtree can be empty. Pruning different leaves labeled 1 from M results in two distinct trees in $\mathcal{UD}_{0,k}$ if the middle subtree is non-empty, and an identical tree otherwise.

Since every unrooted binary tree in $\mathcal{UD}_{0,k}$ has 2k − 3 edges, of which 2k − 4 are not incident to the leaf labeled 1, we have:

\[
2\,|\mathcal{UD}_{1,k}| = (2k-4) \cdot |\mathcal{UD}_{0,k}|.
\]

Therefore, by Lemma 1, we obtain:

\[
|\mathcal{UD}_{1,k}| = (k-2) \cdot (2k-5)!!. \tag{5}
\]

In the rest of this section, we will focus on the case where i > 1.
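The double counting behind Eqn. (5) can be replayed by brute force for small k. The sketch below is an illustration under assumptions of our own: an unrooted tree over [k] is stored as the rooted tree over [k − 1] obtained by rooting it at leaf k (the correspondence used in Lemma 1), trees are nested tuples with children kept in a canonical order, and the helper names `node`, `grafts`, `rooted_trees` and `ud1_multiplicities` are hypothetical.

```python
from collections import Counter

def node(a, b):
    """Unordered internal node; children are stored in a canonical order."""
    return tuple(sorted((a, b), key=repr))

def grafts(t, x):
    """Yield (subtree below the edge, new tree) for grafting leaf x on each
    edge of t, including the edge that enters the root of t."""
    yield t, node(t, x)
    if isinstance(t, tuple):
        a, b = t
        for below, a2 in grafts(a, x):
            yield below, node(a2, b)
        for below, b2 in grafts(b, x):
            yield below, node(a, b2)

def rooted_trees(m):
    """All binary phylogenetic trees over [m], built by successive grafting."""
    trees = {1}
    for x in range(2, m + 1):
        trees = {t2 for t in trees for _, t2 in grafts(t, x)}
    return trees

def ud1_multiplicities(k):
    """Graft a second leaf 1 on every edge except the one entering leaf 1,
    and record how often each resulting dup-tree is produced."""
    mult = Counter()
    for t in rooted_trees(k - 1):
        for below, t2 in grafts(t, 1):
            if below != 1:          # skipping this edge avoids a twin-cherry
                mult[t2] += 1
    return mult

counts = {k: len(ud1_multiplicities(k)) for k in range(3, 7)}
# counts == {3: 1, 4: 6, 5: 45, 6: 420}, i.e. (k - 2) * (2k - 5)!!
```

Every dup-tree recorded by `ud1_multiplicities` is produced exactly twice, which is the factor 2 on the left-hand side of the identity preceding Eqn. (5).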
The analysis for this case is more subtle than what we have done so far. Let $M \in \mathcal{UD}_{i,k}$, where i > 1.

First, we have to graft a leaf labeled with i into a twin-cherry in a dup-tree T over [k] such that $L_2(T) = [i-1]$ to get some dup-tree in $\mathcal{UD}_{i,k}$, as illustrated in Figure 3. In the dup-tree on the top in Figure 3, a leaf labeled with 3 is in the twin-cherry consisting of leaves labeled 1, whereas another is in the twin-cherry consisting of leaves labeled with 2. In this case, we have the following fact.

Lemma 2 Let T be a dup-tree over [k] such that $L_2(T) = [i-1]$, i ≤ k. If T contains a unique twin-cherry, then grafting a leaf labeled with i into either edge in the twin-cherry will produce the same tree in $\mathcal{UD}_{i,k}$.

Figure 3: Grafting the second copy of Leaf 3 into either edge in the unique twin-cherry (circled) in a dup-tree T (bottom) produces the same dup-tree (top), where $L_2(T) = \{1, 2\}$.

Conversely, consider an unrooted dup-tree $M \in \mathcal{UD}_{i,k}$. For a non-leaf node u and a node v that is adjacent to u, we use $M_u(v)$ to denote the connected component containing v in M − u and call it a subtree adjacent to u. The node u ∈ V(M) is said to be a duplication node if it is adjacent to two nodes v′ and v′′ such that $M_u(v')$ and $M_u(v'')$ are identical as rooted trees; in other words, there is a bijection $f : V(M_u(v')) \to V(M_u(v''))$ such that (i) f(v′) = v′′, (ii) $(x, y) \in A(M_u(v'))$ if and only if $(f(x), f(y)) \in A(M_u(v''))$, and (iii) x is a leaf if and only if f(x) is a leaf labeled with the same taxon, where $x, y \in V(M_u(v'))$. $M_u(v')$ and $M_u(v'')$ are called the conjugate subtrees of u if u is a duplication node. Two edges that correspond to each other under f are also said to be conjugate.

Lemma 3 Let M be a dup-tree that may contain twin-cherries and $L_2(M) = [i]$.
(1) The non-leaf node in a twin-cherry is a duplication node.
(2) For different duplication nodes u and v in M, their conjugate subtrees are disjoint.

Proof. (1) This follows from the definition of a duplication node.
(2) For different duplication nodes u and v, the conjugate subtrees associated with u contain leaves with labels that are different from the labels appearing in the conjugate subtrees associated with v, as each duplicated element labels exactly two leaves.

Lemma 2 can now be generalized as follows.

Lemma 4 Let $M \in \mathcal{UD}_{i,k}$, where i ≤ k, and let u be a duplication node of M with the conjugate subtrees M′ and M′′. Grafting the second leaf labeled with i + 1 into an edge e in M′ will produce the same tree as grafting the leaf into the edge conjugate to e in M′′.

Second, some unrooted dup-trees in $\mathcal{UD}_{i+1,k}$ are generated three or four times by grafting a new leaf labeled with i + 1 in a dup-tree. Specifically, we have the following fact:

Lemma 5 Let $M \in \mathcal{UD}_{i+1,k}$, i < k.
(1) If ℓ′(i + 1) and ℓ′′(i + 1) are in the conjugate subtrees of a duplication node in M, then M ⊖ ℓ′(i + 1) = M ⊖ ℓ′′(i + 1), from which M can only be generated by grafting a leaf labeled with i + 1 in a unique edge.
(2) If neither ℓ′(i + 1) nor ℓ′′(i + 1) is in the conjugate subtrees of any duplication node, M ⊖ ℓ′(i + 1) and M ⊖ ℓ′′(i + 1) are different dup-trees in $\mathcal{UD}_{i,k}$ if and only if p(ℓ′(i + 1)) and p(ℓ′′(i + 1)) are not adjacent.
(3) If ℓ′(i + 1) is not in a conjugate subtree of any duplication node and if pruning ℓ′(i + 1) from M does not produce a new duplication node, then M can be obtained from M ⊖ ℓ′(i + 1) by grafting a leaf labeled with i + 1 in a unique edge.
(4) If ℓ′(i + 1) is not in a conjugate subtree of any duplication node but M ⊖ ℓ′(i + 1) contains one duplication node that is not a duplication node in M, then M can be obtained by grafting a leaf labeled with i + 1 in two different edges in M ⊖ ℓ′(i + 1).

Figure 4: An unrooted dup-tree in $\mathcal{UD}_{i+1,k}$ can be generated by grafting a leaf labeled with i + 1 once (a), three times (b) and four times (c) from a dup-tree M such that $L_2(M) = [i]$ and M contains at most one twin-cherry. Here, the circled subtrees consist of a duplication node and the associated conjugate subtrees. (a) The right-hand unrooted dup-tree is in $\mathcal{UD}_{3,5}$ and has its two Leaves 3 in the conjugate subtrees of a duplication node. It can only be generated by grafting a leaf labeled with 3 in a unique edge in a unique dup-tree on the left. (b) Neither of the leaves labeled with 3 is in a conjugate subtree in the right-hand dup-tree. Pruning the left-hand leaf labeled with 3 gives a dup-tree (left) that contains a new duplication node, but pruning the right-hand leaf labeled with 3 does not. Conversely, the right-hand dup-tree can be generated from the left-hand dup-trees (in $\mathcal{UD}_{2,5}$) by grafting a Leaf 3 in three ways. (c) Neither of the leaves labeled with 4 is in the conjugate subtrees of a duplication node in the right-hand dup-tree, but pruning each of the leaves gives a dup-tree (left) that contains a new duplication node. Conversely, the right-hand dup-tree can be generated from two left-hand dup-trees (in $\mathcal{UD}_{3,5}$) by grafting a Leaf 4 four times.

By Lemma 5, an unrooted dup-tree in $\mathcal{UD}_{i+1,k}$ can be generated by grafting a leaf labeled with i + 1 once or twice in a unique dup-tree, twice in two different dup-trees, or three or four times in two dup-trees, as illustrated in Figure 4.

We are now ready to establish a formula for $|\mathcal{UD}_{i+1,k}|$.
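The multiplicities described in Lemma 5 can be observed on a small case. The following Python sketch tallies, for i = 1 and k = 4, how often each dup-tree with duplicated labels {1, 2} arises when a second leaf labeled 2 is grafted (a) on every edge of a twin-cherry-free dup-tree except the edge entering the existing leaf 2, and (b) on both edges of the twin-cherry of a dup-tree with exactly one twin-cherry. The representation (unrooted trees over [k] stored as rooted trees over [k − 1], rooted at leaf k, as nested tuples) and all function names are assumptions made for this illustration.

```python
from collections import Counter

def node(a, b):
    """Unordered internal node; children are stored in a canonical order."""
    return tuple(sorted((a, b), key=repr))

def grafts(t, x, sibling=None):
    """Yield (below, sibling, new tree) for grafting leaf x on each edge of t.
    `below` is the subtree hanging from the edge, `sibling` its sibling subtree
    (None for the edge entering the root of t)."""
    yield t, sibling, node(t, x)
    if isinstance(t, tuple):
        a, b = t
        for below, sib, a2 in grafts(a, x, b):
            yield below, sib, node(a2, b)
        for below, sib, b2 in grafts(b, x, a):
            yield below, sib, node(a, b2)

def twin_cherries(t):
    """Number of twin-cherries in t."""
    if not isinstance(t, tuple):
        return 0
    a, b = t
    here = int(not isinstance(a, tuple) and a == b)
    return here + twin_cherries(a) + twin_cherries(b)

def dup_trees(i, m):
    """All rooted dup-trees over [m] whose duplicated labels are 1, ..., i
    (twin-cherries allowed)."""
    trees = {1}
    for x in list(range(2, m + 1)) + list(range(1, i + 1)):
        trees = {t2 for t in trees for _, _, t2 in grafts(t, x)}
    return trees

def grafting_multiplicities(i, k):
    """How often each dup-tree with duplicated labels [i + 1] over [k] is
    generated; assumes i + 1 < k so that leaf i + 1 is not the rooting leaf."""
    x = i + 1
    mult = Counter()
    for t in dup_trees(i, k - 1):
        n_twin = twin_cherries(t)
        for below, sib, t2 in grafts(t, x):
            if n_twin == 0 and below != x:
                mult[t2] += 1       # case (a): twin-cherry-free source tree
            elif n_twin == 1 and not isinstance(below, tuple) and below == sib:
                mult[t2] += 1       # case (b): an edge of the unique twin-cherry
    return mult

mult = grafting_multiplicities(1, 4)
histogram = sorted(Counter(mult.values()).items())
# histogram == [(1, 1), (2, 16), (3, 3)]: 20 distinct dup-trees, 42 graftings
```

For this small case no dup-tree is generated four times; Figure 4(c) shows that this does happen for larger parameters.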
We will use the following parameters:

• $\mathcal{C}_{i,k}$: the set of unrooted dup-trees T over [k] such that $L_2(T) = [i]$ and T contains only one twin-cherry.
• $O_1$: the number of unrooted dup-trees T in $\mathcal{UD}_{i+1,k}$ such that the two leaves labeled with i + 1 are in the conjugate subtrees of a duplication node.
• $O_3$: the number of unrooted dup-trees T in $\mathcal{UD}_{i+1,k}$ such that the removal of one leaf labeled with i + 1 gives a dup-tree with one more duplication node than T, but the removal of the other does not change the duplication nodes.
• $O_4$: the number of unrooted dup-trees T in $\mathcal{UD}_{i+1,k}$ such that the removal of either of the leaves labeled with i + 1 gives a dup-tree with one more duplication node than T.

To generate all the unrooted dup-trees in $\mathcal{UD}_{i+1,k}$, we graft another leaf labeled with i + 1 in all but the edge incident to Leaf i + 1 in each dup-tree in $\mathcal{UD}_{i,k}$ and in each edge of the unique twin-cherry in each dup-tree in $\mathcal{C}_{i,k}$. Since each dup-tree in $\mathcal{UD}_{i,k}$ has k + i leaves and 2(k + i) − 3 edges, of which 2(k + i) − 4 are not incident to Leaf i + 1, we have the following identity:

\[
2(k+i-2)\,|\mathcal{UD}_{i,k}| + 2\,|\mathcal{C}_{i,k}| = 2\,|\mathcal{UD}_{i+1,k}| - O_1 + O_3 + 2 O_4. \tag{6}
\]

Lemma 6 Let i < k. We then have:

\[
|\mathcal{C}_{i,k}| = i \cdot |\mathcal{UD}_{i-1,k}|, \tag{7}
\]

\[
O_1 = \sum_{1 \le d \le i} \binom{i}{d} \cdot (2d-1)!! \cdot |\mathcal{UD}_{i-d,k-d}|, \tag{8}
\]

\[
O_3 + 2 O_4 = \sum_{1 \le d \le i} \binom{i}{d} \cdot (2d-1)!! \cdot |\mathcal{UD}_{i-d,k-d+1}|. \tag{9}
\]

Proof. For any S ⊆ [k] of j elements, the dup-trees T over [k] such that $L_2(T) = S$ have a one-to-one correspondence with the dup-trees T′ over [k] such that $L_2(T') = [j]$. The dup-trees in $\mathcal{C}_{i,k}$ whose only twin-cherry consists of leaves labeled with i′ (i′ ≤ i) can be generated by grafting Leaf i′ in the unique edge incident to Leaf i′ in every twin-cherry-free dup-tree over [k] such that $L_2 = [i] - \{i'\}$. Taken together, these two facts imply Eqn. (7).

Note that $O_1$ is equal to the number of dup-trees over [k] in which the two Leaves i + 1 appear in the conjugate subtrees of a duplication node. We assume that T is such a dup-tree over [k] and u is the duplication node whose conjugate subtrees contain the leaves labeled with i + 1. Let T′ = T ⊖ {ℓ′(i + 1), ℓ′′(i + 1)}. T′ is then a dup-tree over [k] − {i + 1} such that $L_2(T') = [i]$ and u remains a duplication node in T′.

Conversely, let T′′ be a dup-tree over [k] − {i + 1} such that $L_2(T'') = [i]$. If T′′ contains a duplication node u whose conjugate subtrees have d leaves each, then by simultaneously grafting two leaves labeled with i + 1 in each of the 2d − 2 pairs of conjugate edges in the conjugate subtrees of u, as well as in the two edges incident to u, we obtain 2d − 1 dup-trees over [k] such that u is still a duplication node. Note that there are (2d − 3)!! rooted binary trees with d leaves, and removing the conjugate subtrees from T′′ and treating u as a leaf with a new label generates a dup-tree D with (k − 1) − d + 1 labels, i − d of which are duplicated labels. Summing over all possible d values from 1 to i, we obtain:

\[
\begin{aligned}
O_1 &= \sum_{1 \le d \le i} \#(d\text{-element subsets of } [i]) \cdot \bigl(1 + \#(\text{edges in a tree in } \mathcal{T}_d)\bigr) \cdot |\mathcal{T}_d| \cdot |\mathcal{UD}_{i-d,k-d}| \\
    &= \sum_{1 \le d \le i} \binom{i}{d} (2d-1) \cdot (2d-3)!! \cdot |\mathcal{UD}_{i-d,k-d}|.
\end{aligned}
\]

Since (2d − 1) · (2d − 3)!! = (2d − 1)!!, Eqn. (8) holds.

To prove Eqn. (9), for any S ⊆ [i] of d labels we let $P_{i,k}(S)$ be the set of the unrooted dup-trees T over [k] in which $L_2(T) = [i]$ and there is a duplication node u whose conjugate subtrees have leaves with labels in S. Grafting a new leaf labeled with i + 1, ℓ′(i + 1), in each edge in a fixed conjugate subtree of u as well as in an edge incident to u gives 2d − 1 dup-trees T′ such that T′ ⊖ ℓ′(i + 1) = T. Clearly, T contains one more duplication node than such a T′. Note that the removal of the conjugate subtrees of u transforms each tree in $P_{i,k}(S)$ into a tree T′′ over {u} ∪ [k] − S such that $L_2(T'') = [i] - S$ if u is considered to be a labeled leaf. Thus,

\[
|P_{i,k}(S)| = (2d-1) \cdot (2d-3)!! \cdot |\mathcal{UD}_{i-d,k-d+1}|.
\]

Conversely, let $T \in \mathcal{UD}_{i+1,k}$. If T ⊖ ℓ′(i + 1) and T ⊖ ℓ′′(i + 1) both contain a new duplication node compared with T, T can be generated by grafting from two different dup-trees in $\cup_{S \subseteq [i]} P_{i,k}(S)$. Therefore,

\[
\sum_{d=1}^{i} \sum_{S \subseteq [i],\, |S| = d} |P_{i,k}(S)| = \sum_{d=1}^{i} \binom{i}{d} \cdot (2d-1)!! \cdot |\mathcal{UD}_{i-d,k-d+1}| = O_3 + 2 O_4.
\]

This proves Eqn. (9).

By plugging Eqns. (7)–(9) into Eqn. (6), we obtain the following recursive formula for $|\mathcal{UD}_{i+1,k}|$.

Theorem 3 Let i < k and $N_k^{(i)} = |\mathcal{UD}_{i,k}| = |\mathcal{G}_{i,k-1}|$. We then have:

\[
N_k^{(i+1)} = (k+i-2)\, N_k^{(i)} + i\, N_k^{(i-1)} + \frac{1}{2} \sum_{1 \le d \le i} \binom{i}{d} (2d-1)!! \left( N_{k-d}^{(i-d)} - N_{k-d+1}^{(i-d)} \right). \tag{10}
\]

Example 4.1 For k = 4, we have:

\[
\begin{aligned}
N_4^{(0)} &= (2 \times 4 - 5)!! = 3, \\
N_4^{(1)} &= (4 - 2)\, N_4^{(0)} = 6, \\
N_4^{(2)} &= 3 N_4^{(1)} + N_4^{(0)} + \frac{1}{2} \left( N_3^{(0)} - N_4^{(0)} \right) = 20, \\
N_4^{(3)} &= 4 N_4^{(2)} + 2 N_4^{(1)} + \frac{1}{2} \left[ \binom{2}{1} \left( N_3^{(1)} - N_4^{(1)} \right) + \binom{2}{2}\, 3!! \left( N_2^{(0)} - N_3^{(0)} \right) \right] = 87.
\end{aligned}
\]

Table 1 lists the values of $N_k^{(i)}$ for i and k such that 0 ≤ i < k and 2 ≤ k ≤ 11.

Table 1: The values of $N_k^{(i)}$ for 0 ≤ i < k and 2 ≤ k ≤ 11 (rows: i; columns: k).

i \ k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11
0 | 1 | 1 | 3 | 15 | 105 | 945 | 10,395 | 135,135 | 2,027,025 | 34,459,425
1 | 0 | 1 | 6 | 45 | 420 | 4,725 | 62,370 | 945,945 | 16,216,200 | 310,134,825
2 | - | 3 | 20 | 189 | 2,160 | 28,875 | 442,260 | 7,640,325 | 147,026,880 | 3,119,591,475
3 | - | - | 87 | 993 | 13,407 | 207,135 | 3,603,915 | 69,757,065 | 1,487,243,835 | 34,639,019,415
4 | - | - | - | 6,249 | 97,182 | 1,701,855 | 33,121,890 | 709,428,825 | 16,587,636,030 | 420,498,508,815
5 | - | - | - | - | 804,585 | 15,738,765 | 338,588,685 | 7,946,584,695 | 202,099,078,125 | 5,537,451,658,725
6 | - | - | - | - | - | 161,685,045 | 3,808,469,970 | 97,162,333,695 | 2,669,506,204,050 | 78,595,220,899,125
7 | - | - | - | - | - | - | 46,726,507,485 | 1,287,228,175,065 | 37,987,475,258,565 | 1,195,779,444,849,670
8 | - | - | - | - | - | - | - | 18,363,976,595,055 | 579,247,192,040,580 | 19,410,597,807,225,300
9 | - | - | - | - | - | - | - | - | 9,420,991,174,195,960 | 334,803,875,697,765,000
10 | - | - | - | - | - | - | - | - | - | 6,114,381,201,716,870,000

4.2 A formula for counting 1-galled networks

A 1-galled network over [k] may contain 0 to k reticulate nodes.
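As a sanity check on Theorem 3, the recurrence in Eqn. (10) can be evaluated mechanically. The Python sketch below is an illustration rather than part of the derivation: the function names are our own, the second argument order `N(i, k)` is an arbitrary choice, and the base case uses the convention (−1)!! = 1, which Example 4.1 and Table 1 rely on for $N_2^{(0)} = 1$.

```python
from functools import lru_cache
from math import comb

def odd_double_factorial(m):
    """m!! for odd m, with m!! = 1 whenever m <= 1 (so (-1)!! = 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

@lru_cache(maxsize=None)
def N(i, k):
    """N_k^{(i)} for k >= 2 and 0 <= i < k, computed with Eqn. (10)."""
    if k < 2 or not 0 <= i < k:
        raise ValueError("N(i, k) is evaluated here only for k >= 2 and 0 <= i < k")
    if i == 0:
        return odd_double_factorial(2 * k - 5)      # |UT_k| = (2k - 5)!!
    j = i - 1                                       # Eqn. (10) gives N^{(j+1)}
    correction = sum(
        comb(j, d) * odd_double_factorial(2 * d - 1)
        * (N(j - d, k - d) - N(j - d, k - d + 1))
        for d in range(1, j + 1)
    )
    assert correction % 2 == 0                      # the 1/2 factor must be exact
    total = (k + j - 2) * N(j, k) + correction // 2
    if j >= 1:
        total += j * N(j - 1, k)
    return total

def table_row(i, k_max):
    """Row i of Table 1 for k = max(2, i + 1), ..., k_max."""
    return [N(i, k) for k in range(max(2, i + 1), k_max + 1)]

column_k4 = [N(i, 4) for i in range(4)]   # [3, 6, 20, 87], as in Example 4.1
row_i2 = table_row(2, 8)                  # [3, 20, 189, 2160, 28875, 442260]
```

Memoization via `lru_cache` keeps the evaluation cheap, since each $N_k^{(i)}$ depends on entries with smaller i in columns k − d and k − d + 1.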
Since the 1-galled networks with i reticulate nodes have a one-to-one correspondence with the rooted dup-trees over [k] that have i duplicated labels, they have a one-to-one correspondence with the unrooted dup-trees over [k + 1] that have i duplicated labels. Therefore, we have the following theorem.

Theorem 4 Let $G_1(k)$ denote the number of 1-galled networks over k taxa. We then have:

\[
G_1(k) = \sum_{i=0}^{k} \binom{k}{i} N_{k+1}^{(i)}, \tag{11}
\]

where $N_{k+1}^{(i)}$ is defined in Eqn. (10).

Example 4.1 (cont'd) By Theorem 4, the number of 1-galled networks on three taxa is:

\[
N_4^{(0)} + \binom{3}{1} N_4^{(1)} + \binom{3}{2} N_4^{(2)} + \binom{3}{3} N_4^{(3)} = 3 + 18 + 60 + 87 = 168.
\]

All 34 topological structures of these 168 1-galled networks are drawn in Figure S1.

5 Counting general galled networks

5.1 Compression of galled networks

The technique of network decomposition was first introduced to study two algorithmic problems for RPNs in [10]. Recently, component-wise compression was formally investigated to reveal the connection between several classes of RPNs [11]. Intuitively, compressing an RPN N involves replacing every component in N with a node of degree 2 or more, thereby creating a smaller network Ñ that summarizes the relationships among tree-components in N. The node and edge sets of the compression network Ñ of N are rigorously defined as follows:

\[
V(\tilde{N}) = L(N) \cup \{ v_\tau : \tau \text{ is a tree- or reticulation component in } N \},
\]

\[
E(\tilde{N}) = \{ (v_\tau, \ell) : \ell \in L(N) \text{ and } p(\ell) \in \tau \} \cup \{ (v_\tau, v_{\tau'}) : \text{there is } (x, y) \in E(N) \text{ such that } x \in \tau,\ y \in \tau' \},
\]

where p(ℓ) denotes the parent of the leaf ℓ. The operation of network compression is illustrated in Figure 5.

In a galled network, each reticulate node is inner and thus both its parents are in a common tree-component. Therefore, a tree-component becomes a node with at least two children and each reticulate node becomes a node of indegree 1 and outdegree 1 after the tree-components are compressed. Thus, the compression of a galled network is a tree (Theorem 3.1, [11]) (see Figure 5), implying that a galled network consists of a set of 1-galled networks stacked one on top of the other in a tree shape.

Figure 5: Illustration of the network compression operation. (a) A galled network over [8]. It has four tree-components. (b) The compression of the network in (a). It is a rooted tree in which each reticulate node becomes a node of indegree 1 and outdegree 1.

5.2 A counting method

We are now ready to count general galled networks over [k]. Let $\mathcal{A}_k$ be the set of (not necessarily binary) phylogenetic trees over [k] in which every non-leaf node has two or more children. Assume that $T \in \mathcal{A}_k$. For a non-leaf node v ∈ V(T), we use $c_{lf}(v)$ and $c_{nlf}(v)$ to denote the numbers of leaf and non-leaf children of v in T, respectively, and define $c(v) = c_{lf}(v) + c_{nlf}(v)$. Clearly, c(v) is the number of children of v in T.

Consider a binary galled network N over [k]. By Theorem 3.1 in [11], the compression C(N) of N is a tree over [k]. A node of indegree 1 and outdegree 1 in C(N) corresponds one-to-one to a reticulate node in N, whereas a tree node with two or more children in C(N) corresponds one-to-one to a tree-component in N. For convenience, we suppress all the nodes of indegree 1 and outdegree 1 in C(N) to get a rooted tree $C'(N) \in \mathcal{A}_k$. Clearly, the tree-components of N are still in one-to-one correspondence with the tree nodes in C′(N). By reverse-engineering this process, we can enumerate and count general galled networks over [k], as all possible general rooted trees over [k] can be enumerated and counted recursively [6].

Theorem 5 Let G(n) be the number of galled networks over n taxa. We then have:

\[
G(n) = \sum_{T \in \mathcal{A}_n} \prod_{v \in I(T)} \left( \sum_{j = c_{nlf}(v)}^{c(v)} \binom{c_{lf}(v)}{c(v) - j} N_{c(v)+1}^{(j)} \right), \tag{12}
\]

where I(T) denotes the set of non-leaf nodes in T and $N_{c(v)+1}^{(j)}$ is defined in Eqn. (10).

Proof. Let $T \in \mathcal{A}_n$ be such that C′(N) = T for some galled network N. Consider a non-leaf node v in T. We first consider how to reconstruct the tree-component σ of N that corresponds to v. Let R denote the set of reticulate nodes in N whose parents are both in σ. For each child u of v that is a non-leaf, the root of the tree-component corresponding to u must be a child of a reticulate node in R. However, for each child ℓ of v that is a leaf, the parent of ℓ in N may be a tree node in σ or a reticulate node in R. Therefore, $c_{nlf}(v) \le |R| \le c(v)$ and R has $\binom{c_{lf}(v)}{|R| - c_{nlf}(v)}$ possibilities, which equals $\binom{c_{lf}(v)}{c(v) - |R|}$ by the symmetry of binomial coefficients. For each possible selection of R, σ corresponds to a 1-galled network with c(v) leaves and |R| reticulate nodes. Thus, the component σ corresponding to v has $\sum_{j = c_{nlf}(v)}^{c(v)} \binom{c_{lf}(v)}{c(v) - j} N_{c(v)+1}^{(j)}$ choices.

Since the reconstructions of two distinct tree-components in N are independent of each other, the number of galled networks whose compression corresponds to T is $\prod_{v \in I(T)} \sum_{j = c_{nlf}(v)}^{c(v)} \binom{c_{lf}(v)}{c(v) - j} N_{c(v)+1}^{(j)}$. Hence, the theorem follows.

The numbers G(n) of galled networks over n taxa were calculated according to Theorem 5 and are listed in Table 2. For example, G(3) = 240. This implies that there are 240 − 168 = 72 galled networks with two tree-components over three taxa, the topological structures of which are listed in Figure S2.

Table 2: The values of G(n) for 1 ≤ n ≤ 10.

n | G(n)
1 | 1
2 | 6
3 | 240
4 | 20,502
5 | 2,868,990
6 | 589,130,280
7 | 167,357,180,970
8 | 63,356,654,623,500
9 | 31,092,212,800,634,500
10 | 19,327,089,427,089,400,000

6 Conclusion

We have presented a linear recurrence formula for counting all possible 1-galled networks and a method for counting and enumerating general galled networks. We conclude the study with a couple of remarks.
First, using the same counting technique as in Section 4.1, we can derive the following recurrence formula for the number of all unrooted dup-trees T over [k] such that $L_2(T) = [i]$, denoted by $B_k^{(i)}$:

\[
\begin{aligned}
B_k^{(0)} &= (2k-5)!!, \qquad B_k^{(1)} = (k-1) \cdot (2k-5)!!, \\
B_k^{(i+1)} &= (i+k-1)\, B_k^{(i)} + \frac{1}{2} \sum_{1 \le d \le i} \binom{i}{d} \cdot (2d-1)!! \cdot \left( B_{k-d}^{(i-d)} - B_{k-d+1}^{(i-d)} \right).
\end{aligned}
\tag{13}
\]

Second, galled networks form a subclass of reticulation-visible networks. We therefore pose counting reticulation-visible networks as an open question.

Acknowledgements

The authors thank HW Yan and Jonathan M. Woenardi for participating in discussions on this work. This work was supported by Singapore Ministry of Education Academic Research Fund Tier-1 [grant R-146-000-238-114] and National Research Fund [grant NRF2016NRFNSFC001-026].

References

[1] Bordewich, M., Semple, C.: Reticulation-visible networks. Adv. Applied Math. 78, 114–141 (2016)
[2] Bouvel, M., Gambette, P., Mansouri, M.: Counting level-k phylogenetic networks. In preparation (2018)
[3] Cardona, G., Llabrés, M., Rosselló, F., Valiente, G.: Metrics for phylogenetic networks I: Generalizations of the Robinson-Foulds metric. IEEE/ACM Trans. Comput. Biol. Bioinform. 6(1), 46–61 (2009)
[4] Cardona, G., Rosselló, F., Valiente, G.: Comparison of tree-child phylogenetic networks. IEEE/ACM Trans. Comput. Biol. Bioinform. 6(4), 552–569 (2009)
[5] Czabarka, É., Erdős, P.L., Johnson, V., Moulton, V.: Generating functions for multi-labeled trees. Discrete Applied Math. 161, 107–117 (2013)
[6] Felsenstein, J.: Inferring Phylogenies. Sinauer Associates, Sunderland, MA, USA (2004)
[7] Francis, A.R., Steel, M.: Which phylogenetic networks are merely trees with additional arcs? Syst. Biol. 64(5), 768–777 (2015)
[8] Fuchs, M., Gittenberger, B., Mansouri, M.: Counting phylogenetic networks with few reticulation vertices: Tree-child and normal networks. arXiv preprint arXiv:1803.11325 (2018)
[9] Gambette, P., Gunawan, A.D., Labarre, A., Vialette, S., Zhang, L.: Locating a tree in a phylogenetic network in quadratic time. In: Proc. Int’l Confer. on Res. in Comput. Mol. Biol. (RECOMB), pp. 96–107. Springer, New York (2015)
[10] Gunawan, A.D., DasGupta, B., Zhang, L.: A decomposition theorem and two algorithms for reticulation-visible networks. Inform. Comput. 252, 161–175 (2017)
[11] Gunawan, A.D., Yan, H., Zhang, L.: Compression of phylogenetic networks and algorithm for the tree containment problem. J. Comput. Biol. (in press). arXiv preprint arXiv:1806.07625 (2018)
[12] Gusfield, D.: ReCombinatorics: the Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks. MIT Press, Boston, USA (2014)
[13] Gusfield, D., Eddhu, S., Langley, C.: The fine structure of galls in phylogenetic networks. INFORMS J. Comput. 16(4), 459–469 (2004)
[14] Huber, K.T., Moulton, V.: Phylogenetic networks from multi-labelled trees. J. Math. Biol. 52, 613–632 (2006)
[15] Huson, D.H., Klöpper, T.H.: Beyond galled trees–decomposition and computation of galled networks. In: Proc. Int’l Confer. on Res. in Comput. Mol. Biol. (RECOMB), pp. 211–225. Springer, New York, USA (2007)
[16] Huson, D.H., Rupp, R., Berry, V., Gambette, P., Paul, C.: Computing galled networks from real data. Bioinformatics 25(12), i85–i93 (2009)
[17] Huson, D.H., Rupp, R., Scornavacca, C.: Phylogenetic networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge, UK (2010)
[18] Jain, R., Rivera, M.C., Lake, J.A.: Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Nat’l Acad. Sc. U.S.A. 96, 3801–3806 (1999)
[19] McDiarmid, C., Semple, C., Welsh, D.: Counting phylogenetic networks. Annals Combin. 19, 205–224 (2015)
[20] Semple, C., Steel, M.: Unicyclic networks: compatibility and enumeration. IEEE/ACM Trans. Comput. Biol. Bioinform. 3, 84–91 (2006)
[21] Steel, M.: Phylogeny: Discrete and Random Processes in Evolution. SIAM, Philadelphia, USA (2016)
[22] Wang, L., Zhang, K., Zhang, L.: Perfect phylogenetic networks with recombination. J. Comput. Biol. 8(1), 69–78 (2001)
[23] Yan, H., Gunawan, A.D., Zhang, L.: S-cluster++: a fast program for solving the cluster containment problem for phylogenetic networks. Bioinformatics 34(17), i680–i686 (2018)
[24] Zhang, L.: On tree-based phylogenetic networks. J. Comput. Biol. 23(7), 553–565 (2016)
[25] Zhang, L.: Clusters, trees and phylogenetic network classes. In T. Warnow (ed.): Bioinformatics and Phylogenetics: Seminal Contributions of Bernard Moret, Springer, New York (2019)

Figure S1: The 34 topological structures of the 168 1-galled networks over three taxa.

Figure S2: The 16 topological structures of the 72 galled networks over three taxa that have two tree-components.
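The counts quoted in the captions of Figures S1 and S2 (168 and 72) and the first entries of Table 2 can be cross-checked against Theorems 4 and 5. The Python sketch below is one possible way to do so, not the procedure used to produce the tables. Instead of listing the trees in $\mathcal{A}_n$ explicitly, it sums the product in Eqn. (12) over all trees by choosing which leaves are children of the root and partitioning the remaining leaves among the non-leaf children; this reorganization, and every function name, is an assumption of this illustration.

```python
from functools import lru_cache
from math import comb

def odd_double_factorial(m):
    """m!! for odd m, with m!! = 1 whenever m <= 1 (so (-1)!! = 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

@lru_cache(maxsize=None)
def N(i, k):
    """N_k^{(i)} of Eqn. (10), for k >= 2 and 0 <= i < k."""
    if i == 0:
        return odd_double_factorial(2 * k - 5)
    j = i - 1
    corr = sum(comb(j, d) * odd_double_factorial(2 * d - 1)
               * (N(j - d, k - d) - N(j - d, k - d + 1)) for d in range(1, j + 1))
    assert corr % 2 == 0
    return (k + j - 2) * N(j, k) + (j * N(j - 1, k) if j else 0) + corr // 2

def one_galled(k):
    """G_1(k) of Theorem 4."""
    return sum(comb(k, i) * N(i, k + 1) for i in range(k + 1))

def component_choices(c_lf, c_nlf):
    """Inner sum of Eqn. (12) for a node with c_lf leaf and c_nlf non-leaf children."""
    c = c_lf + c_nlf
    return sum(comb(c_lf, c - j) * N(j, c + 1) for j in range(c_nlf, c + 1))

@lru_cache(maxsize=None)
def block_sum(r, m):
    """Sum, over set partitions of r labeled leaves into m blocks of size >= 2,
    of the product of subtree_sum(block size)."""
    if r == 0 or m == 0:
        return 1 if r == 0 and m == 0 else 0
    # b is the size of the block containing the smallest leaf; the other
    # m - 1 blocks need at least 2 leaves each.
    return sum(comb(r - 1, b - 1) * subtree_sum(b) * block_sum(r - b, m - 1)
               for b in range(2, r - 2 * (m - 1) + 1))

@lru_cache(maxsize=None)
def subtree_sum(n):
    """Right-hand side of Eqn. (12) for n >= 2 taxa."""
    total = 0
    for s in range(n + 1):                      # s leaf children of the root
        for m in range((n - s) // 2 + 1):       # m non-leaf children of the root
            if s + m >= 2:
                total += comb(n, s) * component_choices(s, m) * block_sum(n - s, m)
    return total

def galled(n):
    """G(n) of Theorem 5; a single taxon gives the trivial network."""
    return 1 if n == 1 else subtree_sum(n)

first_values = [galled(n) for n in range(1, 5)]   # [1, 6, 240, 20502]
two_component = galled(3) - one_galled(3)         # 72
```

The bound on `b` in `block_sum` also keeps the mutual recursion with `subtree_sum` well founded, because a non-leaf child of the root never receives all n leaves.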
