
On the Approximation of Computing Evolutionary Trees

2005

Given a set of leaf-labelled trees with identical leaf sets, the well-known MAST problem consists of finding a subtree homeomorphically included in all input trees and with the largest number of leaves. MAST and its variant called MCT are of particular interest in computational biology. This paper presents positive and negative results on the approximation of MAST, MCT and their complement versions, denoted CMAST and CMCT. For CMAST and CMCT on rooted trees we give 3-approximation algorithms achieving significantly lower running times than those previously known. In particular, the algorithm for CMAST runs in linear time. The approximation threshold for CMAST, resp. CMCT, is shown to be the same whether collections of rooted trees or of unrooted trees are considered. Moreover, hardness of approximation results are stated for CMAST, CMCT and MCT on a small number of trees, and for MCT on an unbounded number of trees.

Vincent Berry⋆, Sylvain Guillemot, François Nicolas, and Christophe Paul
Département Informatique, L.I.R.M.M. - C.N.R.S.
161 rue Ada, 34392 Montpellier Cedex 5
{vberry,sguillem,nicolas,paul}@lirmm.fr

⋆ Supported by the Act. Incit. Inf.-Math.-Phys. en Biol. Mol. [ACI IMP-Bio] and the Act. Inter. Incit. Région. [BIOSTIC-LR].

To cite this version: Vincent Berry, N. Francois, Sylvain Guillemot, Christophe Paul. On the Approximation of Computing Evolutionary Trees. Lusheng Wang. COCOON’05: 11th Annual International Conference on Computing and Combinatorics, 2005, pp.115-125, 2005, Lecture Notes in Computer Science. <lirmm-00106451>
HAL Id: lirmm-00106451, http://hal-lirmm.ccsd.cnrs.fr/lirmm-00106451, submitted on 16 Oct 2006.
L. Wang (Ed.): COCOON 2005, LNCS 3595, pp. 115–125, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

Given a set of leaf-labelled trees with identical leaf sets, the well-known Maximum Agreement SubTree problem (MAST) consists of finding a subtree homeomorphically included in all input trees and with the largest number of leaves [2, 7, 10, 13, 21, 22]. In other words, this involves selecting a largest set of input leaves such that the input trees are isomorphic, i.e. agree with each other, when restricted to these leaves. This problem arises in various areas including phylogenetics, which is concerned with evolutionary trees, i.e. trees representing the evolutionary history of a set of species: the leaves of the tree are in one-to-one correspondence with the species under study, and the branching pattern of the tree describes the way in which speciation events lead from ancestral species to more recent ones.

In phylogenetics, the MAST problem is used to reach different practical goals: to obtain a consensus of several trees inferred by different methods, or that are optimal for a given criterion; to measure the similarity between different evolutionary scenarios; to identify horizontal transfers of genes. Recently, MAST has been extended to the context of supertrees, where input trees can have different sets of leaves [3].

The Maximum Compatible Tree problem (MCT) is a variant of MAST that is of particular interest in phylogenetics when the input trees are not binary [11, 12, 15, 17].
MCT requires that the selected subtrees of the input trees are compatible, i.e. that the groups of leaves they define can all be combined in a single tree. This is less strict than requiring the isomorphism of the subtrees, hence usually leads to selecting a larger set of leaves than allowed by MAST.

We give below a brief overview of the literature, specifying how the results presented in this paper relate to previously known results. The MAST problem is NP-hard on three rooted trees of unbounded degree [2], and MCT is NP-hard on two rooted trees if one of them is of unbounded degree [17]. Subquadratic algorithms have been proposed for MAST on two rooted n-leaf trees [7, 20, 21]. When dealing with k rooted trees, MAST can be solved in O(n^d + kn^3) time provided that the degree of one of the input trees is bounded by d [2, 6, 10], and MCT can be solved in O(2^{2kd} n^k) time provided that all input trees have degree bounded by d [12]. Both problems can be solved in O(min{3^p kn, 2.27^p + kn^3}) time, i.e. are FPT in p, where p is the smallest number of leaves to be removed from the input set of leaves so that the input trees agree [3].

More generally, when the previously mentioned parameters are unbounded, several works (starting from [2]) propose 3-approximation algorithms for CMAST and CMCT, where CMAST, resp. CMCT, is the complement version of MAST, resp. MCT, i.e. aims at selecting the smallest number of leaves to be removed from the input trees in order to obtain their agreement. In practice, input trees usually agree on the position of most leaves, thus approximating CMAST and CMCT is more relevant than approximating MAST and MCT. For CMCT, [11] propose an O(k^2 n^2) time 3-approximation algorithm. We propose here an O(n^2 + kn) time algorithm. For CMAST, [3] propose an O(kn^3) time algorithm. Here we improve on this result by providing a linear time, i.e. O(kn), algorithm. We also state that the rooted and unrooted versions of CMAST (and CMCT) have the same approximation threshold.
Let k-MAST, resp. k-MCT, resp. k-CMAST, resp. k-CMCT, denote the particular case of MAST, resp. MCT, resp. CMAST, resp. CMCT, dealing with k rooted trees. Negative results for these problems are as follows:

• For all ε > 0, the general MAST problem is not approximable within n^{1−ε} unless NP = ZPP [5]. A similar result is obtained here for MCT.
• It is also stated here that 3-CMAST and 2-CMCT are APX-hard, i.e. that they do not admit a PTAS unless P = NP.
• For all δ < 1, 3-MAST is not approximable within 2^{log^δ n} unless NP ⊆ DTIME(2^{polylog n}) [17]. The same result is obtained here for 2-MCT.

2 Definitions and Preliminaries

A rooted evolutionary tree is a tree whose leaf set L(T) is in bijection with a label set, and whose internal nodes have at least two children. Hereafter, we only consider such trees and identify leaves with their respective labels. The size of a tree T (denoted #T) is the number of its leaves: #T = #L(T). Let u be a node of a tree T; S(u) stands for the subtree rooted at u, L(u) for the leaves of this subtree, and d+(u) for the number of children of u. For a set of leaves L ⊆ L(T), lca_T(L) denotes the lowest common ancestor of the leaves of L in T. Given a set L of labels and a tree T, the restriction of T to L, denoted T|L, is the tree homeomorphic to the smallest subtree of T connecting the leaves of L.

Lemma 1. Let T1 and T2 be two isomorphic trees with leaf set L, and let L′ ⊆ L. Then T1|L′ is isomorphic to T2|L′.

Given a collection 𝒯 = {T1, T2, ..., Tk} of trees on a same leaf set L of cardinality n, an agreement subtree of 𝒯 is any tree T with leaves in L s.t. ∀Ti ∈ 𝒯, T = Ti|L(T). The MAST problem consists in finding an agreement subtree of 𝒯 with the largest number of leaves. We denote such a tree by MAST(𝒯).

A tree T refines a tree T′ if T′ can be obtained by collapsing certain edges of T (i.e. merging their extremities).
More generally, a tree T refines a collection 𝒯 whenever T refines all Ti's in 𝒯. Given a collection 𝒯 of k trees with identical leaf set L of cardinality n, a tree T with leaves in L is compatible with 𝒯 iff ∀Ti ∈ 𝒯, T refines Ti|L(T). If there is a tree T compatible with 𝒯 s.t. L(T) = L, i.e. one that is a common refinement of all trees in 𝒯, then the collection 𝒯 is compatible. In this case, a minimum refinement T of 𝒯 (i.e. collapsing a minimum number of edges) is a tree refining 𝒯 s.t. any tree T′ refining 𝒯 also refines T. Collections of trees considered in practice are usually not compatible, motivating the MCT problem, which aims at finding a tree, denoted MCT(𝒯), compatible with 𝒯 and having the largest number of leaves. Note that MCT is equivalent to MAST when the input trees are binary.

For any three leaves a, b, c in a tree T, there are only three possible binary shapes for T|{a, b, c}, denoted a|bc, resp. b|ac, resp. c|ab, depending on their innermost grouping of leaves (bc, resp. ac, resp. ab). These binary trees on 3 leaves are called rooted triples. Alternatively, T|{a, b, c} can be a fan, i.e. a unique internal node connected to the three leaves. A fan is denoted {a, b, c}. We define rt(T), resp. f(T), as the set of rooted triples, resp. fans, induced by the leaves of a tree T.

Given a collection 𝒯 = {T1, T2, ..., Tk} of trees with leaf set L, a set {a, b, c} ⊆ L is a hard conflict between (trees of) 𝒯 whenever ∃Ti, Tj ∈ 𝒯 s.t. a|bc ∈ rt(Ti) and b|ac ∈ rt(Tj). The set {a, b, c} is a soft conflict between (trees of) 𝒯 whenever ∃Ti, Tj ∈ 𝒯 s.t. a|bc ∈ rt(Ti) and {a, b, c} ∈ f(Tj).

Lemma 2 ([2, 3, 12]). Two trees with the same leaf set are isomorphic iff there is neither a hard nor a soft conflict between them. A collection 𝒯 of trees with the same leaf set is compatible iff there is no hard conflict between 𝒯.

Definition 1. Given a set of conflicts C, let L(C) denote the leaves appearing in C. Given a collection 𝒯 with conflicts, an hs-peacemaker, resp.
h-peacemaker, of 𝒯 is any set C of disjoint hard and soft, resp. only hard, conflicts between 𝒯 s.t. {Ti|(L − L(C)) : Ti ∈ 𝒯} is a collection of isomorphic trees, resp. compatible trees. In other words, removing L(C) from the input trees removes all conflicts, resp. all hard conflicts, between them.

3 Approximation Algorithms

3.1 An O(n^2 + kn) Time 3-Approximation Algorithm for CMCT

Let 𝒯 be a collection of trees on an n-leaf set L. It is well-known that 𝒯 is compatible iff every pair of trees in 𝒯 is compatible [8]. Moreover,

Lemma 3 ([4]). 𝒯 is a compatible collection of trees iff there exists a minimum refinement T of 𝒯 and rt(T) = ⋃_{Ti ∈ 𝒯} rt(Ti).

If 𝒯 is compatible, a minimum refinement T of 𝒯 is a solution for MCT, as L(T) = L. From Lemma 2, one can obtain T by first computing a minimum refinement T_{1,2} of two trees T1, T2 ∈ 𝒯, and then iterating on 𝒯 − {T1, T2} ∪ {T_{1,2}} until only one tree remains, which is the sought tree T. If 𝒯 is not compatible, then we apply the following:

Lemma 4 ([2, 3, 11]). Let 𝒯 = {T1, T2, ..., Tk} be a collection of trees on a leaf set L and let C be an hs-peacemaker, resp. an h-peacemaker, of 𝒯. Then any tree in 𝒯|(L − L(C)) is a 3-approximation for CMAST, resp. any refinement of 𝒯|(L − L(C)) is a 3-approximation for CMCT, on 𝒯.

Given a pair of trees, [4] give an O(n) time algorithm that either returns a minimum refinement when the trees are compatible, or otherwise identifies a hard conflict C between them. Thus, from Lemma 4, the procedure sketched above for a compatible collection can be adapted to obtain a 3-approximation of CMCT for a non-compatible collection 𝒯. Apply the algorithm of [4] to a pair of trees {T1, T2} ⊆ 𝒯 to obtain either their minimum refinement T_{1,2} or a hard conflict C. In the latter case, remove C from all input trees and iterate. In the former case, iterate on 𝒯 − {T1, T2} ∪ {T_{1,2}}.
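As an illustration of Lemma 4 only, the following Python sketch builds an hs-peacemaker by brute force: it repeatedly looks for a 3-leaf set on which the input trees disagree and removes its leaves. It is not the algorithm of [4] nor the linear-time procedure of this paper; it enumerates all triples and is therefore far slower. The nested-tuple tree encoding and the names `leaves`, `restrict`, `shape` and `hs_peacemaker` are assumptions made for this example.

```python
from itertools import combinations

# Assumed encoding: a leaf is a label (e.g. an int), an internal node is a
# tuple of children. Labels are assumed to be neither tuples nor None.

def leaves(t):
    """Return the set of leaf labels of tree t."""
    if not isinstance(t, tuple):
        return {t}
    out = set()
    for child in t:
        out |= leaves(child)
    return out

def restrict(t, keep):
    """Return the restriction T|keep, or None if no leaf of t is kept."""
    if not isinstance(t, tuple):
        return t if t in keep else None
    kids = [r for r in (restrict(c, keep) for c in t) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # suppress a node left with a single child
    return tuple(kids)

def shape(t, a, b, c):
    """Shape of T|{a,b,c}: the isolated leaf x of the rooted triple x|yz,
    or None when the three leaves form a fan."""
    r = restrict(t, {a, b, c})
    if len(r) == 3:
        return None
    for child in r:
        if not isinstance(child, tuple):
            return child

def hs_peacemaker(trees):
    """Greedily collect disjoint (hard or soft) conflicts until the trees,
    restricted to the remaining leaves, agree on every 3-leaf set."""
    alive = sorted(leaves(trees[0]))
    removed = []
    found = True
    while found:
        found = False
        for a, b, c in combinations(alive, 3):
            if len({shape(t, a, b, c) for t in trees}) > 1:
                removed.append((a, b, c))
                alive = [x for x in alive if x not in (a, b, c)]
                found = True
                break
    return removed, alive

T1 = ((1, 2), (3, 4))
T2 = ((1, 3), (2, 4))
conflicts, kept = hs_peacemaker([T1, T2])
print(conflicts, kept)
```

By Lemma 2 the trees restricted to `kept` are isomorphic, and by Lemma 4 the number of removed leaves is at most three times the optimum of CMAST on the input.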
When 𝒯 is reduced to a single tree, O(k + n) calls to the algorithm of [4] have been issued and the resulting set C of removed conflicts is an h-peacemaker. Hence:

Theorem 1. The CMCT problem on a collection of k rooted trees on a same n-leaf set can be 3-approximated in O(n^2 + kn) time.

3.2 A Linear Time 3-Approximation Algorithm for CMAST

W.l.o.g., this section considers input trees on a same n-leaf set labelled by the positive integers 1, 2, ..., n. First consider collections 𝒯 of two trees. The following characterization of isomorphic trees is the basis of our algorithm.

Lemma 5 ([4]). Two trees Ti and Tj are isomorphic iff rt(Ti) = rt(Tj) and f(Ti) = f(Tj).

The definition of MAST(𝒯) is independent of the order of the children of nodes in trees. However, to efficiently compute an approximation of MAST(𝒯), we consider that T1 and T2 are ordered. Ordering a tree T consists in totally ordering the children of every node in T. This uniquely defines a left-right order π_T on the leaves L of T. Given an arbitrary ordering of T1, the approximation algorithm first tries to order T2 accordingly. In the following, π1, resp. π2, stands for π_{T1}, resp. π_{T2}; and π2(i) stands for the i-th leaf in π2. W.l.o.g., we also assume π1 = 1 ... n.

Definition 2. Let π be an order on a set L. A subset S of L is an interval of π whenever the elements of S occur consecutively in π (but not necessarily in the same order). A tree T with leaf set L is embeddable in an order π on L whenever T can be ordered s.t. π_T = π.

Lemma 6. Let T be a tree with leaf set L and π be an arbitrary order of L. Then, T is embeddable in π iff for any node u of T, L(u) is an interval of π.

Proposition 1. Let T be a tree and π be an order on its leaves. Testing whether T is embeddable in π costs O(n) time. In the positive case, ordering T such that π_T = π can be done in O(n) time.
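The criterion of Lemma 6 can be illustrated by a short Python sketch that checks, for every node u, whether L(u) is an interval of π, and reorders the children when it is. This is a simplified illustration and not the O(n) procedure of Proposition 1: it sorts the children of each node, so it runs in O(n log n) time. The nested-tuple tree encoding and the name `order_tree` are assumptions made for this example, and π is assumed to list every leaf of the tree exactly once.

```python
def order_tree(t, pi):
    """Return t with children reordered so that its left-right leaf order is
    pi, or None when t is not embeddable in pi.

    Assumed encoding: a leaf is a label, an internal node is a tuple of
    children.
    """
    pos = {leaf: i for i, leaf in enumerate(pi)}

    def visit(u):
        # Returns (ordered subtree, smallest position, largest position,
        # number of leaves), or None if some node below u fails the test.
        if not isinstance(u, tuple):
            return u, pos[u], pos[u], 1
        parts = []
        for child in u:
            r = visit(child)
            if r is None:
                return None
            parts.append(r)
        lo = min(p[1] for p in parts)
        hi = max(p[2] for p in parts)
        size = sum(p[3] for p in parts)
        if hi - lo + 1 != size:
            return None  # L(u) is not an interval of pi (Lemma 6)
        parts.sort(key=lambda p: p[1])  # children by their smallest leaf
        return tuple(p[0] for p in parts), lo, hi, size

    r = visit(t)
    return None if r is None else r[0]

print(order_tree(((3, 1), (4, 2)), [1, 3, 2, 4]))
print(order_tree(((3, 1), (4, 2)), [1, 2, 3, 4]))
```

The second call fails because the leaves {1, 3} of the first child are not consecutive in the order 1, 2, 3, 4.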
The running time stated in this proposition is achieved by performing bottom-up walks on disjoint paths in T, as described by Algorithm 1. For a node u in a tree, let m(u) and M(u) resp. denote the smallest and largest leaf of L(u) in π. Assume the children of any non-leaf node v ∈ T are originally stored in a doubly-linked list lc(v), which has to be ordered into a list lc′(v) so that π_T|L(v) = π|L(v). The walk started at the i-th leaf climbs only while the current node has that leaf as its smallest leaf, so that every node is appended to the list of its parent exactly once.

Algorithm 1: TreeOrder(T, π)
    for any node u in T do lc′(u) ← ∅;
    for i = 1 to n do
        let u be the leaf s.t. u = π^{−1}(i);
        repeat
            let v be the parent node of u in T;
            remove u from lc(v) and put it at the end of lc′(v);
            u ← v;
        until i ≠ m(u) or u is the root;

Due to the existence of conflicting triples, two arbitrary trees T1 and T2 with the same leaf set L may not be embeddable in a common order of L. If so, we can however show the following:

Proposition 2. Let T1, T2 be trees with leaf set L = {1, ..., n}. In time O(n) it is possible to identify a set C of disjoint conflicts between T1 and T2 s.t. T2|(L − L(C)) is embeddable in π1|(L − L(C)).

Below is given a sketch of the proof for this proposition. Let u be a node in a tree T with leaf set L and π be an arbitrary order on L. If an element x ∈ L − L(u) is s.t. m(u) <_π x <_π M(u), then prev_π(x, u), resp. next_π(x, u), stands for the maximum, resp. minimum, element of L(u) w.r.t. π that is smaller, resp. larger, than x.

Lemma 7. Let T1, T2 be trees on a leaf set L ⊆ {1, ..., n} and let {a, b, c} ⊆ L. If both a <_{π1} b <_{π1} c and ac|b ∈ rt(T2), then {a, b, c} is a conflict between T1 and T2. In particular, for a node u of T2 and a leaf x ∉ L(u) s.t. m(u) <_{π1} x <_{π1} M(u), {prev_{π1}(x, u), x, next_{π1}(x, u)} is a conflict between T1 and T2.

This lemma guides the search of T2 to remove leaves (in T2 and T1) forming a set of disjoint conflicts C s.t. for any node u of T2|(L − L(C)), L(u) is an interval of leaves in π1|(L − L(C)). Such a node u is then said to be full.
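The second part of Lemma 7 can be illustrated by a small Python sketch that, given the order π1, the leaf set L(u) of a node u of T2, and a leaf x lying strictly between m(u) and M(u) without belonging to L(u), returns the conflict {prev(x, u), x, next(x, u)}. The function name `lemma7_conflict` and the list/set representation are assumptions made for this example; the linear scans used here ignore the interval list that the actual algorithm maintains for efficiency.

```python
def lemma7_conflict(pi1, leaves_u, x):
    """Return (prev, x, next) as in Lemma 7, or None when x is in L(u) or
    does not lie strictly between m(u) and M(u) in pi1.

    pi1      -- list giving the left-right leaf order of T1
    leaves_u -- set of leaves L(u) of a node u of T2
    x        -- a leaf of the common leaf set
    """
    pos = {leaf: i for i, leaf in enumerate(pi1)}
    before = [y for y in leaves_u if pos[y] < pos[x]]
    after = [y for y in leaves_u if pos[y] > pos[x]]
    if x in leaves_u or not before or not after:
        return None
    prev = max(before, key=pos.get)  # closest leaf of L(u) left of x
    nxt = min(after, key=pos.get)    # closest leaf of L(u) right of x
    return (prev, x, nxt)

print(lemma7_conflict([1, 2, 3, 4, 5], {1, 2, 5}, 3))
```

In T2 the two outer leaves of the returned triple are grouped below u while x is outside, whereas no ordering of T1 compatible with π1 can group them apart from x, which is why the three leaves form a conflict.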
When all nodes of the resulting T2 are full, Lemma 6 ensures that T2 is embeddable in the left-right order of the tree T1|(L − L(C)). Nodes of T2 are processed in post-order, such that the children of a node u are known to be full when u is processed. For efficiency reasons, a list LI of disjoint intervals of π1 is also maintained, sorted w.r.t. π1. LI is initially composed of the unit intervals ({1}, ..., {n}) corresponding to the leaves of T2. Then intervals of LI are merged or removed while processing nodes of T2 so as to maintain the following invariant:

Invariant 1. Any interval of the list LI contains the leaf set L(u) of some node u of T2 that is full w.r.t. π1|(L − L(C)).

When a non-full node u is processed in the traversal of T2, this invariant, together with pointers from each child of u to the corresponding elements ordered in LI, enables us (according to Lemma 7) to efficiently identify conflicts whose removal turns u into a full node. Note that Invariant 1 is robust under the removal of a leaf in L(v) for any processed node v.

Lemma 8. Let T1, T2 be two trees with leaf set in L and u be the current node of T2 to be processed by the bottom-up algorithm (i.e. the children of u are full w.r.t. π1). Then a set hs(u) of disjoint conflicts between {T1, T2} s.t. u is full w.r.t. π1|(L − L(hs(u))) is found in time O(d+(u) + |hs(u)|).

Proposition 2 follows from Lemma 7, Invariant 1 and Lemma 8.

Given two arbitrary trees T1, T2, Propositions 1 and 2 show that, in linear time, disjoint conflicts can be removed and children of nodes in T2 ordered s.t. the two resulting trees have the same left-right order on their leaves. Thus, from now on, assume that π1 = π2. For convenience, even if some leaves have been removed, we write π1 = 1 ... n. Even if T1 and T2 have the same left-right order on their leaves, they may still host conflicting triples. However, let us show that a post-order search of T1 (or equiv. T2) is sufficient to remove such conflicts.
Definition 3. Let u be a node in a tree T. Then rt(u) is the subset of triples x|yz ∈ rt(T) s.t. #({x, y, z} ∩ L(u)) ≥ 2, and f(u) is the set of fans {x, y, z} ∈ f(T) s.t. {x, y, z} ⊆ L(u). Define a node u in tree T1 to be valid w.r.t. tree T2 if both rt(u) ⊆ rt(T2) and f(u) ⊆ f(T2) hold.

Note that if r1 is the root node of tree T1, then rt(r1) = rt(T1) and f(r1) = f(T1). Moreover, given a tree T2 s.t. L(T2) = L(T1), the validity of r1 w.r.t. T2 implies that T1 and T2 are isomorphic, as any 3-leaf set is either a rooted triple or a fan of both trees. The next lemma is the basis of a recursive process to obtain the validity of r1 w.r.t. T2.

Lemma 9. Let u be a node of T1 whose children, denoted c_1, ..., c_{d+(u)}, are all valid. Let p(m(u)), resp. s(M(u)), be the leaf preceding m(u), resp. succeeding M(u), in π2 if it exists.
1. If {p(m(u))|m(u)M(u), s(M(u))|m(u)M(u)} ⊆ rt(T2), then rt(u) ⊆ rt(T2).
2. If u has only two children, then f(u) ⊆ f(T2).
3. If u has at least three children and for any i ∈ {1, 2, ..., d+(u) − 2}, {m(c_i), m(c_{i+1}), m(c_{i+2})} ∈ f(T2), then f(u) ⊆ f(T2).

Lemma 9 implies that if every node u ∈ T1 is processed after its children, examining only O(d+(u)) 3-leaf sets is enough to know whether a node u ∈ T1 is already valid. When a conflict is encountered during this examination, its leaves are removed from the trees. Indeed, thanks to Lemma 1, removing a leaf in S(u) does not change the pre-established validity of inner nodes of S(u). Thus, if c(u) denotes the number of such encountered conflicts, ensuring the validity of u involves looking at O(d+(u) + c(u)) 3-leaf sets. See Algorithm 2 for a complete description of the procedure. Note that persistent dummy leaves can be artificially added at the beginning and end of π1 and π2 s.t. p(m(u)) and s(M(u)) always exist for any processed node u.
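The membership tests used in Lemma 9, such as p(m(u))|m(u)M(u) ∈ rt(T2) or {m(c_i), m(c_{i+1}), m(c_{i+2})} ∈ f(T2), reduce to comparing lowest common ancestors in T2: x|yz is a rooted triple of a tree exactly when lca(y, z) lies strictly below lca(x, y) = lca(x, z), and {x, y, z} is a fan when the three lowest common ancestors coincide. The Python sketch below illustrates this with a naive parent-pointer LCA; a constant-time LCA structure would be needed to match the stated running time. The nested-tuple encoding and the names `index_tree`, `lca` and `triple_shape` are assumptions made for this example.

```python
def index_tree(t):
    """Number the nodes of t and return (parent, depth, leaf_node) maps.

    Assumed encoding: a leaf is a label, an internal node is a tuple of
    children. leaf_node maps each leaf label to its node number.
    """
    parent, depth, leaf_node = {}, {}, {}
    counter = [0]

    def visit(u, par, d):
        uid = counter[0]
        counter[0] += 1
        parent[uid] = par
        depth[uid] = d
        if isinstance(u, tuple):
            for child in u:
                visit(child, uid, d + 1)
        else:
            leaf_node[u] = uid

    visit(t, None, 0)
    return parent, depth, leaf_node

def lca(parent, depth, u, v):
    """Naive lowest common ancestor: lift the deeper node until both meet."""
    while u != v:
        if depth[u] >= depth[v]:
            u = parent[u]
        else:
            v = parent[v]
    return u

def triple_shape(index, x, y, z):
    """Return the isolated leaf of the rooted triple induced by {x, y, z},
    or None when the three leaves induce a fan."""
    parent, depth, leaf_node = index
    a, b, c = leaf_node[x], leaf_node[y], leaf_node[z]
    dxy = depth[lca(parent, depth, a, b)]
    dxz = depth[lca(parent, depth, a, c)]
    dyz = depth[lca(parent, depth, b, c)]
    if dxy == dxz == dyz:
        return None       # fan {x, y, z}
    if dyz > dxy:
        return x          # x|yz
    if dxz > dxy:
        return y          # y|xz
    return z              # z|xy

idx = index_tree(((1, 2), (3, (4, 5))))
print(triple_shape(idx, 1, 2, 3), triple_shape(idx, 3, 4, 5))
```

With this helper, condition 1 of Lemma 9 for a node u amounts to checking that `triple_shape(idx2, p, m, M) == p` and `triple_shape(idx2, s, m, M) == s`, where `idx2` indexes T2.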
Processing the whole tree T1 globally involves O(n) 3-leaf sets, as Σ_{u∈T1} c(u) = O(n) and Σ_{u∈T1} d+(u) = O(n). Provided that π1 is stored in a doubly-linked list; that symmetric pointers are maintained between a node u ∈ T1 to be processed and the two elements of π1 that are the leftmost and rightmost leaves of S(u); and that T2 is preprocessed so as to identify in O(1) the least common ancestor of any two of its nodes; then Algorithm 2 runs in linear time. Hence,

Theorem 2. The CMAST problem on a collection of k rooted trees with the same n-leaf set can be 3-approximated in O(kn) time.

The reader should notice that the above algorithms can be realized simultaneously by a single search of the tree. According to Proposition 1, Proposition 2 and Algorithm 2, the case k = 2 is solved in O(n) time. Handling a collection 𝒯 = {T1, T2, ..., Tk} of k > 2 trees is done as for the MCT problem (see Section 3.1), i.e. by successively considering pairs of trees in 𝒯. This procedure runs in O(kn) time and, from Lemma 4, provides a 3-approximation of CMAST for 𝒯.

Algorithm 2: AgreementSubtree(T1, T2)
    Input: two rooted trees s.t. π1 = π2
    for each node u in a post-order traversal of T1 do
        /* Ensures that rt(u) ⊆ rt(T2) */
        repeat
            m(u) ← leftmost leaf of S(u); M(u) ← rightmost leaf of S(u);
            p(m(u)) ← leaf preceding m(u) in π1; s(M(u)) ← leaf following M(u) in π1;
            if p(m(u))|m(u)M(u) ∉ rt(T2) then remove p(m(u)), m(u), M(u) from T1 and T2
            else if s(M(u))|m(u)M(u) ∉ rt(T2) then remove s(M(u)), m(u), M(u) from T1 and T2
        until {p(m(u))|m(u)M(u), s(M(u))|m(u)M(u)} ⊆ rt(T2) or d+(u) < 2;
        /* Ensures that f(u) ⊆ f(T2) */
        i ← 1;
        while d+(u) > 2 and i ≤ d+(u) − 2 do
            let c_1, c_2, ..., c_{d+(u)} be the children of u;
            if {m(c_i), m(c_{i+1}), m(c_{i+2})} ∈ f(T2) then i ← i + 1
            else remove m(c_i), m(c_{i+1}), m(c_{i+2}) from T1 and T2
    return T1

4 Inapproximability Results for MAST and MCT

In this section, we first state that the rooted and unrooted versions of CMAST (equiv. CMCT) have the same approximation threshold. Then we detail new negative results concerning the approximation of MCT, CMAST and CMCT.

4.1 Rooted and Unrooted Versions of CMAST (equiv. CMCT) Share the Same Approximation Threshold

Let ϕ(n, k) be a function in Ω(n × k).

Proposition 3. Let ρ ≥ 1 be a real constant. Assume there exists a ρ-approximation algorithm for CMAST, resp. CMCT, on rooted trees with O(ϕ(n, k)) running time. Then, there exists a ρ-approximation algorithm for CMAST, resp. CMCT, on unrooted trees with O(n × ϕ(n − 1, k)) running time.

Proposition 3 is implicitly used in [11] and is proved in the following way. Let U be a collection of unrooted trees. To ρ-approximate CMAST, resp. CMCT, on instance U, apply the hypothetical ρ-approximation algorithm to each collection obtained by rooting all trees in U at a same leaf. Then, return the best of the n computed solutions. Combining Theorem 2 and Proposition 3, resp. Theorem 1 and Proposition 3, we obtain that the unrooted version of CMAST, resp. CMCT, is 3-approximable in O(kn^2), resp. O(n^3 + kn^2), time. Using a simple padding argument yields the converse of Proposition 3:

Proposition 4. Let ρ ≥ 1 be a rational constant. Assume there exists a ρ-approximation algorithm for CMAST, resp. CMCT, on unrooted trees with O(ϕ(n, k)) running time. Then, there exists a ρ-approximation algorithm for CMAST, resp. CMCT, on rooted trees with O(ϕ(n + ⌈ρn⌉, k)) running time.

4.2 Hardness of Approximating CMAST on Three Trees

Theorem 3. The 3-CMAST problem is APX-hard.

Since 2-MAST (and thus, 2-CMAST) can be exactly solved in polynomial time [21], Theorem 3 is in some sense tight. Its proof relies on a careful reading of [17], which states that the general 3-MAST problem is APX-hard. In fact [17] proves that a restriction of 3-MAST to a certain set of instances is APX-hard.
CMAST is not considered in [17], but it is easy to see that for this particular set of instances, 3-MAST L-reduces to 3-CMAST.

4.3 Hardness of Approximating MCT and CMCT on Two Trees

In order to prove Theorems 5 (APX-hardness of 2-CMCT) and 6 (inapproximability of 2-MCT), we define an intermediate problem, called Maximum Star-Forest (MSF). Let G = (V, E) be a graph. A star-forest of G is a subset of E which does not contain any path of length 3. The MSF problem is: “given a graph G, find a star-forest of G that is of maximum cardinality”. For each integer ∆ ≥ 1, we denote by ∆-MSFB the restriction of MSF to bipartite input graphs having maximum degree at most ∆. The restriction of the Maximum Independent Set problem (MIS for short) to input graphs having maximum degree at most 3 is denoted 3-MIS. Note that 3-MIS is APX-complete [1].

Theorem 4. The 4-MSFB problem is APX-hard.

Proof (sketch). We use an L-reduction from 3-MIS to 4-MSFB relying on the following transformation. Let G = (V, E) be an instance of 3-MIS (i.e. a graph with maximum degree at most 3); we construct an instance G′ = (V′, E′) of 4-MSFB as follows.

V′ := V ∪ {γ_e : e ∈ E} ∪ {σ_v, τ_v : v ∈ V},
E′ := {{u, γ_e}, {γ_e, v} : e = {u, v} ∈ E} ∪ {{v, σ_v}, {σ_v, τ_v} : v ∈ V}.

Clearly, G′ can be obtained from G in polynomial time, and #V′ = m + 3n and #E′ = 2m + 2n, where n and m denote the cardinality of V and E resp. □

Theorem 4 leads to the following result:

Proposition 5. 2-MCT is APX-hard even if it is restricted to collections 𝒯 of two rooted trees satisfying #MCT(𝒯) ≥ (1/4) × n, where n denotes the size of each tree in 𝒯.

Proof (sketch). We use an L-reduction from 4-MSFB to 2-MCT relying on the following transformation. Let G = (V, E) be an instance of 4-MSFB. Since G is bipartite, there exist two independent sets I1 and I2 of G partitioning V. W.l.o.g., we can assume that G has no isolated vertex. We construct a collection 𝒯 = {T1, T2} of two rooted trees with leaf set E.
The root of Ti is denoted ri. For each v ∈ Ii, let Xv be the non-empty star-tree whose leaf set is the set of all edges of E admitting v as an extremity (a star-tree is a fan with an arbitrary number of leaves). The child subtrees of ri are the trees Xv with v ∈ Ii.

The transformation requires polynomial time and the size of the instance of 2-MCT is linear in the size of the instance of 4-MSFB. The correctness of the reduction follows by proving that for each subset F ⊆ E, F is a star-forest of G iff T1|F and T2|F are compatible. □

Proposition 5 yields the two main results of this section. On the one hand, we obtain:

Theorem 5. The 2-CMCT problem is APX-hard.

On the other hand, using the “self-improvement” technique of [17, 19], we deduce from Proposition 5 that 2-MCT is hard to approximate within a constant ratio:

Theorem 6. For any real constant δ < 1, the 2-MCT problem cannot be approximated within ratio 2^{log^δ n}, unless NP ⊆ DTIME(2^{polylog n}).

The analogue of Theorem 6 for 3-MAST is [17, Theorem 3].

4.4 Hardness of Approximating MCT on Unbounded Number of Trees

For the general MCT problem we can find non-approximability results stronger than Theorem 6. Approximating MCT on collections of n-leaf trees is at least as hard as approximating MIS on n-vertex graphs. The proof consists in an approximation preserving reduction from MIS to MCT, similar to the reduction from MIS to MAST described in [5]. Since MIS is very hard to approximate [16] (see also [9]), we obtain:

Theorem 7. For all real ε > 0, MCT is not approximable within ratio (1 + ε)n^{1−ε} unless NP = ZPP, resp. within ratio (1 + ε)n^{0.5−ε} unless P = NP.

Note that Theorem 7 still holds if MCT is restricted to collections of trees containing at least one binary tree. Note also that, using the approximating via partitioning paradigm [14], one can approximate MAST within n/log n [18]. This also holds for MCT.

References

1. P. Alimonti and V. Kann.
Some APX-completeness results for cubic graphs. Theor. Comput. Sci., 237(1–2):123–134, 2000.
2. A. Amir and D. Keselman. Maximum agreement subtree in a set of evolutionary trees: metrics and efficient algorithm. SIAM J. on Comput., 26(6):1656–1669, 1997.
3. V. Berry and F. Nicolas. Maximum agreement and compatible supertrees. In 15th Annual Symposium on Combinatorial Pattern Matching (CPM’04), volume 3109 of LNCS, pages 205–219, 2004.
4. V. Berry and F. Nicolas. Improved parametrized complexity of maximum agreement subtree and maximum compatible tree problems. IEEE Trans. on Comput. Biology and Bioinf., (to appear).
5. P. Bonizzoni, G. Della Vedova, and G. Mauri. Approximating the maximum isomorphic agreement subtree is hard. Int. J. of Found. of Comput. Sci., 11(4):579–590, 2000.
6. D. Bryant. Building trees, hunting for trees and comparing trees: theory and method in phylogenetic analysis. PhD thesis, University of Canterbury, Department of Mathematics, 1997.
7. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. An O(n log n) algorithm for the Maximum Agreement SubTree problem for binary trees. SIAM J. on Comput., 30(5):1385–1404, 2001.
8. G. F. Estabrook and F. R. McMorris. When is one estimate of evolutionary relationships a refinement of another? J. of Math. Biol., 10:367–373, 1980.
9. L. Engebretsen and J. Holmerin. Towards optimal lower bounds for clique and chromatic number. Theor. Comput. Sci., 299(1–3):537–584, 2003.
10. M. Farach, T. M. Przytycka, and M. Thorup. On the agreement of many trees. Inf. Proces. Letters, 55(6):297–301, 1995.
11. G. Ganapathy and T. J. Warnow. Approximating the complement of the maximum compatible subset of leaves of k trees. In 5th Int. Workshop on Approximation Algorithms for Combinatorial Optimization (APPROX’02), volume 2462 of LNCS, pages 122–134, 2002.
12. G. Ganapathysaravanabavan and T. J. Warnow.
Finding a maximum compatible tree for a bounded number of trees with bounded degree is solvable in polynomial time. In 1st Int. Workshop on Algorithms in Bioinformatics (WABI’01), volume 2149 of LNCS, pages 156–163, 2001.
13. A. Gupta and N. Nishimura. Finding largest subtrees and smallest supertrees. Algorithmica, 21(2):183–210, 1998.
14. M. M. Halldórsson. Approximations of weighted independent set and hereditary subset problems. J. of Graph Algor. and Appl., 4(1), 2000.
15. A. M. Hamel and M. A. Steel. Finding a maximum compatible tree is NP-hard for sequences and trees. Appl. Math. Letters, 9(2):55–59, 1996.
16. J. Håstad. Clique is hard to approximate within n^{1−ε}. Acta Math., 182:105–142, 1999.
17. J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing evolutionary trees. Disc. Appl. Math., 71(1–3):153–169, 1996.
18. J. Jansson, J. H.-K. Ng, K. Sadakane, and W.-K. Sung. Rooted maximum agreement supertrees. In 6th Latin American Symposium on Theoretical Informatics (LATIN’04), volume 2976 of LNCS, pages 499–508, 2004.
19. T. Jiang and M. Li. On the approximation of shortest common supersequences and longest common subsequences. SIAM J. on Comput., 24(5):1122–1139, 1995.
20. M.-Y. Kao, T. W. Lam, W.-K. Sung, and H.-F. Ting. A decomposition theorem for maximum weight bipartite matchings with applications to evolutionary trees. In 7th Annual European Symposium on Algorithms (ESA’99), volume 1643 of LNCS, pages 438–449, 1999.
21. M.-Y. Kao, T. W. Lam, W.-K. Sung, and H.-F. Ting. An even faster and more unifying algorithm for comparing trees via unbalanced bipartite matchings. J. of Algor., 40(2):212–233, 2001.
22. M. A. Steel and T. J. Warnow. Kaikoura tree theorems: Computing the maximum agreement subtree. Inf. Proces. Letters, 48(2):77–82, 1993.