On the Approximation of Computing Evolutionary Trees
Vincent Berry, N. Francois, Sylvain Guillemot, Christophe Paul
To cite this version:
Vincent Berry, N. Francois, Sylvain Guillemot, Christophe Paul. On the Approximation of
Computing Evolutionary Trees. Lusheng Wang. COCOON’05: 11th Annual International
Conference on Computing and Combinatorics, 2005, pp.115-125, 2005, Lecture Notes in Computer Science. <lirmm-00106451>
HAL Id: lirmm-00106451
http://hal-lirmm.ccsd.cnrs.fr/lirmm-00106451
Submitted on 16 Oct 2006
On the Approximation
of Computing Evolutionary Trees
Vincent Berry⋆ , Sylvain Guillemot, François Nicolas, and Christophe Paul
Département Informatique, L.I.R.M.M. - C.N.R.S.
161 rue Ada, 34392 Montpellier Cedex 5
{vberry,sguillem,nicolas,paul}@lirmm.fr
Abstract. Given a set of leaf-labelled trees with identical leaf sets, the
well-known MAST problem consists of finding a subtree homeomorphically included in all input trees and with the largest number of leaves.
MAST and its variant called MCT are of particular interest in computational biology. This paper presents positive and negative results on the
approximation of MAST, MCT and their complement versions, denoted
CMAST and CMCT.
For CMAST and CMCT on rooted trees we give 3-approximation algorithms achieving significantly lower running times than those previously
known. In particular, the algorithm for CMAST runs in linear time.
The approximation threshold for CMAST, resp. CMCT, is shown to be
the same whenever collections of rooted trees or of unrooted trees are
considered. Moreover, hardness of approximation results are stated for
CMAST, CMCT and MCT on a small number of trees, and for MCT on
an unbounded number of trees.
1 Introduction
Given a set of leaf-labelled trees with identical leaf sets, the well-known Maximum Agreement SubTree problem (MAST) consists of finding a subtree
homeomorphically included in all input trees and with the largest number of
leaves [2, 7, 10, 13, 21, 22]. In other words, this involves selecting a largest set of
input leaves such that the input trees are isomorphic, i.e. agree with each other,
when restricted to these leaves.
This problem arises in various areas including phylogenetics which is concerned with evolutionary trees, i.e. trees representing the evolutionary history
of a set of species: the leaves of the tree are in one-to-one correspondence with
species under study and the branching pattern of the tree describes the way
in which speciation events lead from ancestral species to more recent ones. In
phylogenetics, the MAST problem is used to reach different practical goals: to
obtain a consensus of several trees inferred by different methods, or that are
optimal for a given criterion; to measure the similarity between different evolutionary scenarios; to identify horizontal transfers of genes. Recently, MAST has
⋆
Supported by the Act. Incit. Inf.-Math.-Phys. en Biol. Mol. [ACI IMP-Bio] and the
Act. Inter. Incit. Région. [BIOSTIC-LR].
L. Wang (Ed.): COCOON 2005, LNCS 3595, pp. 115–125, 2005.
© Springer-Verlag Berlin Heidelberg 2005
been extended to the context of supertrees where input trees can have different
sets of leaves [3].
The Maximum Compatible Tree problem (MCT) is a variant of MAST
that is of particular interest in phylogenetics when the input trees are not binary [11, 12, 15, 17]. MCT requires that selected subtrees of the input trees are
compatible, i.e. that groups of leaves they define can all be combined in a same
tree. This is less strict than requiring the isomorphism of the subtrees, hence
usually leads to selecting a larger set of leaves than allowed by MAST.
We give below a brief overview of the literature, specifying how the results
presented in this paper relate to previously known results. The MAST problem is
NP-hard on three rooted trees of unbounded degree [2], and MCT is NP-hard on two rooted
trees if one of them is of unbounded degree [17]. Subquadratic algorithms have
been proposed for MAST on two rooted n-leaf trees [7, 20, 21]. When dealing
with k rooted trees, MAST can be solved in $O(n^d + kn^3)$ time provided that
the degree of one of the input trees is bounded by d [2, 6, 10], and MCT can
be solved in $O(2^{2kd} n^k)$ time provided that all input trees have degree bounded
by d [12]. Both problems can be solved in $O(\min\{3^p kn,\ 2.27^p + kn^3\})$ time, i.e.
are FPT in p, where p is the smallest number of leaves to be removed from the
input set of leaves so that the input trees agree [3].
More generally, when the previously mentioned parameters are unbounded,
several works (starting from [2]) propose 3-approximation algorithms for CMAST
and CMCT, where CMAST, resp. CMCT, is the complement version of MAST,
resp. MCT, i.e. aims at selecting the smallest number of leaves to be removed
from the input trees in order to obtain their agreement. In practice, input trees
usually agree on the position of most leaves, thus approximating CMAST and
CMCT is more relevant than approximating MAST and MCT. For CMCT,
[11] propose an $O(k^2 n^2)$ time 3-approximation algorithm. We propose here an
$O(n^2 + kn)$ time algorithm. For MAST, [3] propose an $O(kn^3)$ time algorithm.
Here we improve on this result by providing a linear time, i.e. O(kn), algorithm.
We also state that rooted and unrooted versions of CMAST (and CMCT) have
the same approximation threshold.
Let k-MAST, resp. k-MCT, resp. k-CMAST, resp. k-CMCT, denote the particular case of MAST, resp. MCT, resp. CMAST, resp. CMCT, dealing with k
rooted trees. Negative results for these problems are as follows:
• For all ε > 0, the general MAST problem is not approximable within $n^{1-\epsilon}$
unless NP = ZPP [5]. A similar result is obtained here for MCT.
• It is also stated here that 3-CMAST and 2-CMCT are APX-hard, i.e. that they
do not admit a PTAS unless P = NP.
• For all δ < 1, 3-MAST is not approximable within $2^{\log^{\delta} n}$ unless NP ⊆
DTIME[$2^{\mathrm{polylog}\, n}$] [17]. The same result is obtained here for 2-MCT.
2 Definitions and Preliminaries
A rooted evolutionary tree is a tree whose leaf set L(T ) is in bijection with a
label set, and whose internal nodes have at least two children. Hereafter, we only
consider such trees and identify leaves with their respective labels. The size of
a tree T (denoted #T ) is the number of its leaves: #T = #L(T ).
Let u be a node of a tree T; S(u) stands for the subtree rooted at u, L(u) for
the leaves of this subtree, and $d^+(u)$ for the number of children of u. For a set
of leaves L ⊆ L(T), $lca_T(L)$ denotes the lowest common ancestor of the leaves of L in
T. Given a set L of labels and a tree T, the restriction of T to L, denoted T|L,
is the tree homeomorphic to the smallest subtree of T connecting the leaves of L.
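The restriction T|L can be illustrated with a minimal Python sketch. The nested-tuple encoding of trees (a leaf is any non-tuple label, an internal node is a tuple of subtrees) and the function name `restrict` are assumptions made for this illustration, not notation from the paper.

```python
def restrict(tree, keep):
    """Return tree|keep, or None if no leaf of `tree` belongs to `keep`.

    Assumed encoding: a leaf is a non-tuple label, an internal node is a
    tuple of child subtrees.
    """
    if not isinstance(tree, tuple):              # leaf
        return tree if tree in keep else None
    kids = [r for r in (restrict(c, keep) for c in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:                           # suppress a node left with one child
        return kids[0]
    return tuple(kids)


T = ((1, 2), (3, (4, 5)))
print(restrict(T, {1, 4, 5}))
```

Suppressing nodes left with a single child is what makes the result homeomorphic to the connecting subtree rather than equal to it.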
Lemma 1. Let T1 and T2 be two isomorphic trees with leaf set L, and let L′ ⊆ L,
then T1 |L′ is isomorphic to T2 |L′ .
Given a collection $\mathcal{T}$ = {T1, T2, ..., Tk} of trees on a same leaf set L of
cardinality n, an agreement subtree of $\mathcal{T}$ is any tree T with leaves in L s.t.
∀Ti ∈ $\mathcal{T}$, T = Ti|L(T). The MAST problem consists in finding an agreement
subtree of $\mathcal{T}$ with the largest number of leaves. We denote by MAST($\mathcal{T}$) such a
tree.
A tree T refines a tree T′ if T′ can be obtained by collapsing certain edges of
T (i.e. merging their extremities). More generally, a tree T refines a collection $\mathcal{T}$
whenever T refines all Ti's in $\mathcal{T}$. Given a collection $\mathcal{T}$ of k trees with identical leaf
set L of cardinality n, a tree T with leaves in L is compatible with $\mathcal{T}$ iff ∀Ti ∈ $\mathcal{T}$,
T refines Ti|L(T). If there is a tree T compatible with $\mathcal{T}$ s.t. L(T) = L, i.e. that
is a common refinement of all trees in $\mathcal{T}$, then the collection $\mathcal{T}$ is compatible.
In this case, a minimum refinement T of $\mathcal{T}$ (i.e. collapsing a minimum number
of edges) is a tree s.t. any tree T′ refining $\mathcal{T}$ also refines T. Collections of trees
considered in practice are usually not compatible, motivating the MCT problem
which aims at finding a tree, denoted MCT($\mathcal{T}$), compatible with $\mathcal{T}$ and having
the largest number of leaves. Note that MCT is equivalent to MAST when
input trees are binary.
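The refinement relation can be tested through clusters, i.e. the leaf sets of internal nodes: for rooted trees on the same leaf set, collapsing an edge deletes exactly one cluster, so T refines T′ iff every cluster of T′ is a cluster of T. The sketch below relies on that standard characterization; the nested-tuple encoding and the names `clusters` and `refines` are assumptions of this illustration.

```python
def clusters(tree):
    """Set of leaf sets of the internal nodes of a nested-tuple tree."""
    out = set()

    def walk(t):
        if not isinstance(t, tuple):             # leaf
            return frozenset([t])
        s = frozenset().union(*(walk(c) for c in t))
        out.add(s)
        return s

    walk(tree)
    return out


def refines(t, t_prime):
    """True iff t refines t_prime (both trees assumed to share one leaf set)."""
    return clusters(t_prime) <= clusters(t)


print(refines(((1, 2), (3, 4)), ((1, 2), 3, 4)))
print(refines(((1, 2), 3, 4), ((1, 3), 2, 4)))
```

The root cluster is always included, so two trees with different leaf sets are never reported as refining each other.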
For any three leaves a, b, c in a tree T , there are only three possible binary
shapes for T |{a, b, c}, denoted a|bc, resp. b|ac, resp. c|ab, depending on their
innermost grouping of leaves (bc, resp. ac, resp. ab). These binary trees on 3
leaves are called rooted triples. Alternatively T |{a, b, c} can be a fan, i.e. a unique
internal node connected to the three leaves. A fan is denoted {a, b, c}.
We define rt(T ), resp. f (T ), as the set of rooted triples, resp. fans, induced
by the leaves of a tree T . Given a collection T = {T1 , T2 , . . . , Tk } of trees with
leaf set L, a set {a, b, c} ⊆ L is a hard conflict between (trees of) T whenever
∃Ti , Tj ∈ T s.t. a|bc ∈ rt(Ti ) and b|ac ∈ rt(Tj ). The set {a, b, c} is a soft conflict
between (trees of) T whenever a|bc ∈ rt(Ti ) and {a, b, c} ∈ f (Tj ).
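To make the two kinds of conflict concrete, here is a small Python sketch that computes the shape of T|{a, b, c} and classifies a 3-leaf set as a hard conflict, a soft conflict, or no conflict between two trees. The nested-tuple tree encoding and the names `grouped_pair` and `conflict_kind` are hypothetical; the code is a naive illustration, not the linear-time machinery developed later.

```python
def leaves(tree):
    """Leaf labels of a nested-tuple tree (a leaf is any non-tuple value)."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(c) for c in tree))


def grouped_pair(tree, trio):
    """Innermost pair of tree|trio as a frozenset, or None if it is a fan."""
    trio = set(trio)
    node = tree
    while True:
        hits = [(c, leaves(c) & trio) for c in node]
        hits = [(c, s) for c, s in hits if s]
        if len(hits) == 1:                       # all three leaves below one child
            node = hits[0][0]
        elif len(hits) == 3:                     # three distinct children: fan
            return None
        else:                                    # split 2 + 1: rooted triple
            return frozenset(next(s for _, s in hits if len(s) == 2))


def conflict_kind(t1, t2, trio):
    p1, p2 = grouped_pair(t1, trio), grouped_pair(t2, trio)
    if p1 == p2:
        return "none"
    if p1 is None or p2 is None:
        return "soft"                            # rooted triple versus fan
    return "hard"                                # two different rooted triples


print(conflict_kind(((1, 2), 3), (1, (2, 3)), {1, 2, 3}))
print(conflict_kind(((1, 2), 3), (1, 2, 3), {1, 2, 3}))
```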
Lemma 2 ([2, 3, 12]). Two trees with the same leaf set are isomorphic iff there
is no hard nor any soft conflict between them. A collection T of trees with the
same leaf set is compatible iff there is no hard conflict between T .
Definition 1. Given a set of conflicts C, let L(C) denote the leaves appearing in
C. Given a collection T with conflicts, an hs-peacemaker, resp. h-peacemaker,
of T is any set C of disjoint hard and soft, resp. only hard, conflicts between T
s.t. {Ti |(L − L(C)) : Ti ∈ T } is a collection of isomorphic trees, resp. compatible
trees. In other words, removing L(C) from the input trees removes all conflicts,
resp. all hard conflicts, between them.
3 Approximation Algorithms
3.1 An $O(n^2 + kn)$ Time 3-Approximation Algorithm for CMCT
Let T be a collection of trees on an n-leaf set L. It is well-known that T is
compatible iff every pair of trees in T is compatible [8]. Moreover,
Lemma 3 ([4]). $\mathcal{T}$ is a compatible collection of trees iff there exists a minimum
refinement T of $\mathcal{T}$ and $rt(T) = \bigcup_{T_i \in \mathcal{T}} rt(T_i)$.
If T is compatible, a minimum refinement T of T is a solution for MCT, as
L(T ) = L. From Lemma 2, one can obtain T by first computing a minimum
refinement T1,2 of two trees T1 , T2 ∈ T , and then iterating on T −{T1 , T2 }∪{T1,2 }
until only one tree remains that is the sought tree T .
If T is not compatible, then we apply the following:
Lemma 4 ([2, 3, 11]). Let T = {T1 , T2 , . . . , Tk } be a collection of trees on a
leaf set L and let C be an hs-peacemaker, resp. an h-peacemaker, of T . Then any
tree in T |(L − L(C)) is a 3-approximation for CMAST, resp. any refinement of
T |(L − L(C)) is a 3-approximation for CMCT, on T .
Given a pair of trees, [4] give an O(n) time algorithm that either returns a
minimum refinement when the trees are compatible, or otherwise identifies a
hard conflict C between them. Thus, from Lemma 4, the procedure sketched
above for a compatible collection, can be adapted to obtain a 3-approximation
of CMCT for a non-compatible collection T . Apply the algorithm of [4] to a pair
of trees {T1 , T2 } ⊆ T to obtain either their minimum refinement T1,2 or a hard
conflict C. In the latter case, remove C from all input trees and iterate. In the
former case, iterate on T − {T1, T2 } ∪ {T1,2}. When T is reduced to a single tree,
O(k + n) calls to the algorithm of [4] have been issued and the resulting set C of
removed conflicts is an h-peacemaker. Hence:
Theorem 1. The CMCT problem on a collection of k rooted trees on a same
n-leaf set can be 3-approximated in $O(n^2 + kn)$ time.
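The principle behind Lemma 4 (repeatedly remove the three leaves of a hard conflict until none remains) can be sketched with a brute-force search over 3-leaf sets. This is deliberately not the pairwise-merging procedure based on the algorithm of [4]: it runs in polynomial but far from $O(n^2 + kn)$ time and only illustrates what an h-peacemaker is. The nested-tuple encoding and all function names are assumptions of this sketch; every tree is assumed to have at least two leaves.

```python
from itertools import combinations


def clusters(tree):
    """Leaf sets of internal nodes, in post-order (root cluster last)."""
    out = []

    def walk(t):
        if not isinstance(t, tuple):
            return frozenset([t])
        s = frozenset().union(*(walk(c) for c in t))
        out.append(s)
        return s

    walk(tree)
    return out


def grouped_pair(cls, trio):
    """Pair grouped in the restriction to `trio`, or None for a fan."""
    for s in cls:
        inter = s & trio
        if len(inter) == 2:
            return inter
    return None


def greedy_h_peacemaker(trees):
    """Return (list of disjoint hard conflicts removed, remaining leaves)."""
    cls = [clusters(t) for t in trees]
    live = set(cls[0][-1])                       # root cluster = all leaves
    removed = []
    progress = True
    while progress:
        progress = False
        for trio in combinations(sorted(live), 3):
            trio = frozenset(trio)
            pairs = {grouped_pair(c, trio) for c in cls} - {None}
            if len(pairs) > 1:                   # two different rooted triples
                removed.append(trio)
                live -= trio
                progress = True
                break
    return removed, live


print(greedy_h_peacemaker([((1, 2), (3, 4)), ((1, 3), (2, 4))]))
```

On this example any compatible subtree keeps at most two of the four leaves, so two leaves must be removed; the sketch removes three, within the factor 3.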
3.2 A Linear Time 3-Approximation Algorithm for CMAST
W.l.o.g., this section considers input trees on a same n-leaf set labelled by positive integers 1, 2, . . . , n. First consider collections T of two trees. The following
characterization of isomorphic trees is the basis of our algorithm.
Lemma 5 ([4]). Two trees Ti and Tj are isomorphic iff rt(Ti ) = rt(Tj ) and
f (Ti ) = f (Tj ).
The definition of MAST($\mathcal{T}$) is independent of the order of the children of
nodes in trees. However, to efficiently compute an approximation of MAST($\mathcal{T}$),
we consider that T1 and T2 are ordered. Ordering a tree T consists in totally
ordering the children of every node in T. Thereby, this uniquely defines a left-right order πT on the leaves L of T.
Given an arbitrary ordering of T1 , the approximation algorithm first tries to
order T2 accordingly. In the following, π1 , resp. π2 , stands for πT1 , resp. πT2 ; and
π2 (i) stands for the i-th leaf in π2 . W.l.o.g., we also assume π1 = 1 . . . n.
Definition 2. Let π be an order on a set L. A subset S of L is an interval of
π whenever the elements of S occur consecutively in π (but not necessarily in
the same order). A tree T with leaf set L is embeddable in an order π on L
whenever T can be ordered s.t. πT = π.
Lemma 6. Let T be a tree with leaf set L and π be an arbitrary order of L.
Then, T is embeddable in π iff for any node u of T , L(u) is an interval of π.
Proposition 1. Let T be a tree and π be an order on its leaves. Testing whether
T is embeddable in π costs O(n) time. In the positive, ordering T such that
πT = π can be done in O(n) time.
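The test of Proposition 1 can be sketched directly from Lemma 6: L(u) is an interval of π exactly when the span between its smallest and largest positions equals its size. The code below is a minimal illustration under assumed conventions (nested-tuple trees, `order` listing exactly the leaves of the tree); the name `embeddable` is hypothetical.

```python
def embeddable(tree, order):
    """True iff every L(u) occupies consecutive positions of `order`."""
    pos = {leaf: i for i, leaf in enumerate(order)}
    ok = True

    def walk(t):
        # returns (smallest position, largest position, number of leaves)
        nonlocal ok
        if not isinstance(t, tuple):
            return pos[t], pos[t], 1
        lo, hi, size = len(order), -1, 0
        for c in t:
            l, h, s = walk(c)
            lo, hi, size = min(lo, l), max(hi, h), size + s
        if hi - lo + 1 != size:                  # L(u) is not an interval
            ok = False
        return lo, hi, size

    walk(tree)
    return ok


T = ((1, 2), (3, 4))
print(embeddable(T, [2, 1, 4, 3]), embeddable(T, [1, 3, 2, 4]))
```

Each node is visited once, so this check runs in time linear in the size of the tree.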
The running time stated in this proposition is achieved by performing bottom-up walks on disjoint paths in T, as described by Algorithm 1. For a node u in a
tree, let m(u) and M(u) resp. denote the smallest and largest leaf of L(u) in π.
Assume the children of any non-leaf node v ∈ T are originally stored in a doubly-linked list lc(v) which has to be ordered into a list lc′(v) so that πT|L(v) = π|L(v).
Algorithm 1: TreeOrder(T, π)
for any node u in T do lc′(u) ← ∅;
for i = 1 to n do
    let u be the leaf s.t. u = π^{-1}(i);
    repeat
        let v be the parent node of u in T;
        remove u from lc(v) and put it at the end of lc′(v);
        u ← v;
    until i ≠ m(u) or u is the root;
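A possible Python rendering of the bottom-up walks is sketched below. It rests on two assumptions that should be read as ours: the tree is given as a `children` dictionary with a designated `root` (leaves are the nodes without an entry), and the walk started at the i-th leaf keeps climbing only while the node just reached has that leaf as its smallest leaf, so that every node is appended to its parent's list exactly once. The tree is assumed to be embeddable in `order`.

```python
def tree_order(children, root, order):
    """Return reordered child lists so that the left-right leaf order is `order`."""
    parent = {c: p for p, cs in children.items() for c in cs}
    pos = {leaf: i for i, leaf in enumerate(order)}

    # m[u]: smallest position of a leaf below u, computed in post-order
    m = {}
    stack = [(root, False)]
    while stack:
        u, done = stack.pop()
        kids = children.get(u, [])
        if not kids:
            m[u] = pos[u]
        elif done:
            m[u] = min(m[c] for c in kids)
        else:
            stack.append((u, True))
            stack.extend((c, False) for c in kids)

    new = {u: [] for u in children}
    for i, leaf in enumerate(order):
        u = leaf
        while u != root:
            v = parent[u]
            new[v].append(u)                     # u becomes the next child of v
            u = v
            if m[u] != i:                        # v was already reached earlier
                break
    return new


children = {"r": ["x", "y"], "x": [3, 4], "y": [1, 2]}
print(tree_order(children, "r", [1, 2, 3, 4]))
```

The walks are pairwise disjoint, so the total work is proportional to the number of nodes.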
Due to the existence of conflicting triples, two arbitrary trees T1 and T2 with
same leaf set L may not be embeddable in a common order of L. If so, we can
however show the following:
Proposition 2. Let T1, T2 be trees with leaf set L = {1, ..., n}. In time O(n)
it is possible to identify a set C of disjoint conflicts between T1 and T2 s.t.
T2|(L − L(C)) is embeddable in π1|(L − L(C)).
Below is given a sketch of the proof for this proposition. Let u be a node in a tree
T with leaf set L and π be an arbitrary order on L. If an element x ∈ L − L(u)
is s.t. m(u) <π x <π M (u), then prevπ (x, u), resp. nextπ (x, u), stands for the
maximum, resp. minimum, element of L(u) w.r.t. π that is smaller, resp. larger,
than x.
Lemma 7. Let T1, T2 be trees on a leaf set L ⊆ {1, ..., n} and let {a, b, c} ⊆ L.
If both a <π1 b <π1 c and ac|b ∈ rt(T2), then {a, b, c} is a conflict between
T1 and T2. In particular, for a node u of T2 and a leaf x ∉ L(u) s.t. m(u) <π1
x <π1 M(u), {prevπ1(x, u), x, nextπ1(x, u)} is a conflict between T1 and T2.
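The second part of Lemma 7 can be illustrated as follows: given L(u) and the order π1, locate a leaf x lying strictly inside the span of L(u) but outside L(u), and return it with its two neighbours from L(u). The function name `gap_conflict` is hypothetical, and the sort makes this sketch slower than the pointer-based search used by the algorithm.

```python
def gap_conflict(leaf_set, order):
    """Return (prev, x, next) as in Lemma 7, or None if leaf_set is an interval."""
    pos = {leaf: i for i, leaf in enumerate(order)}
    inside = sorted(leaf_set, key=pos.__getitem__)
    for left, right in zip(inside, inside[1:]):
        if pos[right] - pos[left] > 1:           # a gap: some x is not in L(u)
            x = order[pos[left] + 1]
            return (left, x, right)
    return None


print(gap_conflict({1, 2, 4}, [1, 2, 3, 4, 5]))
```

Because `left` and `right` are consecutive members of L(u) in the order, they are exactly prev(x, u) and next(x, u) for the returned x.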
This lemma guides the search of T2 to remove leaves (in T2 and T1 ) forming a
set of disjoint conflicts C s.t. for any node u of T2 |(L − L(C)), L(u) is an interval
of leaves in π1 |(L−L(C)). Such a node u is then said to be full. When all nodes of
the resulting T2 are full, Lemma 6 ensures that T2 is embeddable in the left-right
order of the tree T1 |(L − L(C)).
Nodes of T2 are processed in post-order, such that the children of a node u are
known to be full when u is processed. For efficiency reasons, a list LI of disjoint
intervals of π1 is also maintained sorted w.r.t. π1. LI is initially composed
of unit intervals ({1}, . . . , {n}) corresponding to leaves of T2 . Then intervals of
LI are merged or removed while processing nodes of T2 so as to maintain the
following invariant:
Invariant 1. Any interval of the list LI contains the leaf set L(u) of some node
u of T2 that is full w.r.t. π1|(L − L(C)).
When a non-full node u is processed in the traversal of T2 , this invariant
together with pointers from each child of u to the corresponding elements
ordered in LI enables us (according to Lemma 7) to efficiently identify conflicts
whose removal turns u into a full node. Note that Invariant 1 is robust under
the removal of a leaf in L(v) for any processed node v.
Lemma 8. Let T1, T2 be two trees with leaf set in L and u be the current node
of T2 to be processed by the bottom-up algorithm (i.e. the children of u are full
w.r.t. π1). Then a set hs(u) of disjoint conflicts between {T1, T2} s.t. u is full
w.r.t. π1|(L − L(hs(u))) is found in time $O(d^+(u) + |hs(u)|)$.
Proposition 2 follows from Lemma 7, Invariant 1 and Lemma 8. Given two
arbitrary trees T1, T2, Propositions 1 and 2 show that, in linear time, disjoint
conflicts can be removed and children of nodes in T2 ordered s.t. the two resulting
trees have the same left-right order on their leaves. Thus, from now on, assume
that π1 = π2 . For convenience, even if some leaves have been removed, we note
π1 = 1 . . . n. Even if T1 and T2 have the same left-right order on their leaves,
they may still host conflicting triples. However, let us show that a post-order
search of T1 (or equiv. T2 ) is sufficient to remove such conflicts.
Definition 3. Let u be a node in a tree T, then rt(u) is the subset of triples
x|yz ∈ rt(T) s.t. #({x, y, z} ∩ L(u)) ≥ 2, and f(u) is the set of fans {x, y, z} ∈
f(T) s.t. {x, y, z} ⊆ L(u). Define a node u in tree T1 to be valid w.r.t. tree T2
if both rt(u) ⊆ rt(T2) and f(u) ⊆ f(T2) hold.
Note that if r1 is the root node of tree T1 , then rt(r1 ) = rt(T1 ) and f (r1 ) = f (T1 ).
Moreover, given a tree T2 s.t. L(T2 ) = L(T1 ), the validity of r1 w.r.t. T2 implies
that T1 and T2 are isomorphic, as any 3-leaf set is either a rooted triple or a
fan of both trees. Next lemma is the basis of a recursive process to obtain the
validity of r1 w.r.t. T2 .
Lemma 9. Let u be a node of T1 whose children, denoted $c_1, \ldots, c_{d^+(u)}$, are all
valid. Let p(m(u)), resp. s(M(u)), be the leaf preceding m(u), resp. succeeding
M(u), in π2 if it exists.
1. if {p(m(u))|m(u)M(u), s(M(u))|m(u)M(u)} ⊆ rt(T2) then rt(u) ⊆ rt(T2)
2. if u has only two children then f(u) ⊆ f(T2)
3. if u has at least three children and for any i ∈ {1, 2, ..., $d^+(u) - 2$},
{$m(c_i), m(c_{i+1}), m(c_{i+2})$} ∈ f(T2), then f(u) ⊆ f(T2).
Lemma 9 implies that if every node u ∈ T1 is processed after its children,
examining only O(d+ (u)) 3-leaf sets is enough to know whether a node u ∈ T1 is
already valid. When a conflict is encountered during this examination, its leaves
are removed from the trees.
Indeed, thanks to Lemma 1, removing a leaf in S(u) does not change the
pre-established validity of inner nodes of S(u). Thus, if c(u) denotes the number of such encountered conflicts, ensuring the validity of u involves looking at
$O(d^+(u) + c(u))$ 3-leaf sets. See Algorithm 2 for a complete description of the
procedure. Note that persistent dummy leaves can be artificially added at the
beginning and end of π1 and π2 s.t. p(m(u)) and s(M(u)) always exist for any
processed node u. Processing the whole tree T1 globally involves O(n) 3-leaf sets
as $\sum_{u \in T_1} c(u) = O(n)$ and $\sum_{u \in T_1} d^+(u) = O(n)$.
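As a worked check of the two sums (a sketch whose explicit constants are ours, not the paper's): every node other than the root is the child of exactly one node, every internal node has at least two children so that T1 has at most n − 1 internal nodes, and the removed conflicts are pairwise disjoint 3-leaf sets. Ignoring the two dummy leaves, and bounding degrees by those of the initial tree:

```latex
\sum_{u \in T_1} d^+(u) \;=\; \#\mathrm{nodes}(T_1) - 1 \;\le\; \bigl(n + (n-1)\bigr) - 1 \;=\; 2n - 2,
\qquad
\sum_{u \in T_1} c(u) \;\le\; \left\lfloor \frac{n}{3} \right\rfloor .
```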
Provided π1 is stored in a doubly-linked list; symmetric pointers are maintained between a node u ∈ T1 to be processed, and the two elements of π1
that are the leftmost and rightmost leaves of S(u); and T2 is preprocessed so
as to identify in O(1) the least common ancestor of any two of its nodes; then
Algorithm 2 runs in linear time. Hence,
Theorem 2. The CMAST problem on a collection of k rooted trees with same
n-leaf set can be 3-approximated in O(kn) time.
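The membership tests a|bc ∈ rt(T2) used by the procedure reduce to comparing lowest common ancestors: a|bc holds exactly when lca(b, c) lies strictly below lca(a, b). The sketch below uses a naive parent-climbing LCA in place of the constant-time preprocessing assumed above, so it illustrates the test only, not the running time. The `parent` dictionary encoding (root mapped to `None`) and the function names are assumptions.

```python
def make_lca(parent):
    """Return (lca, depth) functions for a tree given by parent pointers."""
    depth_cache = {}

    def depth(u):
        if u not in depth_cache:
            p = parent.get(u)
            depth_cache[u] = 0 if p is None else depth(p) + 1
        return depth_cache[u]

    def lca(u, v):
        while u != v:
            if depth(u) < depth(v):
                u, v = v, u
            u = parent[u]                        # lift the deeper node
        return u

    return lca, depth


def is_rooted_triple(parent, a, b, c):
    """True iff a|bc is a rooted triple of the tree."""
    lca, depth = make_lca(parent)
    return depth(lca(b, c)) > depth(lca(a, b))


def is_fan(parent, a, b, c):
    lca, _ = make_lca(parent)
    return lca(a, b) == lca(b, c) == lca(a, c)


# the tree ((1, 2), 3) with internal nodes "x" and root "r"
parent = {1: "x", 2: "x", "x": "r", 3: "r", "r": None}
print(is_rooted_triple(parent, 3, 1, 2), is_rooted_triple(parent, 1, 2, 3))
```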
The reader should notice that the above algorithms can be realized simultaneously by a single search of the tree. According to Proposition 1, Proposition 2
and Algorithm 2, the case k = 2 is solved in O(n) time. Handling a collection
T = {T1 , T2 , . . . , Tk } of k > 2 trees is done as for the MCT problem (see Section 3.1), i.e. by successively considering pairs of trees in T . This procedure runs
in O(nk) and, from Lemma 4, provides a 3-approximation of CMAST for T .
4 Inapproximability Results for MAST and MCT
In this section, we first state that the rooted and unrooted versions of CMAST
(equiv. CMCT) have the same approximation threshold. Then we detail new
negative results concerning the approximation of MCT, CMAST and CMCT.
Algorithm 2: AgreementSubtree(T1, T2)
Input: Two rooted trees s.t. π1 = π2
for each node u in a post-order traversal of T1 do
    /* Ensures that rt(u) ⊆ rt(T2) */
    repeat
        m(u) ← leftmost leaf of S(u); M(u) ← rightmost leaf of S(u)
        p(m(u)) ← leaf preceding m(u) in π1; f(M(u)) ← leaf following M(u) in π1
        if p(m(u))|m(u)M(u) ∉ rt(T2) then remove p(m(u)), m(u), M(u) from T1 and T2
        else if f(M(u))|m(u)M(u) ∉ rt(T2) then remove f(M(u)), m(u), M(u) from T1 and T2
    until {p(m(u))|m(u)M(u), f(M(u))|m(u)M(u)} ⊆ rt(T2) or d+(u) < 2
    /* Ensures that f(u) ⊆ f(T2) */
    i ← 1
    while d+(u) > 2 and i ≤ d+(u) − 2 do
        let c_1, c_2, ..., c_{d+(u)} be the children of u
        if {m(c_i), m(c_{i+1}), m(c_{i+2})} ∈ f(T2) then i ← i + 1
        else remove m(c_i), m(c_{i+1}), m(c_{i+2}) from T1 and T2
return T1
4.1 Rooted and Unrooted Versions of CMAST (equiv. CMCT) Share the Same Approximation Threshold
Let ϕ(n, k) be a function in Ω(n × k).
Proposition 3. Let ρ ≥ 1 be a real constant. Assume there exists a ρ-approximation algorithm for CMAST, resp. CMCT, on rooted trees with O(ϕ(n, k))
running time. Then, there exists a ρ-approximation algorithm for CMAST,
resp. CMCT, on unrooted trees with O(n × ϕ(n − 1, k)) running time.
Proposition 3 is implicitly used in [11] and is proved in the following way. Let
U be a collection of unrooted trees. To ρ-approximate CMAST, resp. CMCT, on
instance U, apply the hypothetical ρ-approximation algorithm to each collection
obtained by rooting all trees in U at a same leaf. Then, return the best of the n
computed solutions. Combining Theorem 2 and Proposition 3, resp. Theorem 1
and Proposition 3, we obtain that the unrooted version of CMAST, resp. CMCT,
is 3-approximable in $O(kn^2)$, resp. $O(n^3 + kn^2)$, time. Using a simple padding
argument yields the converse of Proposition 3:
Proposition 4. Let ρ ≥ 1 be a rational constant. Assume there exists a
ρ-approximation algorithm for CMAST, resp. CMCT, on unrooted trees with
O(ϕ(n, k)) running time. Then, there exists a ρ-approximation algorithm for
CMAST, resp. CMCT, on rooted trees with O(ϕ(n + ⌈ρn⌉ , k)) running time.
4.2 Hardness of Approximating CMAST on Three Trees
Theorem 3. The 3-CMAST problem is APX-hard.
Since 2-MAST (and thus, 2-CMAST) can be exactly solved in polynomial time
[21], Theorem 3 is in some sense tight. Its proof relies on a careful reading of [17]
which states that the general 3-MAST problem is APX-hard. In fact [17] proves
that a restriction of 3-MAST to a certain set of instances is APX-hard. CMAST
is not considered in [17], but it is easy to see that for this particular set of
instances, 3-MAST L-reduces to 3-CMAST.
4.3 Hardness of Approximating MCT and CMCT on Two Trees
In order to prove Theorems 5 (APX-hardness of 2-CMCT) and 6 (inapproximability of 2-MCT), we define an intermediate problem, called Maximum
Star-Forest (MSF). Let G = (V, E) be a graph. A star-forest of G is a subset
of E which does not contain any path of length 3. The MSF problem is: “given a
graph G, find a star-forest of G that is of maximum cardinality”. For each integer
∆ ≥ 1, we denote by ∆-MSFB the restriction of MSF to bipartite input graphs
having maximum degree at most ∆. The restriction of the Maximum Independent Set problem (MIS for short) to input graphs having maximum degree at most 3 is
denoted 3-MIS. Note that 3-MIS is APX-complete [1].
Theorem 4. The 4-MSFB problem is APX-hard.
Proof (sketch). We use an L-reduction from 3-MIS to 4-MSFB relying on the
following transformation. Let G = (V, E) be an instance of 3-MIS (i.e. a graph
with maximum degree at most 3), we construct an instance G′ = (V ′ , E ′ ) of
4-MSFB as follows.
$V' := V \cup \{\gamma_e : e \in E\} \cup \{\sigma_v, \tau_v : v \in V\}$,
$E' := \{\{u, \gamma_e\}, \{\gamma_e, v\} : e = \{u, v\} \in E\} \cup \{\{v, \sigma_v\}, \{\sigma_v, \tau_v\} : v \in V\}$.
Clearly, G′ can be obtained from G in polynomial time, and #V′ = m + 3n and
#E′ = 2m + 2n, where n and m denote the cardinality of V and E resp. □
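The transformation from G to G′ is simple enough to be written out. The following sketch builds G′ and can be used to check the stated cardinalities, the degree bound and bipartiteness on a small instance; the tagged-tuple naming of the new vertices and the function name are assumptions of this illustration.

```python
from collections import Counter


def reduce_3mis_to_4msfb(V, E):
    """Build (V', E') from a graph (V, E) given as a vertex list and edge pairs."""
    Vp = [("v", v) for v in V]
    Vp += [("gamma", e) for e in E]              # one vertex per edge of G
    Vp += [(tag, v) for v in V for tag in ("sigma", "tau")]
    Ep = []
    for e in E:
        u, v = e
        Ep.append((("v", u), ("gamma", e)))      # each edge of G is subdivided
        Ep.append((("gamma", e), ("v", v)))
    for v in V:
        Ep.append((("v", v), ("sigma", v)))      # pendant path v - sigma_v - tau_v
        Ep.append((("sigma", v), ("tau", v)))
    return Vp, Ep


# K4: a graph of maximum degree 3 with n = 4 vertices and m = 6 edges
V = [0, 1, 2, 3]
E = [(u, v) for u in V for v in V if u < v]
Vp, Ep = reduce_3mis_to_4msfb(V, E)
degree = Counter(x for e in Ep for x in e)
print(len(Vp), len(Ep), max(degree.values()))
```

Original and τ vertices form one side of the bipartition, γ and σ vertices the other, which is why G′ is bipartite whatever G is.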
Theorem 4 leads to the following result:
Proposition 5. 2-MCT is APX-hard even if it is restricted to collections $\mathcal{T}$ of
two rooted trees satisfying #MCT($\mathcal{T}$) ≥ $\frac{1}{4} \times n$, where n denotes the size of each
tree in $\mathcal{T}$.
Proof (sketch). We use an L-reduction from 4-MSFB to 2-MCT relying on the
following transformation. Let G = (V, E) be an instance of 4-MSFB. Since G
is bipartite there exist two independent sets I1 and I2 of G partitioning V.
W.l.o.g., we can assume that G has no isolated vertex. We construct a collection
$\mathcal{T}$ = {T1, T2} of two rooted trees with leaf set E. The root of Ti is denoted ri.
For each v ∈ Ii, let Xv be the non-empty star-tree whose leaf set is the set of all
edges of E admitting v as an extremity (a star-tree is a fan with an arbitrary
number of leaves). The child subtrees of ri are the trees Xv with v ∈ Ii.
The transformation requires polynomial time and the size of the instance of
2-MCT is linear in the size of the instance of 4-MSFB. The correctness of the
reduction follows by proving that for each subset F ⊆ E, F is a star-forest of G
iff T1|F and T2|F are compatible. □
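The equivalence at the heart of this reduction can be checked exhaustively on a small bipartite graph. In the sketch below an edge is a pair (u, v) with u ∈ I1 and v ∈ I2, so that Ti groups edges by their endpoint in Ii; compatibility of T1|F and T2|F is tested as the absence of a hard conflict (Lemma 2). The encoding and function names are assumptions, and the star-forest test relies on G being bipartite (no triangles).

```python
from collections import Counter
from itertools import combinations


def is_star_forest(F):
    """No path with three edges: no edge has both endpoints of degree >= 2 in F."""
    d1 = Counter(u for u, _ in F)
    d2 = Counter(v for _, v in F)
    return not any(d1[u] >= 2 and d2[v] >= 2 for u, v in F)


def grouped_pair(trio, side):
    """Pair of edges grouped in T_{side+1}|trio, or None if it is a fan."""
    for e, f in combinations(trio, 2):
        if e[side] == f[side]:
            third = next(g for g in trio if g not in (e, f))
            if third[side] != e[side]:
                return frozenset((e, f))
            return None                          # all three in one star-tree
    return None                                  # three different star-trees


def compatible(F):
    """True iff T1|F and T2|F have no hard conflict."""
    for trio in combinations(F, 3):
        p1, p2 = grouped_pair(trio, 0), grouped_pair(trio, 1)
        if p1 and p2 and p1 != p2:
            return False
    return True


# a path with five edges, alternating between I1 and I2
E = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
agree = all(
    is_star_forest(F) == compatible(F)
    for r in range(len(E) + 1)
    for F in combinations(E, r)
)
print(agree)
```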
Proposition 5 yields the two main results of this section. On the one hand, we
obtain:
Theorem 5. The 2-CMCT problem is APX-hard.
On the other hand, using the “self-improvement” technique of [17, 19] we deduce
from Proposition 5 that 2-MCT is hard to approximate within constant ratio:
Theorem 6. For any real constant δ < 1, the 2-MCT problem cannot be approximated
within ratio $2^{\log^{\delta} n}$, unless NP ⊆ DTIME[$2^{\mathrm{polylog}\, n}$].
The analogue of Theorem 6 for 3-MAST is [17, Theorem 3].
4.4 Hardness of Approximating MCT on Unbounded Number of Trees
For the general MCT problem we can find non-approximability results stronger
than Theorem 6. Approximating MCT on collections of n-leaf trees is at least
as hard as approximating MIS on n-vertex graphs. The proof consists in an
approximation preserving reduction from MIS to MCT, similar to the reduction
from MIS to MAST described in [5]. Since MIS is very hard to approximate [16]
(see also [9]), we obtain:
Theorem 7. For all real ε > 0, MCT is not approximable within ratio
$(1 + \epsilon)\, n^{1-\epsilon}$ unless NP = ZPP, resp. within ratio $(1 + \epsilon)\, n^{0.5-\epsilon}$ unless P = NP.
Note that Theorem 7 still holds if MCT is restricted to collections of trees
containing at least a binary tree. Remark that using the approximating via partitioning paradigm [14], one can approximate MAST within n/ log n [18]. This
also holds for MCT.
References
1. P. Alimonti and V. Kann. Some APX-completeness results for cubic graphs. Theor.
Comput. Sci., 237(1–2):123–134, 2000.
2. A. Amir and D. Keselman. Maximum agreement subtree in a set of evolutionary
trees: metrics and efficient algorithm. SIAM J. on Comput., 26(6):1656–1669, 1997.
3. V. Berry and F. Nicolas. Maximum agreement and compatible supertrees. In 15th
Annual Symposium on Combinatorial Pattern Matching (CPM’04), volume 3109
of LNCS, pages 205–219, 2004.
4. V. Berry and F. Nicolas. Improved parametrized complexity of maximum agreement subtree and maximum compatible tree problems. IEEE Trans. on Comput.
Biology and Bioinf., (to appear).
5. P. Bonizzoni, G. Della Vedova, and G. Mauri. Approximating the maximum isomorphic agreement subtree is hard. Int. J. of Found. of Comput. Sci., 11(4):579–590,
2000.
6. D. Bryant. Building trees, hunting for trees and comparing trees: theory and method
in phylogenetic analysis. PhD thesis, University of Canterbury, Department of
Mathematics, 1997.
7. R. Cole, M. Farach-Colton, R. Hariharan, T. M. Przytycka, and M. Thorup. An
O(n log n) algorithm for the Maximum Agreement SubTree problem for binary
trees. SIAM J. on Comput., 30(5):1385–1404, 2001.
8. G. F. Estabrook and F. R. McMorris. When is one estimate of evolutionary relationships a refinement of another? J. of Math. Biol., 10:367–373, 1980.
9. L. Engebretsen and J. Holmerin. Towards optimal lower bounds for clique and
chromatic number. Theor. Comput. Sci., 299(1–3):537–584, 2003.
10. M. Farach, T. M. Przytycka, and M. Thorup. On the agreement of many trees.
Inf. Proces. Letters, 55(6):297–301, 1995.
11. G. Ganapathy and T. J. Warnow. Approximating the complement of the maximum
compatible subset of leaves of k trees. In 5th Int. Workshop on Approximation
Algorithms for Combinatorial Optimization (APPROX’02), volume 2462 of LNCS,
pages 122–134, 2002.
12. G. Ganapathysaravanabavan and T. J. Warnow. Finding a maximum compatible
tree for a bounded number of trees with bounded degree is solvable in polynomial
time. In 1st Int. Workshop on Algorithms in Bioinformatics (WABI’01), volume
2149 of LNCS, pages 156–163, 2001.
13. A. Gupta and N. Nishimura. Finding largest subtrees and smallest supertrees.
Algorithmica, 21(2):183–210, 1998.
14. M. M. Halldórsson. Approximations of weighted independent set and hereditary
subset problems. J. of Graph Algor. and Appl., 4(1), 2000.
15. A. M. Hamel and M. A. Steel. Finding a maximum compatible tree is NP-hard for
sequences and trees. Appl. Math. Letters, 9(2):55–59, 1996.
16. J. Håstad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Math., 182:105–142,
1999.
17. J. Hein, T. Jiang, L. Wang, and K. Zhang. On the complexity of comparing evolutionary trees. Disc. Appl. Math., 71(1–3):153–169, 1996.
18. J. Jansson, J. H.-K. Ng, K. Sadakane, and W.-K. Sung. Rooted maximum agreement supertrees. In 6th Latin American Symposium on Theoretical Informatics
(LATIN’04), volume 2976 of LNCS, pages 499–508, 2004.
19. T. Jiang and M. Li. On the approximation of shortest common supersequences and
longest common subsequences. SIAM J. on Comput., 24(5):1122–1139, 1995.
20. M.-Y. Kao, T. W. Lam, W.-K. Sung, and H.-F. Ting. A decomposition theorem for
maximum weight bipartite matchings with applications to evolutionary trees. In
7th Annual European Symposium on Algorithms (ESA’99), volume 1643 of LNCS,
pages 438–449, 1999.
21. M.-Y. Kao, T. W. Lam, W.-K. Sung, and H.-F. Ting. An even faster and more
unifying algorithm for comparing trees via unbalanced bipartite matchings. J. of
Algor., 40(2):212–233, 2001.
22. M. A. Steel and T. J. Warnow. Kaikoura tree theorems: Computing the maximum
agreement subtree. Inf. Proces. Letters, 48(2):77–82, 1993.