We describe a hierarchical algorithm for computing optimum bases of certain matroids defined on graphs. Two families of matroids are introduced and polynomial-time algorithms for determining costs of their optimum bases are presented.
We study the parameter space decomposition induced by parametric optimization problems where the score of each feasible solution is a linear function with integer coefficients. We show that for a large class of problems the number of regions in the decomposition is polynomial in the length of the input. The proof uses geometric duality and a classical result on lattice polytopes. We apply the result to re-derive a known bound for parametric stable marriage and to obtain new ones for parametric phylogeny construction and sequence comparison.
Gene families are groups of genes that have descended from a common ancestral gene present in the species under study. Current, widely used gene family building algorithms can produce family clusters that may be fragmented or missing true family sequences (under-clustering). Here we present a classification method based on sequence pairs that first inspects given families for under-clustering and then predicts the missing sequences for the families using family-specific alignment score cutoffs. We have tested this method on a set of curated, gold-standard (“true”) families from the Yeast Gene Order Browser (YGOB) database, including 20 yeast species, as well as a test set of intentionally under-clustered (“deficient”) families derived from the YGOB families. For 83% of the modified yeast families, our pair-classification method was able to reliably detect under-clustering in “deficient” families that were missing 20% of sequences relative to the full (“true”) families. We also atte...
Given a collection of leaf-labeled trees on a common leafset and a fraction f ∈ (1/2, 1], a frequent subtree (FST) is a subtree isomorphically included in at least a fraction f of the input trees. The well-known maximum agreement subtree (MAST) problem identifies an FST with f = 1 that has the largest number of leaves. Apart from its intrinsic interest from the algorithmic perspective, MAST has practical applications as a metric for tree similarity, for computing tree congruence, in detecting horizontal gene transfer events, and as a consensus approach. Enumerating FSTs extends the MAST problem by definition and reveals additional subtrees not displayed by MAST. This can happen in two ways: such a subtree is included in a majority but not all of the input trees, or such a subtree, though included in all the input trees, does not have the maximum number of leaves. Further, FSTs can be enumerated on collections of trees having partially overlapping leafsets. MAST may not be useful here, especially if the common overlap among leafsets is very low. Though very useful, the number of FSTs suffers from combinatorial explosion: just a single MAST can exhibit exponentially many FSTs. This limits both the size of the trees that can be enumerated and the ability to comprehend enumerated FSTs. To overcome this, we propose enumeration of maximal frequent subtrees (MFSTs). An MFST is an FST that is not a subtree of any other FST. The set of MFSTs is a compact, non-redundant summary of all FSTs and is much smaller in size. Here we tackle the novel problem of enumerating all MFSTs in collections of phylogenetic trees. We demonstrate its utility in returning larger consensus trees in comparison to MAST. The current implementation is available on the web.
A framework for solving certain multidimensional parametric search problems in randomized linear time is presented, along with its application to optimization on matroids, including parametric minimum spanning trees on planar and dense graphs.
We consider the following geometric alignment problem: Given a set of line segments in the plane, find a convex region of smallest area that contains a translate of each input segment. This can be seen as a generalization of Kakeya’s problem of finding a convex region of smallest area such that a needle can be turned through 360 degrees within this region. Our main result is an optimal Θ(n log n)-time algorithm for our geometric alignment problem, when the input is a set of n line segments. We also show that, if the goal is to minimize the perimeter of the region instead of its area, then the optimum placement is when the midpoints of the segments coincide. Finally, we show that for any compact convex figure G, the smallest enclosing disk of G is a smallest-perimeter region containing a translate of any rotated copy of G.
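The minimum-perimeter statement in this abstract can be illustrated with a short sketch. It assumes segments are given as pairs of (x, y) endpoints, translates each so its midpoint sits at the origin, and takes the convex hull of the translated endpoints as the region. The function names are hypothetical and chosen for illustration; the sketch does not attempt the Θ(n log n) minimum-area algorithm.

```python
import math


def centered_endpoints(segments):
    """Translate each segment so that its midpoint lies at the origin."""
    points = []
    for (x1, y1), (x2, y2) in segments:
        hx, hy = (x2 - x1) / 2.0, (y2 - y1) / 2.0
        points.extend([(hx, hy), (-hx, -hy)])
    return points


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def perimeter(hull):
    """Boundary length of the hull (a lone segment is counted twice)."""
    n = len(hull)
    return sum(math.dist(hull[i], hull[(i + 1) % n]) for i in range(n))


def min_perimeter_region(segments):
    """Hull and perimeter for the placement with coinciding midpoints."""
    hull = convex_hull(centered_endpoints(segments))
    return hull, perimeter(hull)


# A horizontal and a vertical segment of length 2, far apart in the input.
hull, length = min_perimeter_region([((0, 0), (2, 0)), ((5, 5), (5, 7))])
```

For the two perpendicular segments in the usage lines, the centered placement gives a diamond with four vertices.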
We present an algorithm for determining whether a set of species, described by the characters they exhibit, has a perfect phylogeny, assuming the maximum number of characters is fixed. This algorithm is simpler and faster than the known algorithms when the number of characters is at least 4. 1 Introduction. A fundamental problem in biology is that of inferring the evolutionary history of a set of species, each of which is specified by the set of traits or characters that it exhibits [6, 7, 10, 11]. Information about evolutionary history can ...
Suppose we have a set $X$ consisting of $n$ taxa and we are given information from $k$ loci from which to construct a phylogeny for $X$. Each locus offers information for only a fraction of the taxa. The question is whether this data suffices to construct a reliable phylogeny. The decisiveness problem expresses this question combinatorially. Although a precise characterization of decisiveness is known, the complexity of the problem is open. Here we relate decisiveness to a hypergraph coloring problem. We use this idea to (1) obtain lower bounds on the amount of coverage needed to achieve decisiveness, (2) devise an exact algorithm for decisiveness, (3) develop problem reduction rules, and use them to obtain efficient algorithms for inputs with few loci, and (4) devise an integer linear programming formulation of the decisiveness problem, which allows us to analyze data sets that arise in practice.
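As a rough illustration of the coloring view, the sketch below brute-forces a decisiveness check. It assumes the known characterization takes the four-way partition form, which the abstract does not spell out: a coverage pattern is decisive if and only if, for every partition of $X$ into four non-empty parts, some locus contains a taxon from each part. A partition is encoded as a 4-coloring of the taxa, and a violating coloring is one that uses all four colors while no locus sees all four. The function name is hypothetical, and the enumeration takes time exponential in $n$, so this is not the exact algorithm or the integer linear program of the paper.

```python
from itertools import product


def is_decisive(taxa, loci):
    """Brute-force four-way partition test (assumed characterization)."""
    taxa = list(taxa)
    index = {t: i for i, t in enumerate(taxa)}
    loci_idx = [[index[t] for t in locus] for locus in loci]
    for coloring in product(range(4), repeat=len(taxa)):
        if len(set(coloring)) < 4:
            continue  # some part is empty, so this is not a four-way partition
        covered = any(
            len({coloring[i] for i in locus}) == 4 for locus in loci_idx
        )
        if not covered:
            return False  # no locus meets all four parts
    return True


# Two loci over four taxa, each missing one taxon.
result = is_decisive("abcd", [set("abc"), set("bcd")])
```

With fewer than four taxa there is no four-way partition, so the check returns True vacuously under this assumed characterization.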
Background: A semi-labeled tree is a tree where all leaves as well as, possibly, some internal nodes are labeled with taxa. Semi-labeled trees encompass ordinary phylogenetic trees and taxonomies. Suppose we are given a collection $\mathcal{P} = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_k\}$ of semi-labeled trees, called input trees, over partially overlapping sets of taxa. The agreement problem asks whether there exists a tree $\mathcal{T}$, called an agreement tree, whose taxon set is the union of the taxon sets of the input trees such that the restriction of $\mathcal{T}$ to the taxon set of $\mathcal{T}_i$ is isomorphic to $\mathcal{T}_i$, for each $i \in \{1, 2, \ldots, k\}$. The agreement problem is a special case of the supertree problem, the problem of synthesizing a collection of phylogenetic trees with partially overlapping taxon sets into a single supertree that...
We study a long-standing conjecture on the necessary and sufficient conditions for the compatibility of multi-state characters: There exists a function f(r) such that, for any set C of r-state characters, C is compatible if and only if every subset of f(r) characters of C is compatible. We show that for every r ≥ 2, there exists an incompatible set C of ⌊r/2⌋·⌈r/2⌉ + 1 r-state characters such that every proper subset of C is compatible. Thus, f(r) ≥ ⌊r/2⌋·⌈r/2⌉ + 1 for every r ≥ 2. This improves the previous lower bound of f(r) ≥ r given by Meacham (1983), and generalizes the construction showing that f(4) ≥ 5 given by Habib and To (2011). We prove our result via a result on quartet compatibility that may be of independent interest: For every integer n ≥ 4, there exists an incompatible set Q of ⌊(n−2)/2⌋·⌈(n−2)/2⌉ + 1 quartets over n labels such that every proper subset of Q is compatible. We contrast this with a result on the compatibility of triplets: For every n ≥ 3, if R is an incompatible...
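The two counts in this abstract are easy to tabulate. The sketch below reads the character bound as ⌊r/2⌋·⌈r/2⌉ + 1 and the quartet count as ⌊(n−2)/2⌋·⌈(n−2)/2⌉ + 1. Both readings are assumptions about formulas that are garbled in the extracted text; the first is consistent with the stated case f(4) ≥ 5, and the second gives 2 quartets for n = 4, which fits the fact that only three quartet topologies exist on four labels. The function names are hypothetical.

```python
def character_lower_bound(r):
    """Assumed lower bound on f(r): floor(r/2) * ceil(r/2) + 1, for r >= 2."""
    if r < 2:
        raise ValueError("r must be at least 2")
    return (r // 2) * ((r + 1) // 2) + 1


def quartet_set_size(n):
    """Assumed size of the quartet set: the same expression with n - 2."""
    if n < 4:
        raise ValueError("n must be at least 4")
    m = n - 2
    return (m // 2) * ((m + 1) // 2) + 1


# Tabulate the character bound for small r.
bounds = {r: character_lower_bound(r) for r in range(2, 9)}
```

Under this reading the bound grows quadratically in r, roughly r²/4, whereas the earlier bound f(r) ≥ r is linear.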
Phylogenetic trees represent evolutionary relationships among sets of organisms. Popular phylogenetic reconstruction approaches typically yield hundreds to thousands of trees on a common leafset. Storing and sharing such large collections of trees requires a considerable amount of space and bandwidth. Furthermore, the huge size of phylogenetic tree databases can make search and retrieval operations time-consuming. Phylogenetic compression techniques are specialized compression techniques that exploit redundant topological information to achieve better compression of phylogenetic trees. Here, we present EvoZip, a new approach for phylogenetic tree compression. On average, EvoZip achieves 71.6% better compression and takes 80.71% less compression time and 60.47% less decompression time than TreeZip, the current state-of-the-art algorithm for phylogenetic tree compression. While EvoZip is based on TreeZip, it betters TreeZip due to (a) an improved bipartition and support list encoding sch...
Papers by David Fernández-Baca