Obviously, the numerical taxonomy problem is equally relevant to other historical sciences with an evolutionary branching process leading to evolutionary distances, e.g., historical linguistics.
More generally, if we can approximate distances with a tree metric, then the tree of this metric provides a very compact and convenient representation that is much easier to navigate than a general distance function. Picking any root, the tree induces a distance-based hierarchical classification, and as the discussions between Plato and Aristotle attest, humans have been interested in such hierarchical classifications since ancient times.
1.3 History of Lp Tree Fitting
Since Cavalli-Sforza and Edwards introduced the tree fitting problem, it has accumulated an extensive literature. In 1977 [
54], it was shown that if there is a tree metric coinciding exactly with
\(\mathcal {D}\), then it is unique and it can be found in time linear in the input size, i.e.,
\(O(|S|^2)\) time. The same then also holds trivially for ultrametrics. Unfortunately there is typically no tree metric coinciding exactly with
\(\mathcal {D}\), and in 1987 [
28] it was shown that for
\(L_1\) and
\(L_2\) the numerical taxonomy problem is NP-hard in the tree metric and the ultrametric cases. The problems are in fact APX-hard (see Section
9), which rules out the possibility of a polynomial-time approximation scheme. Thus, a constant factor, like ours for
\(L_1\), is the best one can hope for from a complexity perspective for these problems.
For the
\(L_\infty\) numerical taxonomy problem, there was much more progress. In 1993 [
34], it was shown that for the ultrametric case an optimal solution can be found in
\(\Theta (|S|^2)\) time. More recently, it was shown that when the points are embedded into
\(\mathbb {R}^d\) and the distances are given by the pairwise Euclidean distances, the problem can be approximated in subquadratic time [
22,
25]. For the general trees case (still in the
\(L_\infty\)-norm objective), Reference [
2] gave an
\(O(|S|^2)\) algorithm that produces a constant factor approximation and proved that the problem is APX-hard (unlike the ultrametric case).
The technical result from Reference [
2] was a general reduction from tree metrics to ultrametrics. It modifies the input distance matrix and asks for a fit of this new input by an ultrametric, which can later be converted into a tree metric for the original distance matrix. The result states that for any
p, if we can minimize the restricted ultrametric
\(L_p\) error within a factor
\(\alpha\) in polynomial-time, then there is a polynomial-time algorithm that minimizes the tree metric
\(L_p\) error within a factor
\(3\alpha\). The reduction from Reference [
2] imposes a certain restriction on the ultrametric, but the restriction is not problematic, and in Section
8, we will even argue that the restriction can be completely eliminated with a slightly modified reduction. With
n species, the reduction from tree metrics to ultrametrics can be performed in time
\(O(n^2)\). Applying this to the optimal ultrametric algorithm from Reference [
34] for the
\(L_\infty\)-norm objective yielded a factor 3 approximation for general tree metrics under the
\(L_\infty\)-norm objective. The generic impact is that
for any \(L_p\), later algorithms only had to focus on the ultrametric case to immediately get results for the often more important tree metrics case, up to losing a factor 3 in the approximation guarantee. Indeed, the technical result of this article is a constant factor approximation for the ultrametric case. Thus, letting TreeMetric and UltraMetric respectively denote the approximation factors of the tree metric and ultrametric problems, we have
\[\textsf {TreeMetric}\le 3\cdot \textsf {UltraMetric}.\]
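To give a feel for how such a reduction can work, here is a minimal Python sketch of the classical Farris transform, a standard tool in phylogenetics relating tree metrics and ultrametrics. We stress that this is an illustration under our own simplifications: it is not the exact reduction of Reference [2], and it omits the restriction on the ultrametric and the constants behind the factor 3.

```python
def gromov_product(D, u, v, c):
    # (u|v)_c = (D[u][c] + D[v][c] - D[u][v]) / 2: in a tree, the distance
    # from c at which the paths toward u and toward v split.
    return (D[u][c] + D[v][c] - D[u][v]) / 2.0

def farris_transform(D, c):
    """Map a distance matrix D (list of lists) to H[u][v] = M - (u|v)_c.
    If D is a tree metric, then H is an ultrametric, so an ultrametric
    fit of H can be converted back into a tree metric fit of D."""
    n = len(D)
    M = max(gromov_product(D, u, v, c)
            for u in range(n) for v in range(n) if u != v)
    H = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u != v:
                H[u][v] = M - gromov_product(D, u, v, c)
    return H, M

def invert_farris(U, D, c, M):
    # Given an ultrametric fit U of H, recover a tree metric candidate:
    # D'(u,v) = 2 * (U[u][v] - M) + D[u][c] + D[v][c].
    n = len(U)
    return [[2 * (U[u][v] - M) + D[u][c] + D[v][c] if u != v else 0.0
             for v in range(n)] for u in range(n)]
```

Since the inversion rescales each pairwise error by exactly a factor 2, approximation guarantees transfer between the two problems up to constant factors.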
For
\(L_p\)-norms with constant
p, the developments have been much slower. Ma et al. [
44] considered the problem of finding the best
\(L_p\) fit by an ultrametric where distances in the ultrametric are no smaller than the input distances. For this problem, they obtained an
\(O(n^{1/p})\) approximation.
Later, Dhamdhere [
29] considered the problem of finding a line metric to minimize additive distortion from the given data (measured by the
\(L_1\)-norm) and obtained an
\(O(\log n)\) approximation. In fact, his motivation for considering this problem was to develop techniques that might be useful for finding the closest tree metric with distance measured by the
\(L_1\)-norm. Harb, Kannan, and McGregor [
40] developed a factor
\(O(\min \lbrace n,k\log n\rbrace ^{1/p})\) approximation for the closest ultrametric under the
\(L_p\)-norm where
k is the number of distinct distances in the input.
The best bounds known for the ultrametric variant of the problem are due to Ailon and Charikar [
3]. They first focus on ultrametrics in
\(L_1\) and show that if the distance matrix has only
k distinct distances, then it is possible to approximate the
\(L_1\) error within a factor
\(k+2\). Next they obtain an LP-based
\(O((\log n)(\log \log n))\) approximation for arbitrary distance matrices. Finally, they sketch how it can be generalized to an
\(O(((\log n)(\log \log n))^{1/p})\) approximation of the
\(L_p\) error for any
p. Using the reduction from Reference [
2], they also get an
\(O(((\log n)(\log \log n))^{1/p})\) approximation for tree metrics under the
\(L_p\)-norm objective. The
\(O(((\log n)(\log \log n))^{1/p})\) approximation comes from an
\(O((\log n)(\log \log n))\) approximation of the
\(p\)th moment of the error, i.e., of the sum of the \(p\)th powers of the pairwise errors.
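To spell out the conversion (a standard calculation; the notation \(T\) for the fitted metric is ours): an \(\alpha\)-approximation of the \(p\)th moment yields an \(\alpha ^{1/p}\)-approximation of the \(L_p\) error, since
\[\sum _{i\lt j}\bigl |T(i,j)-\mathcal {D}(i,j)\bigr |^p\le \alpha \cdot \min _{T^{\prime }}\sum _{i\lt j}\bigl |T^{\prime }(i,j)-\mathcal {D}(i,j)\bigr |^p\quad \Longrightarrow \quad \Vert T-\mathcal {D}\Vert _p\le \alpha ^{1/p}\cdot \min _{T^{\prime }}\Vert T^{\prime }-\mathcal {D}\Vert _p.\]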
Technically, Ailon and Charikar [
3] present a simple LP relaxation for
\(L_1\) ultrametric fitting—an LP that will also be used in our article. They get their
\(O((\log n)(\log \log n))\) approximation using an LP rounding akin to the classic
\(O(\log n)\) rounding of Leighton and Rao for multicut [
43]. The challenge is to generalize the approach to deal with the hierarchical issues associated with ultrametrics and to show that this can be done while paying only an extra factor
\(O(\log \log n)\) in the approximation guarantee. Next they show that their LP formulation and rounding are general enough to handle different variants, including other
\(L_p\)-norms as mentioned above; they can also handle the
weighted case, where for each pair of species
\(i,j\), the error contribution to the overall error is multiplied by a value
\(w_{ij}\). However, this weighted problem captures the multicut problem (and the weighted minimization correlation clustering problem) [
3]. Since multicut cannot be approximated within a constant factor assuming the Unique Games Conjecture [
19] and the best-known approximation bound for it remains
\(O(\log n)\), it is beyond the reach of current techniques to do much better in these more general settings.
Ailon and Charikar [
3] conclude that “determining whether an
\(O(1)\) approximation can be obtained is a fascinating question. The LP formulation used in our [their] work could eventually lead to such a result.” For their main LP formulation for the (unweighted)
\(L_1\) ultrametric fitting, the integrality gap was only known to be somewhere between 2 and
\(O((\log n) (\log \log n))\). To break the
\(\log n\)-barrier, we must come up with a radically different way of rounding this LP and free ourselves from the multicut-inspired approach.
For \(L_1\) ultrametric fitting, we give the first constant factor approximation, and we show this can be obtained by rounding the LP proposed by Ailon and Charikar, thus demonstrating a constant integrality gap for their LP. Our solution breaks the \(\log n\) barrier using the special combinatorial structure of the \(L_1\) problem.
Stepping a bit back, having different solutions for different norms should not come as a surprise. As an analogue, take the generic problem of placing
k facilities in such a way that each of
n cities is close to the nearest facility. Minimizing the vector of distances in the
\(L_1\)-norm is called the
k-median problem. In the
\(L_2\)-norm, it is called the
k-means problem and in the
\(L_\infty\)-norm the
k-center problem. Indeed, while the complexity of the
k-center problem was settled in the mid-1980s thanks to Gonzalez’s algorithm [
38], it remained a major open problem for the next 15 years to obtain a constant factor approximation for the
k-median and the
k-means problems. Similarly, our understanding of the
k-means problem (
\(L_2\)-objective) remains poorer than our understanding of the
k-median problem, and the former is in fact provably harder (it admits no better than a
\((1+8/e)\)-approximation algorithm [
39], while
k-median can be approximated within a factor 2.675 [
9]).
For our tree fitting problem, the
\(L_\infty\)-norm has been understood since the 1990s, and our result shows that the
\(L_1\)-norm admits a constant factor approximation algorithm. The current state of affairs for tree metrics and ultrametrics is summarized in Table
1. The status for
\(L_p\) tree fitting is that we have a good constant factor approximation if we want to minimize the total error
\(L_1\) or the maximal error
\(L_\infty\). For all other
\(L_p\)-norms, we only have the much weaker but general
\(O(((\log n)(\log \log n))^{1/p})\) approximation from Reference [
3]. In particular, we do not know if anything better is possible with
\(L_2\). The gap is so large that even in situations where an
\(L_2\) fit would normally be preferred, our much stronger approximation guarantee for
\(L_1\) may make it the better choice.
1.4 Other Related Work
Computational Biology. Researchers have also studied reconstruction of phylogenies under stochastic models of evolution (see Farach and Kannan [
33] or Mossel et al. [
46] and the references therein; see also Henzinger et al. [
41]).
Finally, related to the hierarchical correlation clustering problem that we introduce in this article is the hierarchical clustering problem introduced by Dasgupta [
27], where the goal is, given a similarity matrix, to build a hierarchical clustering tree in which the more similar two points are, the lower in the tree they are separated (formally, a pair
\((u,v)\) induces a cost of similarity(
\(u,v\)) times the size of the minimal subtree containing both
u and
v; the goal is to minimize the sum of the costs of the pairs).
5,
14,
15,
16,
18,
23,
24,
45,
47], see also References [
1,
6,
13,
21]) but differs from our settings, since the resulting tree may not induce a metric space.
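In symbols (our notation, with \(\operatorname{sim}(u,v)\) the similarity and \(T_{u\vee v}\) the minimal subtree of the hierarchical clustering tree \(T\) containing both \(u\) and \(v\)), Dasgupta's objective reads
\[\operatorname{cost}(T)=\sum _{\lbrace u,v\rbrace }\operatorname{sim}(u,v)\cdot \bigl |\operatorname{leaves}(T_{u\vee v})\bigr |.\]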
Metric Embeddings. There is a large body of work on metric embedding problems. For example, the metric violation distance problem asks to embed an arbitrary distance matrix into a metric space while minimizing the
\(L_0\)-objective (i.e., minimizing the number of distances that are not preserved in the metric space). The problem is considerably harder and is only known to admit an
\(O(\text{OPT}^{1/3})\)-approximation algorithm [
31,
32,
36], while no hardness of approximation beyond a factor of 2 is known. More practical results along this line of work include References [
51] and [
37]. Sidiropoulos et al. [
48] also considered the problem of embedding into metrics, ultrametrics, and so on, while minimizing the total number of outlier points.
There is also a rich literature on metric embedding problems where the measure of interest is the multiplicative distortion, and the goal of the problem is to approximate the absolute distortion of the metric space (as opposed to approximating the optimal embedding of the metric space). Several such problems have been studied in the context of approximating metric spaces via tree metrics (e.g., References [
8,
30]). The objective of these works is very different, since they are focused on the absolute expected multiplicative distortion over all input metrics while we aim at approximating the
optimal expected additive distortion for each individual input metric.
While the embedding techniques developed in References [
8,
30] and related works have been very successful for designing approximation algorithms in a variety of contexts, they are not aimed at numerical taxonomy: their goal is to provide guarantees for arbitrary input metrics. For our tree-fitting problem, in contrast, the idea is that the ground truth is a tree, e.g., a phylogeny, and that the measured distances, despite noise and imperfections of the model, are close to the metric of the true tree. To recover an approximation to the true tree, we therefore seek a tree that compares well against the best possible fit of a tree metric.
1.6 High-level Algorithm for Hierarchical Correlation Clustering
Our main technical contribution in this article is solving Hierarchical Correlation Clustering. Reductions from Ultrametric to Hierarchical Correlation Clustering, and from general Tree Metric to Ultrametric, are already known from References [
3,
40] and [
2], respectively. We discuss both reductions in Sections
7 and
8. This includes removing some restrictions in the reduction from Tree Metrics to Ultrametrics.
Focusing on Hierarchical Correlation Clustering, our input is the \(\ell\) weights \(\delta ^{(1)},\ldots ,\delta ^{(\ell)}\in \mathbb {R}_{\gt 0}\) and \(\ell\) edge sets \(E^{(1)},\ldots ,E^{(\ell)}~\subseteq ~{S\choose 2}\).
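To fix the objective in code form, the following sketch (with representation choices that are ours, for illustration: edge sets as sets of frozenset pairs, partitions as lists of sets) computes the hierarchical correlation clustering cost \(\sum _{t=1}^{\ell }\delta ^{(t)}\,|E^{(t)}\triangle E(P^{(t)})|\) that the two steps below aim to minimize over hierarchical partitions:

```python
from itertools import combinations

def pairs_within(partition):
    """E(P): the set of unordered pairs co-clustered by partition P."""
    return {frozenset(pair)
            for part in partition
            for pair in combinations(part, 2)}

def hcc_cost(deltas, edge_sets, partitions):
    """sum over levels t of delta^(t) * |E^(t) symmetric-difference E(P^(t))|.
    edge_sets[t] is a set of frozenset pairs; partitions[t] a list of sets."""
    return sum(delta * len(E ^ pairs_within(P))
               for delta, E, P in zip(deltas, edge_sets, partitions))
```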
Step 1: Solve correlation clustering independently for each level. The first step in our solution is to solve the correlation clustering problem defined by
\(E^{(t)}\) for each level
\(t=1,\ldots ,\ell\) independently, thus obtaining an intermediate partition
\(Q^{(t)}\). As we mentioned in Section
1.5.1, this can be done so that
\(Q^{(t)}\) minimizes
\(|E^{(t)}\triangle E(Q^{(t)})|\) within a constant factor.
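For concreteness, one classical way to obtain such a constant factor (our choice for illustration; the article only requires some constant factor correlation clustering subroutine here) is the randomized pivot scheme of Ailon, Charikar, and Newman, an expected 3-approximation on complete instances:

```python
import random

def pivot_correlation_clustering(species, E):
    """Randomized pivot: repeatedly pick a random pivot and cluster it with
    all its remaining '+' neighbors (pairs present in the edge set E)."""
    remaining = set(species)
    clusters = []
    while remaining:
        pivot = random.choice(sorted(remaining))
        remaining.discard(pivot)
        cluster = {pivot} | {v for v in remaining
                             if frozenset((pivot, v)) in E}
        clusters.append(cluster)
        remaining -= cluster
    return clusters
```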
Step 2: Solve hierarchical cluster agreement. We now use the
\(\ell\) weights
\(\delta ^{(1)},\ldots ,\delta ^{(\ell)}\in \mathbb {R}_{\gt 0}\) and
\(\ell\) partitions
\(Q^{(1)},\ldots ,Q^{(\ell)}\) of
S as input to the hierarchical cluster agreement problem, which we solve using an LP very similar to the one Ailon and Charikar [
3] used to solve general hierarchical correlation clustering within a
\(O((\log n)(\log \log n))\) factor. However, when applied to the special case of hierarchical cluster agreement, it allows an unusual kind of constant-factor LP rounding. We use the LP variables (which only indicate how much pairs of species should be together) to decide which sets from the input partitions are important to the hierarchy and which sets can be ignored. The important sets may not yet be hierarchical (or laminar), but we show that some small modifications suffice. This is done by a relatively simple combinatorial algorithm that modifies the important sets bottom-up to generate the hierarchical output partitions
\(P^{(1)},\ldots , P^{(\ell)}\). The result is a poly-time constant factor approximation for hierarchical cluster agreement, that is,
\[\sum _{t=1}^{\ell }\delta ^{(t)}\bigl |E(Q^{(t)})\triangle E(P^{(t)})\bigr |\le O(1)\cdot \min _{\substack{\text{hierarchical}\\ R^{(1)},\ldots ,R^{(\ell)}}}\;\sum _{t=1}^{\ell }\delta ^{(t)}\bigl |E(Q^{(t)})\triangle E(R^{(t)})\bigr |.\]
The output partitions
\(P^{(1)},\ldots , P^{(\ell)}\) are also returned as output to the original hierarchical correlation clustering problem.
We now provide a high-level overview of the hierarchical cluster agreement algorithm and the intuition behind it. The algorithm can be broadly divided into two parts.
LP cleaning. We start by optimally solving the LP based on the weights \(\delta ^{(1)},\dots ,\delta ^{(\ell)}\) and partitions \(Q^{(1)},\dots ,Q^{(\ell)}\). For each level \(t\in \lbrace 1,\ldots , \ell \rbrace\), we naturally think of the relevant LP variables as distances and call them LP distances. That is because a small value means that the LP wants the corresponding species to be in the same part of the output partition at level t and vice versa, while the LP constraints also enforce the triangle inequality. The weights \(\delta ^{(1)},\dots ,\delta ^{(\ell)}\) impact the optimal LP variables but will otherwise not be used in the rest of the algorithm.
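For concreteness, the following is a plausible rendering of such an LP (our sketch; the precise formulation is the one from Reference [3] used later in the article). The variable \(x^{(t)}_{uv}\in [0,1]\) is the LP distance between \(u\) and \(v\) at level \(t\): value 0 means the LP wants them together, value 1 fully separated.
\[\begin{aligned}\text{minimize}\quad &\sum _{t=1}^{\ell }\delta ^{(t)}\biggl (\sum _{(u,v)\in E(Q^{(t)})}x^{(t)}_{uv}+\sum _{(u,v)\notin E(Q^{(t)})}\bigl (1-x^{(t)}_{uv}\bigr)\biggr)&&\\ \text{subject to}\quad &x^{(t)}_{uw}\le x^{(t)}_{uv}+x^{(t)}_{vw}&&\text{(triangle inequality at each level)}\\ &x^{(t+1)}_{uv}\le x^{(t)}_{uv}&&\text{(levels only coarsen as }t\text{ grows)}\\ &0\le x^{(t)}_{uv}\le 1.&&\end{aligned}\]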
Using the LP distances, we clean each set in every partition
\(Q^{(t)}\) independently. The objective of this step (LP-Cleaning, Algorithm
2) is to keep only the sets in which most species are very close to each other and far away from the species not in the set; all other sets are disregarded. We process the sets of each level independently, and the algorithm may only slightly modify the sets that it does not disregard. The fact that we can clean each set independently, deciding whether it is important without looking at any other set, makes this part of our algorithm quite simple.
Omitting exact thresholds for simplicity, the algorithm works as follows. We process each set \(C_I\in Q^{(t)}\) by keeping only those species that are at very small LP distance from at least half of the other species in \(C_I\) and at large LP distance from almost all the species outside \(C_I\). Note that by the triangle inequality and the pigeonhole principle, all species left in a set are at relatively small distance from each other. After this cleaning process, we keep a set only if at least \(90\%\) of its species are still intact, and we completely disregard it otherwise. The LP cleaning algorithm outputs the sequence \(L^{(*)}=(L^{(1)},\dots ,L^{(\ell)})\), where \(L^{(t)}\) is the family of surviving cleaned sets from \(Q^{(t)}\).
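A minimal sketch of this cleaning step for one level follows; since the text deliberately omits the exact constants, the thresholds `close`, `far`, `almost_all`, and `keep_frac` are placeholder choices of ours.

```python
def lp_clean_level(S, Q_t, x, close=0.1, far=0.7, almost_all=0.95,
                   keep_frac=0.9):
    """S: the set of all species; Q_t: the sets of partition Q^(t);
    x[u][v]: the LP distance between u and v at this level."""
    surviving = []
    for C in Q_t:
        outside = S - C
        cleaned = set()
        for u in C:
            near_in = sum(1 for v in C if v != u and x[u][v] <= close)
            far_out = sum(1 for v in outside if x[u][v] >= far)
            # Keep u if LP-close to at least half of C and LP-far from
            # almost every species outside C.
            if (near_in >= (len(C) - 1) / 2
                    and far_out >= almost_all * len(outside)):
                cleaned.add(u)
        # A set survives only if at least ~90% of its species remain.
        if len(cleaned) >= keep_frac * len(C):
            surviving.append(cleaned)
    return surviving
```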
Derive hierarchy. Taking
\(L^{(*)}\) as input, in the next step the algorithm Derive-Hierarchy (Algorithm
3) computes a hierarchical partition
\(P^{(*)}=(P^{(1)},\dots ,P^{(\ell)})\) of
S. This algorithm works bottom-up, initializing an auxiliary bottommost level of the hierarchy with
\(|S|\) sets where each set is a singleton and corresponds to a species of
S. Then the algorithm performs
\(\ell\) iterations where at the
tth iteration it processes all the disjoint sets in
\(L^{(t)}\) and computes partition
\(P^{(t)}\) while ensuring that at the end of the iteration
\(P^{(1)},\dots ,P^{(t)}\) are hierarchical. An interesting feature of our algorithm is that while creating
\(P^{(t)}\) we do not make any changes to the lower partitions
\(P^{(1)},\dots ,P^{(t-1)}\). Next, we discuss how to compute
\(P^{(t)}\) given
\(L^{(t)}\) and the lower partitions
\(P^{(1)},\dots ,P^{(t-1)}\).
Consider a set \(C_{LP}\in L^{(t)}\). If for each lower level set \(C^{\prime }\) either \(C^{\prime }\cap C_{LP}=\emptyset\) or \(C^{\prime }\subseteq C_{LP}\), then introducing \(C_{LP}\) at level t does not violate the hierarchy property. Otherwise, let \(C^{\prime }\) be a lower level set such that \(C^{\prime }\cap C_{LP}\ne \emptyset\) and \(C^{\prime }\not\subseteq C_{LP}\). Note that, as already mentioned, once created, \(C^{\prime }\) is never modified while processing upper level sets. Thus, to ensure the hierarchy condition, the algorithm can either extend \(C_{LP}\) so that it completely covers \(C^{\prime }\) or discard the common part from \(C_{LP}\).
In the process of modifying \(C_{LP}\) (where we can add or discard some species from it), at any point we define the core of \(C_{LP}\) to be the part that comes from the set as initially created. Now, to resolve the conflict between \(C_{LP}\) and \(C^{\prime }\), we work as follows. If the core of \(C_{LP}\) intersects the core of \(C^{\prime }\), then we extend \(C_{LP}\) so that \(C^{\prime }\) becomes a subset of it. Omitting technical details, there are two main ideas here: First, we ensure that the number of species in \(C^{\prime }\) (respectively, \(C_{LP}\)) that are not part of its core is negligible with respect to the size of \(C^{\prime }\) (respectively, \(C_{LP}\)). Second, since the cores of \(C_{LP}\) and \(C^{\prime }\) have at least one common species, using the triangle inequality we can argue that any pair of species from the cores of \(C^{\prime }\) and \(C_{LP}\) is at small LP distance; therefore, nearly all pairs of species in \(C_{LP}\cup C^{\prime }\) are at small LP distance, meaning that the extension of \(C_{LP}\) is desirable (i.e., its cost is within a constant factor of the LP cost while it ensures the hierarchy).
Here we want to emphasize that because of the LP-cleaning, for any lower level set \(C^{\prime }\), at level t there exists at most one set whose core intersects the core of \(C^{\prime }\). We call this the hierarchy-friendly property of the LP-cleaned sets. This property is crucial for consistency, as it ensures that at level t there cannot exist more than one set that is allowed to contain \(C^{\prime }\) as a subset.
In the other case, where the cores of \(C_{LP}\) and \(C^{\prime }\) do not intersect, the algorithm removes \(C_{LP}\cap C^{\prime }\) from \(C_{LP}\). The analysis of this part is more technical but follows the same arguments, namely using the aforementioned properties of LP-cleaning along with the triangle inequality.
After processing all the sets in \(L^{(t)}\), the algorithm naturally combines these processed sets with \(P^{(t-1)}\) to generate \(P^{(t)}\), thus ensuring that \(P^{(1)}, \ldots , P^{(t)}\) are hierarchical.
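The following simplified sketch captures the bottom-up mechanics just described. It is our compression of Algorithm 3: each cluster carries its immutable core and the two conflict rules are applied as stated, while the real algorithm's threshold bookkeeping is omitted.

```python
def derive_hierarchy(S, L_levels):
    """S: the species; L_levels: the families L^(1), ..., L^(ell) of
    surviving cleaned sets. Returns nested partitions, bottom-up."""
    # Auxiliary bottommost level: one singleton per species.
    P_prev = [{"members": {s}, "core": {s}} for s in S]
    hierarchy = []
    for L_t in L_levels:                      # levels t = 1, ..., ell
        processed = []
        for C in L_t:
            cur = {"members": set(C), "core": set(C)}
            for Cp in P_prev:
                inter = cur["members"] & Cp["members"]
                if not inter or Cp["members"] <= cur["members"]:
                    continue                  # no hierarchy violation here
                if cur["core"] & Cp["core"]:
                    cur["members"] |= Cp["members"]   # extend to cover Cp
                else:
                    cur["members"] -= inter           # discard common part
            if cur["members"]:
                processed.append(cur)
        # The hierarchy-friendly property keeps processed sets disjoint;
        # lower level sets not covered by any of them survive unchanged.
        covered = set()
        for c in processed:
            covered |= c["members"]
        P_t = processed + [Cp for Cp in P_prev
                           if not (Cp["members"] & covered)]
        hierarchy.append([c["members"] for c in P_t])
        P_prev = P_t
    return hierarchy
```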
High-level analysis. We will prove that the partitions \(P^{(1)},\ldots , P^{(\ell)}\) solve the original hierarchical correlation clustering problem within a constant factor.
Using triangle inequality, we are going to show that the switch in Step 1, from the input edge sets
\(E^{(1)},\ldots ,E^{(\ell)}\) to the partitions
\(Q^{(1)},\ldots ,Q^{(\ell)}\), costs us no more than the approximation factor of the correlation clustering algorithm used to generate each partition. This then becomes a multiplicative factor on our approximation factor for hierarchical cluster agreement.
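To make the accounting concrete (our derivation from the stated guarantees; the article's own constants may differ), let \(O^{(1)},\ldots ,O^{(\ell)}\) be optimal hierarchical partitions for the input edge sets, let \(\alpha\) be the correlation clustering factor from Step 1, and let \(\beta\) be our factor for hierarchical cluster agreement. Using the triangle inequality for the symmetric difference,
\[\sum _t\delta ^{(t)}\bigl |E^{(t)}\triangle E(P^{(t)})\bigr |\le \sum _t\delta ^{(t)}\bigl |E^{(t)}\triangle E(Q^{(t)})\bigr |+\sum _t\delta ^{(t)}\bigl |E(Q^{(t)})\triangle E(P^{(t)})\bigr |\le \bigl (\alpha +(\alpha +1)\,\beta \bigr)\cdot \sum _t\delta ^{(t)}\bigl |E^{(t)}\triangle E(O^{(t)})\bigr |,\]
since \(|E^{(t)}\triangle E(Q^{(t)})|\le \alpha \,|E^{(t)}\triangle E(O^{(t)})|\) for each level (each \(O^{(t)}\) is a feasible clustering for level \(t\)), and the hierarchical cluster agreement guarantee bounds the second sum by \(\beta \sum _t\delta ^{(t)}|E(Q^{(t)})\triangle E(O^{(t)})|\le \beta \,(\alpha +1)\sum _t\delta ^{(t)}|E^{(t)}\triangle E(O^{(t)})|\).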
We will even show that we can work with the LP from Reference [
3] for the original hierarchical correlation clustering problem, and get a solution demonstrating a constant factor integrality gap.