Abstract
In a hierarchical clustering problem the task is to compute a series of mutually compatible clusterings of a finite metric space \((P,{{\,\textrm{dist}\,}})\). Starting with the clustering where every point forms its own cluster, one iteratively merges two clusters until only one cluster remains. Complete linkage is a well-known and popular algorithm to compute such clusterings: in every step it merges the two clusters whose union has the smallest radius (or diameter) among all currently possible merges. We prove that the radius (or diameter) of every k-clustering computed by complete linkage is at most a factor of O(k) (or \(O(k^{\ln (3)/\ln (2)})=O(k^{1.59})\)) worse than that of an optimal k-clustering minimizing the radius (or diameter). Furthermore, we give a negative answer to the question posed by Dasgupta and Long (J Comput Syst Sci 70(4):555–569, 2005. https://doi.org/10.1016/j.jcss.2004.10.006), who show a lower bound of \(\Omega (\log (k))\) and ask if the approximation guarantee is in fact \(\Theta (\log (k))\). We present instances where complete linkage performs poorly in the sense that the k-clustering computed by complete linkage is off by a factor of \(\Omega (k)\) from an optimal solution for radius and diameter. We conclude that in general metric spaces complete linkage does not perform asymptotically better than single linkage, which merges the two clusters with smallest inter-cluster distance, and for which we prove an approximation guarantee of O(k).
1 Introduction
The k-clustering problem asks for a partition of a point set in a metric space into k subsets (or clusters). To measure whether the data is clustered well, one option is to pick a center for every cluster and compute the maximum distance between a point and the center of its cluster. This objective is to be minimized and is known as k-center. A problem which is independent of the choice of centers is the k-diameter problem, where we want to minimize the maximum distance between two points lying in the same cluster. Observe that k-center and k-diameter are related to each other in the sense that for a fixed set P the cost of an optimal k-diameter clustering on P is at most twice the cost of an optimal k-center clustering, which again costs at most as much as an optimal k-diameter clustering. There are other objectives to measure the quality of a clustering where every point contributes to the cost of the clustering, for example k-median and k-means. Here we want to minimize the cost, which equals the sum of all (squared) distances between a point and the center of its cluster.
The k-center problem is NP-hard to approximate with factor \(\alpha <2\) (Hochbaum, 1984; Hsu & Nemhauser, 1979). This bound is tight, as both Gonzalez (1985) and Hochbaum and Shmoys (1985) show. Gonzalez’s 2-approximation algorithm is a simple but elegant greedy approach. Starting with an arbitrary point \(p_1 \in P\), one constructs an enumeration \(P = \{p_1, \ldots , p_{|P|}\}\) by successively choosing as \(p_{i+1}\) a point from P whose minimum distance to any point from \(\{p_1, \ldots , p_i\}\) is maximal. Assigning every point from P to its closest neighbor among \(p_1, \ldots , p_k\) (the centers) yields a 2-approximation for the k-center problem for all \(k = 1, \ldots , |P|\). One can prove that the resulting clustering is also a 2-approximation to k-diameter. Another greedy approach is the reverse greedy algorithm, which starts with all data points as centers and iteratively removes a center such that the objective stays as small as possible. This algorithm computes a \(\Theta (k)\)-approximation as shown by Hershkowitz and Kehne (2020). Observe that both greedy algorithms compute an incremental clustering where the centers of a k-clustering are also centers of an \(\ell\)-clustering if \(\ell \ge k\).
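For concreteness, the following Python sketch implements this farthest-first traversal; the function signature (a list of points and a callable metric `dist`) is our choice, not taken from the cited papers.

```python
def gonzalez(points, dist, k):
    """Farthest-first traversal (Gonzalez, 1985). Assigning every point to
    its closest center yields a 2-approximation for k-center (and for
    k-diameter), simultaneously for every prefix length k."""
    centers = [points[0]]                          # an arbitrary starting point p_1
    d = {p: dist(p, centers[0]) for p in points}   # distance to the closest center
    while len(centers) < k:
        farthest = max(points, key=d.__getitem__)  # maximizes the min-distance
        centers.append(farthest)
        for p in points:
            d[p] = min(d[p], dist(p, farthest))
    return centers
```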
Gonzalez’s algorithm (Gonzalez, 1985) allows one to compute good clusterings, even if one does not previously know an appropriate value for k. However, even successive clusterings computed by Gonzalez’s algorithm and reverse greedy can be radically different, so it can be difficult to compare them and select one that seems appropriate for the task.
Another greedy approach known as complete linkage starts with every point in its own cluster and consecutively merges two clusters whose union has the smallest radius (or diameter when considering the k-diameter objective) among all possible cluster pairs. If we proceed like this until only one cluster remains, we also obtain a k-clustering for any possible \(1 \le k \le |P|\). However, this time, the resulting clusterings are also hierarchically compatible: for all \(\ell \ge k\) the \(\ell\)-clustering is a refinement of the k-clustering. This makes it easier to compare such clusterings with each other and to choose an appropriate k-clustering. Also, this additional hierarchical structure is interesting in and of itself. Famous examples include phylogenetic trees that represent the relationship between animal species in biology.
A series of such hierarchically compatible clusterings \({\mathscr {C}}_1, \ldots , {\mathscr {C}}_{|P|}\) (with \({\mathscr {C}}_k\) being a k-clustering for all k) forms a hierarchical clustering. Complete linkage is a common and popular bottom-up approach to compute such a hierarchical clustering and can be generalized to fit any k-clustering objective, resulting in so called agglomerative clustering methods. For hierarchical k-means this is Ward’s method (Ward, 1963).
To evaluate a hierarchical clustering \({\mathscr {C}}_1, \ldots , {\mathscr {C}}_{|P|}\) it is common to refer to the underlying k-clusterings: it is an \(\alpha\)-approximation if the cost of \({\mathscr {C}}_k\) is at most \(\alpha\) times that of an optimal k-clustering for all \(1\le k\le |P|\).
1.1 Related work
For hierarchical k-center and k-diameter, constant factor approximations are known. For both problems, Dasgupta and Long (2005) and Charikar et al. (2004) give a polynomial-time 8-approximation. The concept of nesting to solve hierarchical and incremental problems was introduced in Lin et al. (2010). Using this technique, every approximation algorithm for a k-clustering objective that satisfies a certain nesting property, which they define, can be converted into an algorithm for its hierarchical version. In particular, k-median and k-means satisfy this property, and thus, in combination with the currently best constant factor approximations for k-median (Byrka et al., 2017) and k-means (Ahmadian et al., 2020), polynomial-time constant factor approximations do indeed exist for the hierarchical k-median/k-means problem. Nesting can also be applied to k-center/k-diameter but does not improve upon the 8-approximation.
As optimal k-clusterings are not necessarily hierarchically compatible, 1-approximations do not exist in general, even assuming unlimited computational power. There exists an instance for the diameter and an instance for the radius where the best hierarchical clustering is a \((3+2\sqrt{2})\)-approximation and a 4-approximation, respectively (Arutyunova & Röglin, 2022). Furthermore, there exists a \((3+2\sqrt{2})\)-approximation for diameter (Arutyunova & Röglin, 2022; Bock, 2022) and a 4-approximation for radius (Großwendt, 2020), but it is not clear whether the respective hierarchical clusterings can be computed in polynomial time.
However, greedy algorithms are more common in practical applications. There exist several theoretical results on upper and lower bounds on the approximation factor of complete linkage. For metrics induced by norms on \({\mathbb {R}}^d\), especially the Euclidean metric, Ackermann et al. (2014) prove that complete linkage computes an \(O(\log (k))\)-approximation for both the k-center and the k-diameter objective, assuming the dimension d to be constant. This was later improved to O(1) (Großwendt & Röglin, 2017). Both works distinguish between two variants of k-center: one where centers must be from the set P and one where they can be arbitrary points chosen from the whole space \({\mathbb {R}}^d\). In the first case the approximation factor shown in Ackermann et al. (2014) and Großwendt and Röglin (2017) depends linearly on d and in the second case exponentially on d. For the k-diameter problem it even depends doubly exponentially on the dimension. Furthermore, Ackermann et al. (2014) prove for the \(l_p\)-metric with \(1\le p<\infty\) a lower bound of \(\Omega (\root p \of {\log (d)})\) for complete linkage for k-diameter and for k-center with centers chosen from P.
Little is known about the approximation factor of complete linkage in general metric spaces. The best known lower bound is in \(\Omega (\log (k))\) (Dasgupta & Long, 2005). With an approach to upper bound the increase in cost by a complete linkage merge, which we borrow from Ackermann et al. (2014), we obtain in a relatively straightforward manner an upper bound of \(O(\log (|P|-k))\) for complete linkage for k-center.
There exist few results for agglomerative clustering regarding other objectives. Großwendt et al. (2019) analyze Ward’s method for k-means and show that if the clusters of an optimal k-means clustering are sufficiently far apart, Ward’s method computes a 2-approximation and, under some additional assumptions, in fact reconstructs the optimal clustering.
1.2 Our results
We study upper and lower bounds for the complete linkage algorithm in general metric spaces for the k-center and k-diameter objective. For k-center in general metric spaces it is reasonable to assume that centers can only be drawn from P, and thus we only consider this variant. Our main results are:
-
A lower bound of \(\Omega (k)\) for complete linkage for k-center and k-diameter, which improves the currently highest lower bound of \(\Omega (\log (k))\) (Dasgupta & Long, 2005) significantly.
-
An upper bound of O(k) for k-center and an upper bound of \(O(k^{\ln (3)/\ln (2)})=O(k^{1.59})\) for k-diameter, which are to the best of the authors’ knowledge the first non-trivial upper bounds for complete linkage in general metric spaces.
The lower bound \(\Omega (k)\) is surprising as it shows that complete linkage does not perform asymptotically better than single linkage, which merges the two clusters with smallest distance to each other (the distance of two clusters is the smallest distance between two of their points). We know that the lower bound for single linkage is in \(\Omega (k)\) (Dasgupta & Long, 2005) and we show that the approximation factor is in fact \(\Theta (k)\). As single linkage is not designed to minimize the radius or diameter of emerging clusters, it is a natural assumption that it performs worse than complete linkage. However, our results show that this assumption is generally not true. It is even still open whether complete linkage for k-diameter performs as well as single linkage, as we are only able to prove an upper bound of \(O(k^{\ln (3)/\ln (2)})\).
In conclusion, complete linkage seems to perform as badly as single linkage for the radius. However, in the definition of the approximation factor we always consider the largest ratio between the cost of the clustering computed by complete linkage and the optimal clustering as we vary over all possible numbers of clusters. In fact, complete linkage for k-center produces reasonable results for most values of k, especially when compared to single linkage. To go past the worst-case definition of an approximation factor we therefore consider what approximation factor is achieved by both clustering methods on average. We define the average approximation factor of a hierarchical clustering as the average of all ratios between the cost of a k-clustering and an optimal k-clustering. We show that the average approximation factor of complete linkage for the radius is in \(O(\log (n))\) while the average approximation factor of single linkage is in \(\Omega (n)\). Thus complete linkage for k-center produces a better hierarchical clustering on average than single linkage.
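Written out in the notation of Sect. 2, the average approximation factor of a hierarchical clustering \(({\mathscr {C}}_k)_{k=1}^{n}\) is the quantity

$$\begin{aligned} \frac{1}{n}\sum _{k=1}^{n}\frac{{{\,\textrm{cost}\,}}({\mathscr {C}}_k)}{{{\,\textrm{cost}\,}}({\mathscr {O}}_k)}, \end{aligned}$$

where the convention of reading a ratio 0/0 as 1 (which occurs for \(k=n\), where both clusterings consist of singletons) is ours.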
1.3 Techniques
One of the biggest and most well-known issues concerning single linkage is that of chaining. If there is a sequence of points \(x_1, \ldots , x_k \in P\) with \({{\,\textrm{dist}\,}}(x_i, x_{i+1})\) relatively small for all i, then single linkage might merge all of them together, despite the resulting cluster being quite large. Dasgupta and Long show with their lower bound of \(\Omega (\log k)\) that a similar process of chaining can also occur when executing complete linkage. They give the example of points placed on a regular \((k \times k)\)-grid with a spacing of 1. The distance is given by the sum of the discrete metric on the horizontal axis and the logarithm of the vertical distance. That is, \({{\,\textrm{dist}\,}}((x, y), (x', y')) = {\textbf{1}}_{x \ne x'} + \log _2(1 + |y - y'|)\). Now, although an optimal clustering just consists of the individual rows of the grid, complete linkage might reproduce the columns instead (assuming that k is a power of 2): iteratively go from top to bottom and merge vertically neighboring clusters. Every such iteration halves the number of clusters and, due to the logarithm, only increases the cost by 1, just as when merging along the rows. Of course, we would have to pay only once to merge horizontally, whereas we have to pay \(\log _2 k\) times to merge vertically, but complete linkage cannot distinguish between these two cases. In fact, one can shift the vertical placement by arbitrarily small values to ensure that complete linkage always chooses the bad case.
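A small Python sketch of this metric (the names are ours) makes the two competing diameters easy to verify:

```python
import math

def dl_dist(p, q):
    """Dasgupta-Long metric on the k x k grid: discrete metric horizontally,
    logarithmically scaled distance vertically."""
    (x1, y1), (x2, y2) = p, q
    return (1 if x1 != x2 else 0) + math.log2(1 + abs(y1 - y2))

def diam(C):
    return max(dl_dist(p, q) for p in C for q in C)

k = 8  # a power of 2
row = [(x, 0) for x in range(k)]  # one optimal cluster: diameter 1
col = [(0, y) for y in range(k)]  # one "bad" complete linkage cluster
assert diam(row) == 1 and diam(col) == math.log2(k)
```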
We have to heavily modify the example to improve upon this \(\log _2 k\) factor. The fundamental problem is this: a vertical merge is only allowed to increase the cost by 1 to tie it with any horizontal merge, whereas the number of rows occupied by a cluster (and thus its diameter) doubles. We raise the lower bound by constructing an instance on which complete linkage iteratively merges diagonally shifted clusters. This process of merging clusters is much slower and does not require us to introduce a logarithmic scaling: merging one such cluster into the other incurs a cost of 1, while at the same time increasing the number of occupied rows only by one. The instance that we describe later is successively built from smaller components that exhibit exactly this behaviour, while ensuring that any such merge does not pay for the whole row.
Following Ackermann et al. (2014), one can show for complete linkage an upper bound of \(O(\log (|P|-k))\) for k-center. This comes from the following easy property, which is true for the radius but cannot be transferred to the diameter: Suppose the optimal k-center solution \({\mathscr {O}}\) has cost x. In a complete linkage clustering consisting of more than k clusters, two of its centers must lie in the same optimal cluster and therefore are at distance \(\le 2x\) to each other. Thus the merge that is performed by complete linkage increases the cost by at most 2x. However, if we replace k-center by k-diameter we see that the cost is more than doubled in the worst case (see Fig. 1), which is not enough to obtain an upper bound polynomial in k. Thus we introduce another perspective on the cost of a cluster. A cluster is good if its cost is small enough in comparison to the number of optimal clusters from \({\mathscr {O}}\) which it intersects. As \({\mathscr {O}}\) consists of k clusters this already implies a sufficiently small upper bound for good clusters. For all remaining clusters we show that their number is small enough. This approach leads to an upper bound of \(O(k^{\ln (3)/\ln (2)})\) for k-diameter and, in combination with the \(\log (|P|-k)\) upper bound, an upper bound of O(k) for k-center.
2 Preliminaries
Let \((P,{{\,\textrm{dist}\,}})\) be a metric space with n points and \(1\le k \le n\). The k-center problem asks for a partition of P into k clusters \({\mathscr {C}}=\{C_1,\ldots , C_k\}\). The cost of cluster \(C_i\) is given by \({{\,\textrm{cost}\,}}(C_i)=\min _{c\in C_i}\max _{x\in C_i}{{\,\textrm{dist}\,}}(x,c)\) while the cost of the clustering \({\mathscr {C}}\) is \({{\,\textrm{cost}\,}}({\mathscr {C}})=\max _{i=1,\ldots ,k}{{\,\textrm{cost}\,}}(C_i)\) and is to be minimized.
In the k-diameter problem we also have to find a partition of P into k clusters \({\mathscr {C}}=\{C_1,\ldots , C_k\}\) and minimize the overall cost. However, we replace the cost of a cluster \(C_i\) by \({{\,\textrm{cost}\,}}(C_i)=\max _{x,y\in C_i} {{\,\textrm{dist}\,}}(x,y).\) For both the k-center and the k-diameter problem we denote by \({\mathscr {O}}_k\) an arbitrary but fixed optimal clustering.
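To make the two objectives concrete, here is a direct Python transcription (a quadratic brute-force sketch; the function names are ours, not standard):

```python
def radius(C, dist):
    """k-center cost of a single cluster: the best choice of a center
    c in C minimizing the maximum distance from c to the cluster."""
    return min(max(dist(c, x) for x in C) for c in C)

def diameter(C, dist):
    """k-diameter cost of a single cluster: the largest pairwise distance."""
    return max(dist(x, y) for x in C for y in C)

def cost(clustering, dist, cluster_cost):
    """Cost of a clustering: the cost of its most expensive cluster."""
    return max(cluster_cost(C, dist) for C in clustering)
```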
We study the hierarchical version of the above problems, where we ask for a k-clustering \({\mathscr {C}}_k\) of P for every \(1\le k\le n\). The clusterings must be hierarchically compatible, which means that \({\mathscr {C}}_{k-1}\) is obtained from \({\mathscr {C}}_k\) by merging two of its clusters, i.e., for all \(2\le k \le n\) there are \(A,B\in {\mathscr {C}}_k\) such that \({\mathscr {C}}_{k-1}={\mathscr {C}}_k\backslash \{A,B\}\cup \{A\cup B\}.\) A sequence of such k-clusterings \(({\mathscr {C}}_k)_{k=1}^{n}\) is called a hierarchical clustering. We say that it is an \(\alpha\)-approximation if \({{\,\textrm{cost}\,}}({\mathscr {C}}_k)\le \alpha {{\,\textrm{cost}\,}}({\mathscr {O}}_k)\) for all \(1\le k\le n\). Thus the task is to find a hierarchical clustering which is a good approximation to the optimal solution on every level k.
A common class of approaches for computing such hierarchical clusterings are agglomerative linkage algorithms. As outlined above, a hierarchical clustering can be computed in a bottom-up fashion, where pairs of clusters are merged successively. Agglomerative linkage procedures do exactly that, with the choice of clusters to be merged at every step given by a linkage function. Such a linkage function maps all possible pairs of disjoint clusters to \({\mathbb {R}}_+\) and the algorithm chooses one pair that minimizes this value: Suppose that we have already constructed \({\mathscr {C}}_k\) and are using the linkage function f. Then \({\mathscr {C}}_{k-1}\) is given by merging a pair \(A \ne B \in {\mathscr {C}}_k\) with \(f(A, B) = \min _{A' \ne B' \in {\mathscr {C}}_k} f(A', B')\). As already stated, the two linkage functions we are interested in are the following (a short code sketch follows the list):
-
Single linkage: \((A, B) \mapsto {{\,\textrm{dist}\,}}(A, B) = \min _{(a, b) \in A \times B} {{\,\textrm{dist}\,}}(a, b)\).
-
Complete linkage: \((A, B) \mapsto {{\,\textrm{cost}\,}}(A \cup B)\).
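As a concrete illustration, here is a minimal Python sketch of the generic agglomerative scheme with both linkage functions; it makes no attempt at efficiency, reuses the `diameter` helper from above, and all names are our own:

```python
from itertools import combinations

def single_linkage(A, B, dist):
    # smallest distance between a point of A and a point of B
    return min(dist(a, b) for a in A for b in B)

def complete_linkage(A, B, dist):
    # cost of the union; use radius(A | B, dist) for the k-center variant
    return diameter(A | B, dist)

def agglomerative(points, dist, linkage):
    """Returns the hierarchical clustering as the list [C_n, ..., C_1]."""
    clustering = [frozenset({p}) for p in points]
    hierarchy = [list(clustering)]
    while len(clustering) > 1:
        # merge a pair of clusters minimizing the linkage function
        A, B = min(combinations(clustering, 2),
                   key=lambda AB: linkage(AB[0], AB[1], dist))
        clustering = [C for C in clustering if C not in (A, B)] + [A | B]
        hierarchy.append(list(clustering))
    return hierarchy
```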
To analyze the performance of the respective agglomerative algorithms we often consider the smallest clustering from \(({\mathscr {C}}_k)_{k=1}^{n}\) (in terms of the number of clusters) whose cost does not exceed a given bound. This perspective is already used in Großwendt and Röglin (2017) and allows a better handling of the cost. For any \(x \ge 0\) let \(t_{\le x} = \min \{k \mid {{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le x\}\) and set \({\mathscr {H}}_x={\mathscr {C}}_{t_{\le x}}\). Observe that \({\mathscr {H}}_x\) is the smallest clustering from \(({\mathscr {C}}_k)_{k=1}^{n}\) with cost at most x. Thus it has the useful property that every merge of two clusters in \({\mathscr {H}}_x\) results in a clustering of cost more than x. Furthermore, for a cluster \(C \subseteq P\) and an optimal k-clustering \({\mathscr {O}}= {\mathscr {O}}_k\) we denote by \({\mathscr {O}}_C = \{O \in {\mathscr {O}}\, | \, O \cap C \ne \emptyset \}\) the set of all optimal k-clusters hit by C.
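Continuing the Python sketches from above (the helpers `cost` and `agglomerative` as well as the name `H` are ours), \({\mathscr {H}}_x\) is simply the coarsest clustering in the hierarchy whose cost is still at most x:

```python
def H(hierarchy, dist, cluster_cost, x):
    """Smallest clustering (fewest clusters) of cost at most x."""
    feasible = [C for C in hierarchy if cost(C, dist, cluster_cost) <= x]
    return min(feasible, key=len)  # non-empty: the singleton clustering costs 0
```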
3 Approximation guarantee of single linkage
As outlined in Dasgupta and Long (2005) there are clustering instances where single linkage builds chains yielding the lower bound \(\Omega (k)\) on the approximation factor. We show that this is the worst case scenario, as in fact single linkage computes an O(k)-approximation for hierarchical k-center/k-diameter.
Let \(({\mathscr {C}}_k)_{k=1}^n\) be the hierarchical clustering computed by single linkage on \((P,{{\,\textrm{dist}\,}})\). Recall that \({\mathscr {C}}_{k-1}\) arises from \({\mathscr {C}}_{k}\) by merging two clusters \(A, B \in {\mathscr {C}}_{k}\) that minimize \({{\,\textrm{dist}\,}}(A, B)\).
We first compare the radius of \({\mathscr {C}}_k\) to the cost of an optimal k-center clustering \({\mathscr {O}}\). We introduce a graph G whose vertices are the optimal clusters \(V(G) = {\mathscr {O}}\) and whose edges \(E(G) = \{ \{O, O'\} \subseteq {\mathscr {O}}\, | \, {{\,\textrm{dist}\,}}(O, O') \le 2{{\,\textrm{cost}\,}}({\mathscr {O}})\}\) connect all pairs of optimal clusters \(O, O' \in {\mathscr {O}}\) with distance at most twice the optimal radius.
We make a similar construction to compare the diameter of \({\mathscr {C}}_k\) to the cost of an optimal k-diameter clustering \({\mathscr {O}}'\). We consider the graph \(G'\) with \(V(G')={\mathscr {O}}'\) where two clusters in \({\mathscr {O}}'\) are connected via an edge if their distance is at most \({{\,\textrm{cost}\,}}({\mathscr {O}}')\).
To estimate the cost of a single linkage cluster \(C\in {\mathscr {C}}_k\) we look at the optimal clusters hit by C. The next lemma shows that for any two points in C we can find a path connecting them that passes through a chain of optimal clusters with distance at most \(2{{\,\textrm{cost}\,}}({\mathscr {O}})\) or \({{\,\textrm{cost}\,}}({\mathscr {O}}')\) when considering the radius or diameter, respectively. One can already anticipate that this gives an upper bound of O(k) on the radius or diameter of any such cluster C. In Fig. 2 we see an example of such a cluster C and the optimal clusters hit by C.
Lemma 1
Let \(C \in {\mathscr {C}}_t\) be a cluster computed by single linkage at a time step \(t \ge k\). Then the graphs \(G[{\mathscr {O}}_C]\) and \(G'[{\mathscr {O}}'_C]\) induced by the vertex sets of optimal clusters hit by C are connected.
Proof
We prove the lemma for \(G[{\mathscr {O}}_C]\) by induction. At the beginning (\(t = n\)) the lemma obviously holds, since any cluster contained in \({\mathscr {C}}_n\) is a single point and thus hits only one optimal cluster. Assume now that the claim holds for \(t > k\). By the pigeonhole principle there must exist two clusters \(C, C' \in {\mathscr {C}}_t\) with two points \(c \in C\) and \(c' \in C'\) lying in the same optimal cluster \(O \in {\mathscr {O}}\). We know that \({{\,\textrm{dist}\,}}(C, C') \le 2{{\,\textrm{cost}\,}}(O) \le 2{{\,\textrm{cost}\,}}({\mathscr {O}})\). But this value is exactly the objective that single linkage minimizes, so we know in particular that this upper bound also holds for the distance between the clusters \(D, D'\) chosen by single linkage. In particular, the two optimal clusters containing the points that realize \({{\,\textrm{dist}\,}}(D, D')\) are connected by an edge in G. Combining this with the induction hypothesis that both \(G[{\mathscr {O}}_D]\) and \(G[{\mathscr {O}}_{D'}]\) are connected finishes the proof. One proves analogously that \(G'[{\mathscr {O}}'_C]\) is connected. \(\square\)
As we see in Fig. 2 this already yields an upper bound of \(2k{{\,\textrm{cost}\,}}({\mathscr {O}}')\) on the diameter of C. We estimate the radius of C by looking at the paths going through optimal clusters in \({\mathscr {O}}_C\) that are at distance at most \(2{{\,\textrm{cost}\,}}({\mathscr {O}})\) from one another. Choosing the center appropriately and uncoiling these paths in our original space P yields our upper bound of \((2k+2){{\,\textrm{cost}\,}}({\mathscr {O}})\).
Theorem 1
Let \(({\mathscr {C}}_k)_{k=1}^{n}\) be the hierarchical clustering computed by single linkage on \((P,{{\,\textrm{dist}\,}})\) and let \({\mathscr {O}}_k\) be an optimal clustering for k-center or k-diameter, respectively. We have for all \(1\le k\le n\)
-
1.
\({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le (2k + 2) \cdot {{\,\textrm{cost}\,}}({\mathscr {O}}_k)\) for the k-center cost
-
2.
\({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le 2k \cdot {{\,\textrm{cost}\,}}({\mathscr {O}}_k)\) for the k-diameter cost.
Proof
We prove the statement for k-center. Fix an arbitrary time step \(1 \le k \le n\) and denote \({\mathscr {O}}= {\mathscr {O}}_k\). Let \(C \in {\mathscr {C}}_k\) be an arbitrary cluster and \(\pi\) a longest simple path in \(G[{\mathscr {O}}_C]\). Choose as center for C an arbitrary point \(c \in C \cap O\) from an optimal cluster O lying in the middle of \(\pi\). Note that by this choice every other vertex in \(G[{\mathscr {O}}_C]\) is reachable from O by a path of length at most \(\frac{k}{2}\). Uncoiling such paths in P gives us an upper bound of \(2(k+1){{\,\textrm{cost}\,}}({\mathscr {O}})\) for the distance between c and any other point \(z \in C\) as follows: If \(O_z \in {\mathscr {O}}\) is the optimal cluster containing z, then by choice of O, there exists a path \(O = O_1, \ldots , O_{\ell + 1} = O_z\) in \(G[{\mathscr {O}}_C]\) of length \(\ell \le \frac{k}{2}\) connecting them. That means, for each \(i = 1, \ldots , \ell\) there exist points \(x_i \in O_i, y_{i+1} \in O_{i+1}\) such that \({{\,\textrm{dist}\,}}(x_i, y_{i+1}) \le 2{{\,\textrm{cost}\,}}({\mathscr {O}})\). Hence

$$\begin{aligned} {{\,\textrm{dist}\,}}(c,z)\le {{\,\textrm{dist}\,}}(c,x_1)+\sum _{i=1}^{\ell }{{\,\textrm{dist}\,}}(x_i,y_{i+1})+\sum _{i=2}^{\ell }{{\,\textrm{dist}\,}}(y_i,x_i)+{{\,\textrm{dist}\,}}(y_{\ell +1},z)\le (2\ell +1)\cdot 2{{\,\textrm{cost}\,}}({\mathscr {O}})\le 2(k+1){{\,\textrm{cost}\,}}({\mathscr {O}}). \end{aligned}$$
Using Lemma 1 one proves the statement for k-diameter analogously. \(\square\)
4 Lower bounds for complete linkage
In the following we show that complete linkage performs asymptotically as badly as single linkage in the worst case. That is, for every \(k \in {\mathbb {N}}\) we provide an instance \(P_k\) on which the diameter and radius of a k-clustering computed by complete linkage are off by a factor of \(\Omega (k)\) from the cost of an optimal solution. This improves upon the previously known lower bound of \(\Omega (\log _2 (k))\) established by Dasgupta and Long. Recall from the introduction that one of the big problems preventing an improved lower bound was that any horizontal merge already paid for all the involved rows. As such, for the worst case, one was only allowed to merge vertically, but this can be done at most \(\log _2(k)\) times. We improve upon this by inductively constructing an instance from smaller components that are diagonally shifted to produce bigger ones. Merging two such diagonally shifted components incurs an additional cost of 1, while ensuring at the same time that this does not pay for any future merges of parallel components.
A k-component \(K_k = (G_k, \phi _k)\) is a combination of a graph \(G_k=(V_k,E_k)\) and a mapping \(\phi _k: V_k \rightarrow \{1, \ldots , k\}\). The mapping is necessary for the construction of the component and later on determines an optimal k-clustering on \(P_k\). We refer to \(\phi _k(x)\) as the level of x. The other part of the component is an undirected graph \(G_k\), referred to as a k-graph, on \(2^{k-1}\) points with edge weights in \({\mathbb {N}}\) that describe the distances between the levels.
The 1-component \(K_1\) consists of a single point x with \(\phi _1(x) = 1\). All higher components are constructed inductively from this 1-component. Given the \((k-1)\)-component \(K_{k-1}\) we construct \(K_k\) as follows: Let \(K_{k-1}^{(0)}\) and \(K_{k-1}^{(1)}\) be two copies of the \((k-1)\)-component \(K_{k-1}\). For the k-graph \(G_k\) we first take the disjoint union of the graphs \(G_{k-1}^{(0)}\) and \(G_{k-1}^{(1)}\). This already yields all the points of \(G_k\). For the k-mapping \(\phi _k\) we set \(\phi _k(x) = \phi _{k-1}^{(i)}(x) + i\) for \(x \in V(G_{k-1}^{(i)}) \subset V(G_k)\). That is, in the first copy the levels stay the same, whereas in the second all levels are shifted by 1. Finally, to complete \(G_k\), we add one edge of weight \(k-1\) from the unique point \(s \in V(G_k)\) with \(\phi _k(s) = 1\) to the unique point \(t \in V(G_k)\) with \(\phi _k(t) = k\). The progression of the first five components is given in Fig. 3.
The instance \(P_k\) is now constructed from the k-component as follows: Let \(K_k^{(1)}, \ldots , K_k^{(k+1)}\) be \(k+1\) copies of \(K_k\). Take the disjoint union of the corresponding k-graphs \(G_k^{(1)}, \ldots , G_k^{(k+1)}\) and connect them by adding edges \(\{x, y\}\) of weight 1 for every two points \(x \in V(G_k^{(i)})\) and \(y \in V(G_k^{(j)})\) with \(\phi _k^{(i)}(x) = \phi _k^{(j)}(y)\). Note that the sets of points from the same level constitute cliques of diameter and radius 1 and form an optimal solution of cost 1. To simplify notation we omit the indices and write \(\phi _k(x)\) to denote the level of a point \(x \in V(G_k^{(j)}) \subset V(P_k)\). The distance between two points in \(V(P_k)\) is given by the length of a shortest path.
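The recursive construction translates directly into a short Python sketch (the names and the representation are ours; that points of the same level inside a single copy are also joined by weight-1 edges is our reading of the clique remark above):

```python
import itertools

def component(k, ids):
    """Builds the k-component K_k; returns (edges, levels), where edges maps
    point pairs to weights and levels maps each point to its level in 1..k."""
    if k == 1:
        p = next(ids)
        return {}, {p: 1}
    edges, levels = {}, {}
    for shift in (0, 1):  # two copies of K_{k-1}, the second shifted by one level
        e, lv = component(k - 1, ids)
        edges.update(e)
        levels.update({p: l + shift for p, l in lv.items()})
    s = next(p for p, l in levels.items() if l == 1)  # unique bottom point
    t = next(p for p, l in levels.items() if l == k)  # unique top point
    edges[(s, t)] = k - 1
    return edges, levels

def instance(k):
    """P_k: k+1 copies of K_k plus weight-1 edges between same-level points."""
    ids = itertools.count()
    edges, levels = {}, {}
    for _ in range(k + 1):
        e, lv = component(k, ids)
        edges.update(e)
        levels.update(lv)
    for p, q in itertools.combinations(levels, 2):
        if levels[p] == levels[q]:
            edges[(p, q)] = 1
    return edges, levels

def all_dists(edges, levels):
    """Shortest-path metric via Floyd-Warshall (fine for small k)."""
    pts = list(levels)
    d = {(p, q): 0.0 if p == q else float("inf") for p in pts for q in pts}
    for (p, q), w in edges.items():
        d[p, q] = d[q, p] = min(d[p, q], w)
    for m in pts:
        for p in pts:
            for q in pts:
                if d[p, m] + d[m, q] < d[p, q]:
                    d[p, q] = d[q, p] = d[p, m] + d[m, q]
    return d

edges, levels = instance(4)  # 5 copies of K_4 on 8 points each
d = all_dists(edges, levels)
# every level forms a clique of diameter 1, so the optimal 4-clustering costs 1
assert all(d[p, q] <= 1 for p in levels for q in levels
           if p != q and levels[p] == levels[q])
```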
Let \(({\mathscr {C}}_{k'})_{k'=1}^{n}\) be the hierarchical clustering produced by complete linkage on \((V(P_k), {{\,\textrm{dist}\,}})\) minimizing the radius or diameter. Recall that \({\mathscr {C}}_{k'-1}\) arises from \({\mathscr {C}}_{k'}\) by merging two clusters \(A, B\in {\mathscr {C}}_{k'}\) that minimize the radius or diameter of \(A\cup B\). Remember that \(t_{\le x}=\min \{k'\mid {{\,\textrm{cost}\,}}({\mathscr {C}}_{k'})\le x\}\) and that \({\mathscr {H}}_x={\mathscr {C}}_{t_{\le x}}\) denotes the smallest clustering with cost at most x. We show in the following two subsections that \({\mathscr {H}}_{k - 1}\) consists exactly of the \(k+1\) different k-graphs that make up the instance, resulting in the following theorem.
Theorem 2
For every \(k\in {\mathbb {N}}\) there exists an instance \((V(P_k),{{\,\textrm{dist}\,}})\) on which complete linkage, minimizing either diameter or radius, computes a solution of diameter k or radius \(\frac{k}{2}\), respectively, whereas the cost of an optimal solution is 1.
4.1 A lower bound for diameter-based cost
We start with the analysis for diameter-based costs and after that move on to radius-based costs.
Lemma 2
The distance between two points \(x, y \in V(P_k)\) is at least as big as the difference in levels \(|\phi _k(x) - \phi _k(y)|\).
Proof
By the inductive construction of the components, an edge of weight w can cross at most w levels. Hence the distance between x and y is at least \(|\phi _k(x) - \phi _k(y)|\). \(\square\)
Consider an \(\ell\)-graph \(G_{\ell }\). Instead of talking about the cluster \(V(G_{\ell })\) in \((V(P_k),{{\,\textrm{dist}\,}})\) we slightly abuse our notation and view \(G_{\ell }\) as a cluster with \({{\,\textrm{cost}\,}}(G_{\ell })=\max _{x,y\in V(G_{\ell })}{{\,\textrm{dist}\,}}(x,y)\), i.e., the diameter of \(V(G_{\ell })\). Using the previous lemma we can show inductively that the diameter of any \(\ell\)-graph in \(P_k\) is \(\ell - 1\).
Lemma 3
Let \(G_\ell\) be an \(\ell\)-graph contained in \(P_k\). We have \({{\,\textrm{cost}\,}}(G_\ell )=\ell - 1\).
Proof
We prove the upper bound \({{\,\textrm{cost}\,}}(G_\ell ) \le \ell - 1\) by induction. The 1-graphs are points and so the claim follows trivially for \(\ell = 1\). Assume now that we have shown the claim for \(\ell -1\). Let \(s,t \in V(G_\ell )\) be points such that \({{\,\textrm{dist}\,}}(s,t) = {{\,\textrm{cost}\,}}(G_\ell )\). If these points lie in the same graph, say \(G_{\ell -1}^{(0)}\), of the two \((\ell -1)\)-graphs \(G_{\ell -1}^{(0)}\) and \(G_{\ell -1}^{(1)}\) that make up \(G_\ell\), then

$$\begin{aligned} {{\,\textrm{cost}\,}}(G_\ell )={{\,\textrm{dist}\,}}(s,t)\le {{\,\textrm{cost}\,}}\big (G_{\ell -1}^{(0)}\big )\le \ell -2\le \ell -1 \end{aligned}$$
by induction and we are done. Otherwise we may assume that \(s\in V(G_{\ell -1}^{(0)})\) and \(t\in V(G_{\ell -1}^{(1)})\). This leaves us with another case analysis. If s is the unique point with level 1 and t is the unique point in level \(\ell\) in \(G_{\ell }\) then we are again done, since by construction there exists an edge between s and t of weight \(\ell - 1\). Otherwise one of s or t must share a level with a point not in the same \((\ell - 1)\)-graph as themselves. Without loss of generality we may assume that s lies in the same level as some \(u\in V(G_{\ell -1}^{(1)})\), so \({{\,\textrm{dist}\,}}(s,u)\le 1\). By induction \({{\,\textrm{dist}\,}}(u,t) \le \ell - 2\) and so

$$\begin{aligned} {{\,\textrm{dist}\,}}(s,t)\le {{\,\textrm{dist}\,}}(s,u)+{{\,\textrm{dist}\,}}(u,t)\le 1+(\ell -2)=\ell -1. \end{aligned}$$
This concludes the proof of the upper bound \({{\,\textrm{cost}\,}}(G_{\ell }) \le \ell - 1\).
To see the lower bound \({{\,\textrm{cost}\,}}(G_{\ell }) \ge \ell - 1\), we apply Lemma 2 to the unique point s with level 1 and the unique point t with level \(\ell\) in \(G_{\ell }\). This shows that \({{\,\textrm{cost}\,}}(G_{\ell }) \ge {{\,\textrm{dist}\,}}(s,t) \ge \ell - 1\). \(\square\)
The goal now is to show that complete linkage actually reconstructs these graphs as clusters. We already computed the cost of an \(\ell\)-graph and now it is left to observe that merging two \(\ell\)-graphs costs at least \(\ell\).
Lemma 4
Complete linkage might merge clusters on \((V(P_k),{{\,\textrm{dist}\,}})\) in such a way that for all \(\ell \le k\), the clustering \({\mathscr {H}}_{\ell -1}\) consists exactly of the \(\ell\)-graphs that make up \(P_k\).
Proof
We again prove the claim by induction. Complete linkage always starts with every point in a separate cluster. Since those are exactly the 1-graphs and any merge costs at least 1, the claim follows for \(\ell = 1\). Suppose now that \({\mathscr {H}}_{\ell - 1}\) consists exactly of the \(\ell\)-graphs of the instance. Since we are dealing with integer weights, any new merge increases the cost by at least 1 and so we may merge all pairs of \(\ell\)-graphs that form the \((\ell +1)\)-graphs. These are cheapest merges as they altogether increase the cost from \(\ell -1\) to \(\ell\) (see Lemma 3). To finish the proof we are left to show that at this point no merge of cost at most \(\ell\) remains. Take any two \((\ell + 1)\)-graphs \(G_{\ell + 1} \ne G_{\ell + 1}'\) contained in the current clustering. If they do not exactly cover the same levels, then the distance between the point in the lowest level and the point in the highest level is strictly more than \(\ell\) by Lemma 2. Hence, we can assume that they share the same levels, say level \(\lambda\) up to level \(\ell + \lambda\). Denote by s the unique point in \(V(G_{\ell + 1})\) with \(\phi _k(s) = \lambda\) and by t the unique point in \(V(G_{\ell + 1}')\) with \(\phi _k(t) = \ell + \lambda\). A shortest path connecting s and t must contain an edge \(\{u,w\}\) with \(u\in V(G_{\ell +1})\) and \(w\in V(P_k)\backslash V(G_{\ell +1})\). Such an edge either has weight at least \(\ell +1\), or it has weight 1 and connects points in the same level, i.e., \(\phi _k(u)=\phi _k(w)\). In the first case we directly obtain \({{\,\textrm{dist}\,}}(s,t)\ge \ell +1\). In the second case we use Lemma 2 and obtain

$$\begin{aligned} {{\,\textrm{dist}\,}}(s,t)\ge |\phi _k(s)-\phi _k(u)|+1+|\phi _k(w)-\phi _k(t)|\ge |\phi _k(s)-\phi _k(t)|+1=\ell +1. \end{aligned}$$
It follows that \({\mathscr {H}}_{\ell }\) consists exactly of the \((\ell + 1)\)-graphs that make up \(P_k\). \(\square\)
Proof
(Theorem 2 (diameter)) Lemma 4 shows that \({\mathscr {H}}_{k-1}\) can consist of all the k-graphs that make up \(P_k\). There are exactly \(k+1\) of them and so there is one merge remaining to get a k-clustering. By definition of \({\mathscr {H}}_{k-1}\), this last merge increases the cost by at least 1 and so the k-clustering produced by complete linkage costs at least k, whereas the optimal clustering consisting of the k individual levels costs 1. \(\square\)
4.2 A lower bound for radius-based costs
We show that the instance \((V(P_k),{{\,\textrm{dist}\,}})\) also yields a lower bound of k/2 for radius-based costs. This requires some additional work, as we now also have to keep track of the centers that induce an optimal radius. For an \(\ell\)-graph \(G_{\ell }\) we again slightly abuse the notation and talk about \(G_{\ell }\) as a cluster with \({{\,\textrm{cost}\,}}(G_{\ell })=\min _{c\in V(G_{\ell })}\max _{x\in V(G_{\ell })}{{\,\textrm{dist}\,}}(c,x)\), the radius of \(V(G_{\ell })\).
To prove Lemma 5 we show that there is a point in \(P_k\) for which the following holds:
-
For all but one of the \(\ell\)-graphs that constitute \(G_{2\ell }\) we can find a point that we can reach by an edge of weight 1. Since the diameter of these graphs is \(\ell - 1\), this is sufficient.
-
The remaining \(\ell\)-graph lies in the same \((\ell + 1)\)-graph as our point and so we are again done by considering the diameter. Also there are no points that induce a smaller radius, since the diameter of \(G_{2\ell }\) is already \(2\ell - 1\).
Lemma 5
Let \(G_{2\ell }\) be any of the \(2\ell\)-graphs that constitute \(P_k\) for \(1 \le \ell \le \frac{k}{2}\) arbitrary. Then it holds that \({{\,\textrm{cost}\,}}(G_{2\ell }) = \ell\) and furthermore, all optimal centers that induce this cost are themselves already contained in \(G_{2\ell }\) (and not in any other \(2\ell\)-graph).
Proof
By Lemma 3 we know that the diameter of \(G_{2\ell }\) is \(2\ell -1\). Thus the radius of \(G_{2\ell }\) is at least \(\ell\). To show the upper bound of \(\ell\) suppose that \(G_{2\ell }\) covers the levels \(\lambda\) up to \(\lambda + 2\ell - 1\) in \(P_k\). Consider the unique \((\ell +1)\)-graph \(H_{\ell +1}\) contained in \(G_{2\ell }\) covering the levels \(\lambda +\ell -1\) to \(\lambda +2\ell -1\). Let c be the unique point in \(H_{\ell +1}\) with level \(\lambda +\ell -1\). By Lemma 3 the diameter of \(H_{\ell +1}\) is \(\ell\), so any point in \(H_{\ell +1}\) is at distance \(\le \ell\) to c. Consider now a point \(x\in V(G_{2\ell })\backslash V(H_{\ell +1})\) and the \(\ell\)-graph \(H_{\ell }\) containing x. We claim that \(H_{\ell }\) contains a point y with level \(\lambda +\ell -1\). If this is not true then \(H_{\ell }\) covers the levels \(\lambda +\ell\) up to \(\lambda +2\ell -1\) and therefore also contains the unique point in \(G_{2\ell }\) with level \(\lambda +2\ell -1\). This is not possible as the unique point in \(G_{2\ell }\) with level \(\lambda +2\ell -1\) is already contained in \(H_{\ell +1}\). So using that the diameter of \(H_{\ell }\) is \(\ell -1\) and \(\phi _k(c)=\phi _k(y)\) we obtain

$$\begin{aligned} {{\,\textrm{dist}\,}}(c,x)\le {{\,\textrm{dist}\,}}(c,y)+{{\,\textrm{dist}\,}}(y,x)\le 1+(\ell -1)=\ell . \end{aligned}$$
Now we prove that all optimal centers must be contained in \(G_{2\ell }\). For all points \(c\in V(P_k)\backslash V(G_{2\ell })\) we have to show that \(\max _{x' \in V(G_{2\ell })} {{\,\textrm{dist}\,}}(c, x') \ge \ell +1\). Suppose that \(\phi _k(c)\le \lambda + \ell -1\). Let x be the unique point in \(G_{2\ell }\) with level \(\lambda +2\ell -1\); we claim that \({{\,\textrm{dist}\,}}(c,x)\ge \ell +1\). Consider a shortest path between c and x and let \(\{u,w\}\) be an edge on this path with \(u\in V(P_k)\backslash V(G_{2\ell })\) and \(w\in V(G_{2\ell })\). By construction \(\{u,w\}\) either has weight at least \(2\ell\), in which case

$$\begin{aligned} {{\,\textrm{dist}\,}}(c,x)\ge 2\ell \ge \ell +1, \end{aligned}$$
or it has weight 1 and \(\phi _k(u)=\phi _k(w)\), so

$$\begin{aligned} {{\,\textrm{dist}\,}}(c,x)\ge |\phi _k(c)-\phi _k(u)|+1+|\phi _k(w)-\phi _k(x)|\ge |\phi _k(c)-\phi _k(x)|+1\ge \ell +1. \end{aligned}$$
In case \(\phi _k(c)\ge \lambda + \ell\) we can prove analogously that \({{\,\textrm{dist}\,}}(c,y)\ge \ell +1\) for the unique point y in \(G_{2\ell }\) with level \(\lambda\). This finishes the proof. \(\square\)
Now we make sure that complete linkage completely reconstructs these components. In particular we show that merging \(2\ell\)-graphs which cover the same levels increases the cost of our solution. Here we make use of the fact that sets of optimal centers for any pair of \(2\ell\)-graphs do not intersect. Lemma 6 ensures that the cost indeed increases.
Lemma 6
Let C, D be two subsets of \(V(P_k)\) with \({{\,\textrm{cost}\,}}(C)={{\,\textrm{cost}\,}}(D)\). Let Z(C) and Z(D) denote the sets of all optimal centers for C and D, respectively. If \(Z(C)\cap Z(D)=\emptyset\) then \({{\,\textrm{cost}\,}}(C\cup D)>{{\,\textrm{cost}\,}}(C)\).
Proof
Let \(x\in V(P_k)\). Since \(Z(C)\cap Z(D)=\emptyset\) this point can be an optimal center for at most one of the sets. Assume without loss of generality that \(x\notin Z(D)\). We have

$$\begin{aligned} \max _{y\in C\cup D}{{\,\textrm{dist}\,}}(y,x)\ge \max _{y\in D}{{\,\textrm{dist}\,}}(y,x)>{{\,\textrm{cost}\,}}(D)={{\,\textrm{cost}\,}}(C). \end{aligned}$$
So we have for all \(x\in V(P_k)\) that \(\max _{y\in C\cup D}{{\,\textrm{dist}\,}}(y,x)>{{\,\textrm{cost}\,}}(C)\) which proves the lemma. \(\square\)
Now, with this we can prove that the merging behavior of complete linkage reconstructs our components. Observe that Theorem 2 is an immediate consequence of Corollary 1.
Corollary 1
Complete linkage might merge clusters in \((V(P_k),{{\,\textrm{dist}\,}})\) in such a way that for \(1 \le \ell \le \frac{k}{2}\), the clustering \({\mathscr {H}}_{\ell }\) consists exactly of the \(2\ell\)-graphs that make up \(P_k\).
Proof
The proof is an analogous induction to Lemma 4. Consider the case \(\ell = 1\). The first merge increases the cost to 1. Observe by Lemma 5 that the cost of a 2-graph is 1. Furthermore, the same lemma shows that the sets of optimal centers for any pair of 2-graphs do not intersect and so, as shown in Lemma 6, any further merge necessarily has to increase the cost. Hence \({\mathscr {H}}_1\) consists exactly of the 2-graphs.
Assume now that the claim holds for \({\mathscr {H}}_\ell\). The induction step works essentially the same as the base case. Any merge will increase the cost of the solution by at least 1 by definition of \({\mathscr {H}}_{\ell }\), and so we might as well merge all \(2\ell\)-graphs that together compose a \((2\ell + 2)\)-graph, as this is a cheapest choice (Lemma 5). Furthermore, any additional merge would increase the cost to at least \(\ell + 2\) (again by Lemma 6) and so \({\mathscr {H}}_{\ell + 1}\) consists of the \((2\ell + 2)\)-graphs. \(\square\)
Notice that in our analysis we decided which clusters are merged by complete linkage whenever it has to choose between two merges of the same cost. However, with some adjustments to the instance \(P_k\) we can show a lower bound of \(\Omega (k)\) for both diameter and radius, for any behavior of complete linkage on ties. For more details we refer to Appendix A.
5 An upper bound for complete linkage
Even though complete linkage is often used when it comes to computing a hierarchical clustering, there are no known non-trivial upper bounds for its approximation guarantee in general metric spaces, to the best of the authors’ knowledge. We give an upper bound for complete linkage for hierarchical k-center and hierarchical k-diameter.
5.1 An upper bound for radius-based cost
We show that the approximation ratio of the radius of any k-clustering \({\mathscr {C}}_k\) produced by complete linkage relative to an optimal k-center clustering is in O(k).
Theorem 3
Let \(({\mathscr {C}}_k)_{k=1}^{n}\) be the hierarchical clustering computed by complete linkage on \((P,{{\,\textrm{dist}\,}})\) optimizing the radius. For all \(1 \le k \le n\) the radius \({{\,\textrm{cost}\,}}({\mathscr {C}}_k)\) is upper bounded by \(O(k){{\,\textrm{cost}\,}}({\mathscr {O}}_k)\), where \({\mathscr {O}}_k\) is an optimal k-center clustering.
To simplify the notation we fix an arbitrary k and assume that the optimal k-clustering \({\mathscr {O}}= {\mathscr {O}}_k\) has cost \({{\,\textrm{cost}\,}}({\mathscr {O}}) = \frac{1}{2}\). The latter is possible without loss of generality by scaling the metric appropriately.
We split the proof of Theorem 3 into two parts. In the first, we derive a crude upper bound for the increasing cost of clusterings produced during the execution of complete linkage. This part follows from Ackermann et al. (2014), who use the same bound to estimate the cost of a few merge steps. Proposition 1 shows that the difference in cost between \({\mathscr {C}}_k\) and \({\mathscr {C}}_t\) for \(t > k\) is at most \(\lceil \log (t - k) \rceil + 1\). That is, \({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le \lceil \log (t - k) \rceil + 1 + {{\,\textrm{cost}\,}}({\mathscr {C}}_t)\) holds for all \(1 \le k < t \le n\). A clustering \({\mathscr {C}}_t\) whose cost we can estimate directly (i.e. without referring to any other clustering) thus yields a proper upper bound for \({{\,\textrm{cost}\,}}({\mathscr {C}}_k)\). Ideally, this clustering should consist of relatively few clusters (so that \(\lceil \log (t - k) \rceil\) is small), while at the same time not being too expensive. Of course, however, these criteria oppose each other. Naively choosing the initial clustering \({\mathscr {C}}_t = {\mathscr {C}}_n\) is not good enough. Although its cost is minimal, the number of clusters is too high, only yielding an upper bound of \({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le \lceil \log (n - k) \rceil + 1\). In the second part of the proof we thus set out to find a different clustering to start from.
5.1.1 Part 1: an estimate of the relative difference in cost
When dealing with radii, any merge done by complete linkage before reaching a k-clustering increases the cost by at most \(2 {{\,\textrm{cost}\,}}({\mathscr {O}}) = 1\) (Fig. 1). This is due to the fact that the centers of two of the current clusters are contained in the same optimal cluster.
We show that complete linkage clusterings at times \(t_{\le x}\) and \(t_{\le x + 1}\) can have at most k clusters in common. All other clusters from \({\mathscr {H}}_x\) are merged in \({\mathscr {H}}_{x+1}\).
Lemma 7
For all \(x \ge 0\) the clustering \({\mathscr {H}}_{x+1}\) contains at most k clusters of cost at most x. In particular, it holds that \(|{\mathscr {H}}_{x + 1} \cap {\mathscr {H}}_x| \le k\).
Proof
Assume on the contrary that there exist \(k + 1\) pairwise different clusters \(D_1, \ldots , D_{k+1}\) at time \(t_{\le x + 1}\) of cost at most x. Denote by \(d_i \in D_i\) a point that induces the smallest radius, i.e. \({{\,\textrm{cost}\,}}(D_i) = \max _{d \in D_i} {{\,\textrm{dist}\,}}(d, d_i)\) for all i. Then two of these points, say \(d_1\) and \(d_2\), have to be contained in the same optimal cluster \(O \in {\mathscr {O}}\). Hence, we know that

$$\begin{aligned} {{\,\textrm{cost}\,}}(D_1\cup D_2)\le \max \big \{{{\,\textrm{cost}\,}}(D_1),\;{{\,\textrm{dist}\,}}(d_1,d_2)+{{\,\textrm{cost}\,}}(D_2)\big \}\le x+1, \end{aligned}$$
because \({{\,\textrm{dist}\,}}(d_1, d_2) \le 2 {{\,\textrm{cost}\,}}(O) \le 2{{\,\textrm{cost}\,}}({\mathscr {O}})=1\) and \({{\,\textrm{cost}\,}}(D_i)\le x\) for \(i=1,2\). This contradicts the definition of \({\mathscr {H}}_{x + 1}\), as \(D_1\) and \(D_2\) can still be merged without pushing the cost beyond \(x + 1\). \(\square\)
With this we can upper bound \(|{\mathscr {H}}_{x+i}|\) in terms of \(|{\mathscr {H}}_x|\) for all \(i \in {\mathbb {N}}\).
Corollary 2
For all \(i \in {\mathbb {N}}_+\) and \(x \ge 0\) it holds that \(|{\mathscr {H}}_{x + i}| \le k + \frac{1}{2^{i}}(|{\mathscr {H}}_x| - k)\).
Proof
First, we consider what happens when we increase the cost by 1. We fix an arbitrary \(x' \ge 0\). Lemma 7 shows that at most k clusters from \({\mathscr {H}}_{x'}\) are left untouched, while the remaining \(|{\mathscr {H}}_{x'}| - k\) clusters have to be merged with at least one other cluster (thus at least halving the number of those clusters) to get to \({\mathscr {H}}_{x' + 1}\). This yields a bound of

$$\begin{aligned} |{\mathscr {H}}_{x'+1}|\le k+\frac{1}{2}\big (|{\mathscr {H}}_{x'}|-k\big ). \end{aligned}$$
Now, the case for general \(i \in {\mathbb {N}}\) follows by a straightforward induction. We have just shown that the claim is true for \(i = 1\), where we set \(x' = x\). For the induction step suppose that

$$\begin{aligned} |{\mathscr {H}}_{x+i-1}|\le k+\frac{1}{2^{i-1}}\big (|{\mathscr {H}}_{x}|-k\big ). \end{aligned}$$
Substituting this into the inequality

$$\begin{aligned} |{\mathscr {H}}_{x+i}|\le k+\frac{1}{2}\big (|{\mathscr {H}}_{x+i-1}|-k\big ), \end{aligned}$$
derived from the first part of our proof with \(x' = x + i -1\), yields

$$\begin{aligned} |{\mathscr {H}}_{x+i}|\le k+\frac{1}{2^{i}}\big (|{\mathscr {H}}_{x}|-k\big ), \end{aligned}$$
as claimed. \(\square\)
Proposition 1
For all \(k < t \le n\) it holds that \({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le \lceil \log (t - k) \rceil + 1 + {{\,\textrm{cost}\,}}({\mathscr {C}}_t)\).
Proof
Let \(x = {{\,\textrm{cost}\,}}({\mathscr {C}}_t)\), so that \({\mathscr {H}}_x\) consists of at most t clusters. Applying Corollary 2 with \(i = \lceil \log (t - k) \rceil + 1\) then shows that

$$\begin{aligned} |{\mathscr {H}}_{x+i}|\le k+\frac{1}{2^{i}}\big (|{\mathscr {H}}_{x}|-k\big )\le k+\frac{t-k}{2^{\lceil \log (t-k)\rceil +1}}\le k+\frac{1}{2}<k+1. \end{aligned}$$
That is, \({\mathscr {H}}_{x + i}\) emerges from \({\mathscr {C}}_k\) by merging some (or none) of its clusters and we can conclude that \({{\,\textrm{cost}\,}}({\mathscr {C}}_k) \le {{\,\textrm{cost}\,}}({\mathscr {H}}_{x + i}) \le x + i = {{\,\textrm{cost}\,}}({\mathscr {C}}_t) + \lceil \log (t - k) \rceil + 1.\) \(\square\)
5.1.2 Part 2: a cheap clustering with few clusters
Suppose that there exists a complete linkage clustering \({\mathscr {C}}_t\) for some \(t > k\) with \(t\in O(2^k)\) clusters and \({{\,\textrm{cost}\,}}({\mathscr {C}}_t) \in O(k)\). Then applying Proposition 1 shows that

$$\begin{aligned} {{\,\textrm{cost}\,}}({\mathscr {C}}_k)\le \lceil \log (t-k)\rceil +1+{{\,\textrm{cost}\,}}({\mathscr {C}}_t)\in O(k) \end{aligned}$$
and Theorem 3 is proven (recall that \({{\,\textrm{cost}\,}}({\mathscr {O}}) = \frac{1}{2}\)). We show that \({\mathscr {C}}_t = {\mathscr {H}}_{4k+2}\) is a sufficiently good choice. To estimate the size of \({\mathscr {H}}_{4k+2}\), we distinguish between active and inactive clusters. Remember that \({\mathscr {O}}_C=\{O\in {\mathscr {O}}\mid O\cap C\ne \emptyset \}\) is the set of optimal clusters hit by C.
Definition 1
We call a cluster \(C \in {\mathscr {H}}_x\) active, if \({{\,\textrm{cost}\,}}(C) \le 4 \cdot |{\mathscr {O}}_C|\), or if there exists \(1\le i<x\) and a cluster \(C' \in {\mathscr {H}}_{x-i}\) such that \({\mathscr {O}}_C = {\mathscr {O}}_{C'}\) and \({{\,\textrm{cost}\,}}(C')\le 4\cdot |{\mathscr {O}}_{C'}|\). Otherwise, C is called inactive.
Notice that the definition of an active cluster C directly implies that \({{\,\textrm{cost}\,}}(C)\le 4|{\mathscr {O}}_C|+1\).
The behavior that makes complete linkage more difficult to analyze than single linkage is that the former sometimes merges clusters that are quite far apart. That is, contrary to single linkage, complete linkage can produce clusters that are very expensive relative to the number of optimal clusters hit by them. We mark such clusters as inactive and count them directly the first time they are created. We will see that the number of such clusters is small. The number of active clusters is potentially large, but once the cost of the clustering reaches \(4k+2\) this number can also be bounded, as we see in the following lemma.
Lemma 8
There are at most \(2^k\) active clusters in \({\mathscr {H}}_{4k+2}\).
Proof
Notice that at time \(t_{\le 4k+2}\) there cannot exist two active clusters \(C_1\) and \(C_2\) with \({\mathscr {O}}_{C_1} \subseteq {\mathscr {O}}_{C_2}\). Indeed, since \(C_2\) hits all the optimal clusters hit by \(C_1\) we get that

$$\begin{aligned} {{\,\textrm{cost}\,}}(C_1\cup C_2)\le {{\,\textrm{cost}\,}}(C_2)+2{{\,\textrm{cost}\,}}({\mathscr {O}})\le 4|{\mathscr {O}}_{C_2}|+2\le 4k+2, \end{aligned}$$
where the second inequality follows from the fact that \(C_2\) is active and therefore \({{\,\textrm{cost}\,}}(C_2)\le 4|{\mathscr {O}}_{C_2}|+1\). We conclude that \(C_1\) and \(C_2\) would have been merged in \({\mathscr {H}}_{4k+2}\). Now, if there are more than \(2^k\) active clusters in \({\mathscr {H}}_{4k+2}\), then at least two of them must hit exactly the same set of optimal clusters. Since we have just ruled this out, the lemma follows. \(\square\)
We estimate the number of inactive clusters by looking at the circumstances under which they arise. As it happens, at each step there are not many clusters whose merge yields an inactive cluster.
Lemma 9
There are at most \(4k^2 + k\) inactive clusters in \({\mathscr {H}}_{4k+2}\).
Proof
Let \(m_x\) be the number of inactive clusters in \({\mathscr {H}}_x\). We show that the recurrence relation \(m_x \le m_{x-1} + k\) holds for any \(x \in {\mathbb {N}}\). In that case \(m_{4k+2} \le (4k+1)k = 4k^2 + k\) since \(m_1 = 0\) and we are done.
To prove the recurrence relation first fix some arbitrary \(x \in {\mathbb {N}}\) and let \(D \in {\mathscr {H}}_x\) be an inactive cluster. Let \(D_1, \ldots , D_\ell \in {\mathscr {H}}_{x-1}\) be the clusters whose merge results in D. We show that none of them can be active at time \(t_{\le x - 1}\) and have cost at least \(x-2\). Since this only leaves few possible clusters, we get the recurrence inequality given above. Suppose that for one of the clusters, say \(D_i\), it holds that \(D_i\) is active and \({{\,\textrm{cost}\,}}(D_i) \ge x - 2\). Right away, notice that \(|{\mathscr {O}}_{D_i}| < |{\mathscr {O}}_D|\) since otherwise D would also be active by definition. But then

$$\begin{aligned} {{\,\textrm{cost}\,}}(D)\le x\le {{\,\textrm{cost}\,}}(D_i)+2\le 4|{\mathscr {O}}_{D_i}|+3\le 4\big (|{\mathscr {O}}_D|-1\big )+3<4|{\mathscr {O}}_D| \end{aligned}$$
contradicts the assumption of D being inactive. As such, we know that all \(D_i\) (\(i=1,\ldots ,\ell\)) must be inactive or have cost less than \(x - 2\). In other words, each inactive cluster in \({\mathscr {H}}_x\) descends from the set

$$\begin{aligned} \{D'\in {\mathscr {H}}_{x-1}\mid D'\text { is inactive}\}\;\cup \;\{D'\in {\mathscr {H}}_{x-1}\mid {{\,\textrm{cost}\,}}(D')<x-2\}. \end{aligned}$$
The cardinality of the set on the left is \(m_{x-1}\) and, by Lemma 7, the cardinality of the set on the right is at most k. This proves the claim. \(\square\)
Corollary 3
\({\mathscr {H}}_{4k+2}\) consists of at most \(2^k + 4k^2 + k\) clusters.
Notice that Theorem 3 is an immediate consequence of Corollary 3 and Proposition 1.
5.2 An upper bound for diameter-based cost
The main challenge in proving an upper bound on the approximation guarantee of complete linkage when replacing the k-center objective by the k-diameter objective is to deal with the possibly large increase of cost after a merge step. For the k-center objective, complete linkage roughly halves the number of clusters while the cost increases by a constant amount, and this is repeated as long as the number of clusters is larger than k (see Corollary 2). This is an easy conclusion from the fact that whenever the centers of two clusters are contained in the same optimal cluster, merging the two clusters increases the cost by at most twice the cost of the optimal clustering. If we try to apply this insight to analyze complete linkage for the k-diameter objective we now consider the merge of two clusters which intersect the same optimal cluster. As we see in Fig. 1, merging these two clusters can double the cost in the worst case, therefore we are not able to prove a statement similar to Corollary 2 for k-diameter. In conclusion, when dealing with the diameter we ignore Part 1 of the analysis presented in Sect. 5.1 for the radius and instead follow some ideas of Part 2 in Sect. 5.1, where we divide the clusters constructed by complete linkage into active and inactive clusters. Even though we substantially change the definition of inactive and active clusters the main idea stays the same: the cost of active clusters can be upper bounded nicely while we guarantee that there are not too many inactive clusters. A main difference to Part 2 of the analysis for the radius is that the total number of active and inactive clusters must now be upper bounded by k instead of \(O(2^k)\), which yields the increase in the approximation factor for diameter.
We now give a brief overview of the ideas used to upper bound the approximation factor of complete linkage for the diameter. For some arbitrary but fixed k let \({\mathscr {O}}\) denote an optimal k-diameter solution and assume that \({{\,\textrm{cost}\,}}({\mathscr {O}})=1\) from now on. Consider the clustering \({\mathscr {H}}_1\) computed by complete linkage at time \(t_{\le 1}\). Observe that every optimal cluster can fully contain at most one cluster from \({\mathscr {H}}_1\), as the union of two such clusters would cost at most 1. Now, consider the graph \(G=(V,E)\) with \(V={\mathscr {O}}\) and edges \(\{A,B\} \subseteq V\) for every cluster \(C \in {\mathscr {H}}_1\) intersecting A and B. If there is such an edge \(\{A, B\}\), then the cost of merging A and B is upper bounded by 3. We can go even further and consider the merge of all optimal clusters in a connected component of G. Suppose the size of the connected component is m, then the resulting cluster costs at most \(2m-1\). There are two extreme cases in which we could end up: if \(E=\emptyset\), then \({\mathscr {H}}_1 = {\mathscr {O}}\) and complete linkage has successfully recovered the optimal solution. On the other hand, if G is connected, then merging all points costs at most \(2k-1\) and we get an O(k)-approximative solution. The remaining cases are more difficult to handle. We proceed by successively adding edges between optimal clusters, while maintaining the property that for a connected component Z in G merging \(\cup _{A\in V(Z)}A\) costs at most \(|V(Z)|^{\ln (3)/\ln (2)}\). This leads to an upper bound of \(\lceil k^{\ln (3)/\ln (2)}\rceil\) for all clusters C constructed by complete linkage with \(C\subset \cup _{A\in V(Z)}A\). We call such clusters active clusters. All clusters which do not admit this property are called inactive clusters. We show that the number of inactive clusters is sufficiently small, such that in the end we are able to prove that \({\mathscr {H}}_{\lceil k^{\ln (3)/\ln (2)}\rceil }\) consists of at most k clusters. This immediately leads to the following theorem.
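As an illustration of this first step, the connected components of the cluster graph \(G\) can be computed with a small union-find over the optimal clusters; the following sketch and its names (`optimal` and `H1` as lists of point sets) are ours:

```python
def cluster_graph_components(optimal, H1):
    """Connected components of the graph on optimal clusters in which
    A and B are adjacent iff some cluster C in H1 intersects both."""
    parent = list(range(len(optimal)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for C in H1:
        hit = [i for i, O in enumerate(optimal) if O & C]  # clusters hit by C
        for i in hit[1:]:
            parent[find(i)] = find(hit[0])                 # union them
    components = {}
    for i in range(len(optimal)):
        components.setdefault(find(i), []).append(i)
    return list(components.values())
```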
Theorem 4
Let \(({\mathscr {C}}_k)_{k=1}^{n}\) be the hierarchical clustering computed by complete linkage on \((P,{{\,\textrm{dist}\,}})\) optimizing the diameter. For all \(1 \le k \le n\) the diameter \({{\,\textrm{cost}\,}}({\mathscr {C}}_k)\) is upper bounded by \(\lceil k^{\ln (3)/\ln (2)}\rceil {{\,\textrm{cost}\,}}({\mathscr {O}}_k)\), where \({\mathscr {O}}_k\) is an optimal k-diameter clustering.
Let \(\alpha = \ln (3)/\ln (2)\) from now on. Essential for this section is a sequence of cluster graphs \(G_t=(V_t, E_t)\) for \(t = 1, \ldots , \lceil k^{\alpha }\rceil\) constructed directly on the set \(V_t = {\mathscr {O}}\) of optimal k-clusters. We start with the cluster graph \(G_1\) that contains edges \(\{A, B\}\) for every two vertices \(A, B \in V_1 = {\mathscr {O}}\) that are hit by a common cluster from \({\mathscr {H}}_1\). To this we successively add edges based on a vertex labeling in order to create the remaining cluster graphs \(G_2, \ldots , G_{ \lceil k^{\ln (3)/\ln (2)}\rceil }\). The labeling distinguishes vertices as being either active or inactive. We denote the set of active vertices in \(V_t\) by \(V_t^a\) and the set of inactive ones by \(V_t^i\). In the beginning (\(t = 1\)) the inactive vertices are set to precisely those that are isolated: \(V_1^i = \{O \in V_1 \, | \, \delta _{G_1}(O) = \emptyset \}\). For \(t\ge 2\), the labeling is outlined in Definition 2. Over the course of time, active vertices may become inactive, but inactive vertices never become active again.
Given a labeling for \(V_{t+1}\), we construct \(G_{t+1}\) from \(G_{t}\) by adding additional edges: If there are two active vertices \(A, B \in V_{t+1}^a\) that are both hit by a common cluster from \({\mathscr {H}}_{t+1}\), we add an edge \(\{A, B\}\) to \(E_{t+1}\).
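As a concrete illustration, one step of this construction can be sketched as follows (a schematic of our own; optimal clusters and complete linkage clusters are represented as frozensets of points, the graph as a set of index pairs):

```python
from itertools import combinations

def add_edges(edges, optimal, cl_clusters, active):
    """One step G_t -> G_{t+1}: for every cluster C of the complete linkage
    clustering H_{t+1}, connect every pair of *active* optimal clusters
    (given by their indices into `optimal`) that C intersects."""
    for c in cl_clusters:
        hit = [i for i in sorted(active) if c & optimal[i]]
        edges |= set(combinations(hit, 2))
    return edges
```

Here `optimal` is the list of optimal clusters, `active` the set of indices of currently active vertices, and `edges` the edge set built so far; the isolated-vertex initialization for \(t=1\) and the relabeling of Definition 2 below are omitted.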
Definition 2
Let \(t\ge 1\) and \(A \in V_{t+1}\) be an arbitrary optimal cluster and \(Z_A\) the connected component in \(G_{t}\) that contains A. We call A inactive (i.e., \(A\in V_{t+1}^i\) ) if \(\lceil {{\,\textrm{cost}\,}}(Z_A)\rceil \le t\), and active otherwise. Here, and in the following \({{\,\textrm{cost}\,}}(Z_A)={{\,\textrm{cost}\,}}(\bigcup _{B\in V(Z_A)} B)\) denotes the cost of merging all optimal clusters contained in \(V(Z_A)\).
Thus if a connected component in \(G_{t}\) has small cost, then all vertices in this component become inactive in \(G_{t+1}\) by definition. We state the following useful properties of inactive vertices in \((G_t)_{t=1}^{ \lceil k^{\alpha }\rceil }\).
Lemma 10
If Z is a connected component in \(G_{t+1}\) with \(V(Z) \cap V_{t+1}^i \ne \emptyset\), then
1. Z is also a connected component in \(G_t\) and \(\lceil {{\,\textrm{cost}\,}}(Z)\rceil \le t\),
2. we have \(V(Z) \subseteq V_{t+1}^i\), i.e., all vertices in Z become inactive at the same time.
Moreover we have \(V_{t}^i \subseteq V_{t+1}^i\), so once vertices become inactive, they stay inactive. Equivalently, \(V_{t+1}^a \subseteq V_t^a\).
Proof
Take any inactive vertex \(A \in V_{t+1}^i \cap V(Z)\) and consider the connected component \(Z_A\) in \(G_t\) containing A. By Definition 2, we have that \(\lceil {{\,\textrm{cost}\,}}(Z_A)\rceil \le t\) and so all other vertices in \(Z_A\) have to be in \(V_{t+1}^i\) as well. We observe that \(E_{t+1} \setminus E_t\) only contains edges between vertices from \(V_{t+1}^a\) by construction. This shows \(Z = Z_A\).
It is left to show that inactive vertices stay inactive. For \(t=1\) the inactive vertices \(V_1^i\) are isolated in \(G_1\), so each of them forms a connected component of cost at most 1 and remains inactive at step \(t = 2\) by Definition 2. For \(t \ge 2\), consider an inactive vertex \(A \in V_t^i\) and the connected component Z in \(G_t\) containing it. Applying the first two claims at step \(t-1\) yields \(V(Z)\subset V_t^i\) and \(\lceil {{\,\textrm{cost}\,}}(Z) \rceil \le t - 1 \le t\), and thus \(A\in V(Z)\subset V_{t+1}^i\) by Definition 2. \(\square\)
Definition 3
For fixed \(t\in {\mathbb {N}}\) we define \({\mathscr {I}}_t=\{C\in {\mathscr {H}}_t\mid {\mathscr {O}}_C\cap V_t^i\ne \emptyset \}\) as the set of all clusters in \({\mathscr {H}}_t\) which hit at least one inactive vertex of \(G_t\). We call these clusters inactive and all clusters from \({\mathscr {H}}_t\backslash {\mathscr {I}}_t\) active.
We prove the following easy property about active clusters.
Lemma 11
If \(C \in {\mathscr {H}}_t \setminus {\mathscr {I}}_t\), then \(G_t[{\mathscr {O}}_C]\) forms a clique. In particular there exists a connected component in \(G_t\) that fully contains \({\mathscr {O}}_C\).
Proof
By definition of \({\mathscr {I}}_t\), \({\mathscr {O}}_C\) must consist exclusively of active vertices. Since all of them are hit by \(C \in {\mathscr {H}}_t\) there exists an edge \(\{A, B\} \in E_t\) for every pair \(A, B \in {\mathscr {O}}_C\). In other words, \(G_t[{\mathscr {O}}_C]\) forms a clique and the claim follows. \(\square\)
This does not necessarily hold for an inactive cluster \(C\in {\mathscr {I}}_t\). As C hits at least one inactive vertex, the connected component Z which contains this vertex does not grow any further. If complete linkage later merges C with another cluster, the result is again an inactive cluster, which may hit vertices outside of Z. So \(G_{t'}\) does not reflect the progression of C for \(t'\ge t\). However, the number of such clusters cannot exceed \(|V_t^i|\).
Lemma 12
The number of inactive clusters in \({\mathscr {H}}_t\) is at most the number of inactive vertices at time t. That is, \(|{\mathscr {I}}_t| \le |V_t^i|\) holds for all \(t \in {\mathbb {N}}\).
Proof
We prove the claim by showing that the following inductive construction defines a family of injective mappings \(\phi _t: {\mathscr {I}}_t \rightarrow V_t^i\):
- Let \(C \in {\mathscr {I}}_1\) be an inactive cluster. By definition C thus has to intersect an inactive optimal cluster \(A \in V_1^i\). Actually, there can only be one such cluster, as any other optimal cluster that is hit would induce an edge incident to A in \(G_1\), making it active. Set \(\phi _1(C) = A\), so that \({\mathscr {O}}_C = \{\phi _1(C)\}\).
- For \(t>1\) and \(C\in {\mathscr {I}}_t\) we distinguish two cases: If there is no cluster in \({\mathscr {I}}_{t-1}\) that is a subset of C, we pick an arbitrary but fixed \(A \in {\mathscr {O}}_C \cap V_t^i\) and set \(\phi _t(C)=A\). Otherwise, we know that C must descend from some cluster \(D \in {\mathscr {I}}_{t-1}\) and we can set \(\phi _t(C) = \phi _{t-1}(D)\). Since \(\phi _{t-1}(D) \in V_{t-1}^i \subset V_t^i\) by Lemma 10, this shows that \(\phi _t\) really maps into \(V_t^i\).
Suppose that there exist two inactive clusters \(C, D \in {\mathscr {I}}_1\) that are mapped to the same inactive vertex \(A \in V_1^i\). Then, by the construction of \(\phi _1\), \({\mathscr {O}}_C = \{A\} = {\mathscr {O}}_D\) shows that C and D are actually fully contained in the same optimal cluster. The optimal cluster has diameter at most 1 and so C and D would have already been merged in \({\mathscr {H}}_1\). As this is not possible, \(\phi _1\) has to be injective.
Now, let \(t \ge 2\) be arbitrary and assume \(\phi _{t-1}\) to be injective. We show that in that case \(\phi _t\) also has to be injective. Suppose on the contrary, that there exist two different clusters \(C, D \in {\mathscr {I}}_t\) with \(\phi _t(C) = \phi _t(D)\). We distinguish three cases.
Case 1: Both C and D descend from (i.e., contain) clusters \(C', D' \in {\mathscr {I}}_{t-1}\) with \(\phi _t(C) = \phi _{t-1}(C')\) and \(\phi _t(D) = \phi _{t-1}(D')\), respectively. Then \(\phi _{t-1}(C') = \phi _t(C) = \phi _t(D) = \phi _{t-1}(D')\) entails that \(C' = D'\), since \(\phi _{t-1}\) is assumed to be injective. Clearly, \(C' = D'\) cannot end up being a subset of two different clusters in \({\mathscr {I}}_t\) and so we end up in a contradiction.
Case 2: Neither C nor D descends from a cluster in \({\mathscr {I}}_{t-1}\). In other words, C and D fully descend from clusters in \({\mathscr {H}}_{t-1} \setminus {\mathscr {I}}_{t-1}\) and so there exist clusters \(C', D' \in {\mathscr {H}}_{t-1} \setminus {\mathscr {I}}_{t-1}\) contained in C and D, respectively, such that \(A = \phi _t(C) = \phi _t(D) \in {\mathscr {O}}_{C'} \cap {\mathscr {O}}_{D'}\). Applying Lemma 11 yields the existence of a connected component Z in \(G_{t-1}\) with \(V(Z) \supset {\mathscr {O}}_{C'} \cup {\mathscr {O}}_{D'}\). We show that this connected component has cost at most \(t-1\). In that case, \(C'\) and \(D'\) would have already been merged in \({\mathscr {H}}_{t-1}\); a contradiction. To show that \({{\,\textrm{cost}\,}}(Z) \le t-1\), consider the connected component \(Z'\) in \(G_t\) containing \(A = \phi _t(C) = \phi _t(D) \in {\mathscr {O}}_C \cap {\mathscr {O}}_D \cap V_t^i\). Since A was chosen from a subset of \(V_t^i\), we know from Lemma 10 that \(Z'\) is also a connected component in \(G_{t-1}\) with \({{\,\textrm{cost}\,}}(Z') \le t-1\). Now, \(A \in V(Z) \cap V(Z')\) shows that \(Z = Z'\) and so we are done.
Case 3: D contains a cluster \(D' \in {\mathscr {I}}_{t-1}\), so that \(\phi _t(D) = \phi _{t-1}(D') \in V_{t-1}^i\), whereas C does not. (The symmetric case with the roles of C and D swapped is analogous and left out.) Since C fully descends from \({\mathscr {H}}_{t-1} \setminus {\mathscr {I}}_{t-1}\), we know that \({\mathscr {O}}_C \subseteq V_{t-1}^a\). But this already yields a contradiction: \(V_{t-1}^a \ni \phi _t(C) = \phi _t(D) = \phi _{t-1}(D') \in V_{t-1}^i\).
This covers all possible cases, with each one ending in a contradiction. Hence \(\phi _t\) has to be injective and by induction this holds for all \(t \in {\mathbb {N}}\). \(\square\)
Active clusters from \({\mathscr {H}}_t\) are nicely represented by the graph \(G_t\), as shown in Lemma 11. We can indirectly bound the cost of active clusters by bounding the cost of the connected components they are contained in.
Lemma 13
Let Z be a connected component in \(G_t\). If \(V(Z)\subset V_t^a\), we have \(\lceil {{\,\textrm{cost}\,}}(Z) \rceil \le |V(Z)|^{\alpha }.\)
Proof
Again, we prove this via an induction over t. For \(t=1\) and \(A,B\in V(Z)\) we want to upper bound the distance between \(p\in A\) and \(q\in B\). Let \(A=Q_1,\ldots , Q_s=B\) be a simple path connecting A and B in Z. We know by definition of \(G_1\) that for \(j=1,\ldots , s-1\) there is a pair of points \(p_j\in Q_j\) and \(q_j\in Q_{j+1}\) with \({{\,\textrm{dist}\,}}(p_j,q_j)\le 1\). Using the triangle inequality we obtain
\[{{\,\textrm{dist}\,}}(p,q)\le {{\,\textrm{dist}\,}}(p,p_1)+\sum _{j=1}^{s-1}{{\,\textrm{dist}\,}}(p_j,q_j)+\sum _{j=1}^{s-2}{{\,\textrm{dist}\,}}(q_j,p_{j+1})+{{\,\textrm{dist}\,}}(q_{s-1},q)\le 2s-1\le 2|V(Z)|-1.\]
Here we use that \(q_j\) and \(p_{j+1}\) are in the same optimal cluster, thus the distance between those points is at most one; the same holds for the pairs \(p,p_1\) and \(q_{s-1},q\).
Since V(Z) contains only active vertices we have \(|V(Z)|\ge 2\). Using the above upper bound on the distance between two points in \(\bigcup _{A\in V(Z)} A\) we obtain
\[\lceil {{\,\textrm{cost}\,}}(Z)\rceil \le 2|V(Z)|-1\le |V(Z)|^{\alpha },\]
where the last inequality follows from the fact that the function \(h(x)=x^\alpha -2x+1\) is convex and \(h(1)=h(2)=0\). Thus \(h(x)\le 0\) for \(x\in (1,2)\) and \(h(x)\ge 0\) for \(x\in {\mathbb {R}}_{\ge 0}\backslash (1,2)\); in particular \(2x-1\le x^{\alpha }\) for all \(x\ge 2\).
For \(t>1\) let \(Z_1,\ldots ,Z_u\) denote the connected components in \(G_{t-1}\) with \(V(Z)=\bigcup _{j=1}^u V(Z_j)\) and let \(j\in \{1,\ldots ,u\}\). We observe that \(V(Z_j)\subset V(Z)\subset V_t^a\subset V_{t-1}^a\). Thus we obtain by induction that
\[\lceil {{\,\textrm{cost}\,}}(Z_j)\rceil \le |V(Z_j)|^{\alpha }.\tag{1}\]
Suppose that \(\lceil {{\,\textrm{cost}\,}}(Z_j)\rceil \le t-1\). Then \(V(Z_j)\subset V_t^i\) by definition, which is a contradiction to \(V(Z)\cap V_t^i=\emptyset\). So we must have
\[\lceil {{\,\textrm{cost}\,}}(Z_j)\rceil \ge t.\tag{2}\]
Combining (1) and (2) we obtain
\[t\le |V(Z_j)|^{\alpha }\quad \text{for all } j\in \{1,\ldots ,u\}.\tag{3}\]
For \(A,B\in V(Z)\) we want to upper bound the distance between \(p\in A\) and \(q\in B\). Let \(A=Q_1,\ldots , Q_{s}=B\) be a simple path connecting A and B in Z which enters and leaves every connected component \(Z_j\) for \(j\in \{1,\ldots , u\}\) at most once. We divide the path into several parts such that every part lies in one connected component from \(\{Z_1,\ldots , Z_u\}\). Let \(1= m_1<m_2<\ldots < m_{\ell }=s\) such that \(Q_{m_j},\ldots , Q_{m_{j+1}-1}\) lie in one connected component \(Z^{(j)}\in \{Z_1,\ldots , Z_u\}\) and \(Z^{(j)}\ne Z^{(j+1)}\) for all \(j\in \{1,\ldots , {\ell -1}\}\). Since \(\{Q_{m_j-1}, Q_{m_j}\}\in E_t\) for \(j\in \{2,\ldots ,\ell \}\) we know that there exists a cluster in \({\mathscr {H}}_t\) that intersects \(Q_{m_j-1}\) and \(Q_{m_j}\), thus there is a pair of points \(p_j\in Q_{m_j-1}\) and \(q_j\in Q_{m_j}\) such that \({{\,\textrm{dist}\,}}(p_j,q_j)\le t\). We obtain
\[{{\,\textrm{dist}\,}}(p,q)\le \sum _{j=1}^{\ell }\lceil {{\,\textrm{cost}\,}}(Z^{(j)})\rceil +\sum _{j=2}^{\ell }{{\,\textrm{dist}\,}}(p_j,q_j)\le \sum _{j=1}^{\ell }|V(Z^{(j)})|^{\alpha }+(\ell -1)t\le \sum _{j=1}^{\ell }|V(Z^{(j)})|^{\alpha }+(\ell -1)\min _{1\le j\le \ell }|V(Z^{(j)})|^{\alpha }.\]
For the second inequality we use (1) and \({{\,\textrm{dist}\,}}(p_j,q_j)\le t\). For the third inequality we use (3). Now it remains to show that
\[\sum _{j=1}^{\ell }|V(Z^{(j)})|^{\alpha }+(\ell -1)\min _{1\le j\le \ell }|V(Z^{(j)})|^{\alpha }\le \Bigl (\sum _{j=1}^{\ell }|V(Z^{(j)})|\Bigr )^{\alpha }.\]
For this purpose we assume without loss of generality that \(|V(Z^{(1)})|\ge |V(Z^{(2)})|\ge \ldots \ge |V(Z^{(\ell )})|\) and define \(x_j=\frac{|V(Z^{(j)})|}{|V(Z^{(1)})|}\) for \(j=1,\ldots ,\ell\). Dividing both sides by \(|V(Z^{(1)})|^{\alpha }\) we obtain the following equivalent inequality:
\[\sum _{j=1}^{\ell }x_j^{\alpha }+(\ell -1)x_{\ell }^{\alpha }-\Bigl (\sum _{j=1}^{\ell }x_j\Bigr )^{\alpha }\le 0.\]
Since \(1=x_1\ge \ldots \ge x_{\ell }\ge 0\) this follows directly from Lemma 14 and thus
\[{{\,\textrm{dist}\,}}(p,q)\le \Bigl (\sum _{j=1}^{\ell }|V(Z^{(j)})|\Bigr )^{\alpha }\le |V(Z)|^{\alpha }.\]
Since \(\sum _{j=1}^{\ell }\lceil {{\,\textrm{cost}\,}}(Z^{(j)})\rceil +(\ell -1)t\) is an integer, the bound carries over to the ceiling and we obtain \(\lceil {{\,\textrm{cost}\,}}(Z)\rceil \le |V(Z)|^{\alpha }\), which proves the lemma. \(\square\)
Lemma 14
For \(\ell \ge 1\) let
\[f(y_1,\ldots ,y_{\ell })=\sum _{j=1}^{\ell }y_j^{\alpha }+(\ell -1)y_{\ell }^{\alpha }-\Bigl (\sum _{j=1}^{\ell }y_j\Bigr )^{\alpha }.\]
We have \(f(y_1,\ldots , y_{\ell })\le 0\) for all \(1=y_1\ge y_2\ge \ldots \ge y_{\ell }\ge 0.\)
Proof
Let \(M=\{(a_1,\ldots , a_{\ell })\mid 1=a_1\ge a_2\ge \ldots \ge a_{\ell }\ge 0\}\). We have to prove that \(f(y_1,\ldots ,y_{\ell })\le 0\) for all \((y_1,\ldots ,y_{\ell })\in M\).
Let \(s\ge 1\) and consider all points in M whose sum of coordinates is exactly s, i.e., \(M_s=\{(a_1,\ldots , a_{\ell })\in M\mid a_1+\ldots +a_{\ell }=s\}\). Notice that \(M_s\) is a convex polytope and f is a convex function on \(M_s\) (the term \((y_1+\ldots +y_{\ell })^{\alpha }=s^{\alpha }\) is constant there), and thus the maximum of f on \(M_s\) is attained at one of the vertices of \(M_s\). Thus the maximum \((y_1,\ldots , y_{\ell })\) of f on \(M_s\) must be of the form \(y_1=\ldots =y_{\ell _1}=1\) and \(y_{\ell _1+1}=\ldots =y_{\ell }=b\). We obtain for \(k=\ell -\ell _1\)
\[f(y_1,\ldots ,y_{\ell })=\ell _1+(2k+\ell _1-1)b^{\alpha }-(\ell _1+kb)^{\alpha }.\]
It is left to show that for all \(b\in [0,1]\) and natural numbers \(\ell _1\ge 1\), \(k\ge 0\) this term is at most 0. Thus we define for \(\ell _1,k,b\in {\mathbb {R}}_{\ge 0}\) the function
\[g(\ell _1,k,b)=\ell _1+(2k+\ell _1-1)b^{\alpha }-(\ell _1+kb)^{\alpha }.\]
The partial derivative of g with respect to k is given by
\[\frac{\partial g}{\partial k}(\ell _1,k,b)=2b^{\alpha }-\alpha b(\ell _1+kb)^{\alpha -1}.\]
Now suppose that either \(\ell _1\ge 2\) and \(b\in [0,1]\) or \(\ell _1=1, k\ge 1\) and \(b\in [0,1]\). In both cases we obtain
\[\ell _1+kb\ge 2b\quad \text{and hence}\quad (\ell _1+kb)^{\alpha -1}\ge (2b)^{\alpha -1}=\frac{3}{2}\,b^{\alpha -1},\]
using \(2^{\alpha -1}=3/2\), and thus
\[\frac{\partial g}{\partial k}(\ell _1,k,b)\le 2b^{\alpha }-\frac{3}{2}\,\alpha \,b^{\alpha }=\Bigl (2-\frac{3}{2}\,\alpha \Bigr )b^{\alpha }\le 0.\]
Therefore g is monotonically decreasing in k for these values and we conclude that the maximum of g for \(b\in [0,1]\) and natural numbers \(\ell _1\ge 1, k\ge 0\) must be attained at one of the points \((\ell _1,0,b)\) or \((1,1,b)\) for \(\ell _1\in {\mathbb {N}}_{\ge 1}, b\in [0,1]\). Now
\[g(\ell _1,0,b)=\ell _1+(\ell _1-1)b^{\alpha }-\ell _1^{\alpha }\]
is monotonically increasing in b, so the maximum is attained for \(b=1\). Observe that \(h(\ell _1)=g(\ell _1,0,1)=2\ell _1-1-\ell _1^{\alpha }\) is concave and we have \(h(1)=h(2)=0\). Thus \(h(\ell _1)\ge 0\) for \(\ell _1\in (1,2)\) and \(h(\ell _1)\le 0\) for \(\ell _1\in {\mathbb {R}}_{\ge 0}\backslash (1,2)\); in particular \(g(\ell _1,0,b)\le 0\) for all natural numbers \(\ell _1\ge 1\) and \(b\in [0,1]\).
It is left to check the value of g at \((1, 1, b)\) for \(b\in [0,1]\). Let \(\phi (b)=g(1,1,b)=1+2b^{\alpha }-(1+b)^{\alpha }\). We consider the second derivative of \(\phi\), which is given by
\[\frac{\textrm{d}^2}{\textrm{d} b^2}\phi (b)=\alpha (\alpha -1)\bigl (2b^{\alpha -2}-(1+b)^{\alpha -2}\bigr ).\]
Since \(b^{\alpha -2}\ge (1+b)^{\alpha -2}\) we obtain \(\frac{\textrm{d}^2}{\textrm{d} b^2}\phi (b)\ge 0\), so \(\phi\) is convex on [0, 1]. Since \(\phi (0)=\phi (1)=0\) and \(\phi\) is convex we know that \(\phi (b)\le 0\) for all \(b\in [0,1]\). This proves the lemma. \(\square\)
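Since Lemma 14 is elementary but easy to get wrong, a quick numerical sanity check (not a substitute for the proof above) on random monotone vectors can be reassuring:

```python
import math
import random

ALPHA = math.log(3) / math.log(2)

def f(y):
    """f(y_1,...,y_l) = sum_j y_j^alpha + (l-1) * y_l^alpha - (sum_j y_j)^alpha."""
    return sum(t ** ALPHA for t in y) + (len(y) - 1) * y[-1] ** ALPHA - sum(y) ** ALPHA

random.seed(0)
for _ in range(10000):
    l = random.randint(1, 8)
    y = sorted([1.0] + [random.random() for _ in range(l - 1)], reverse=True)
    assert f(y) <= 1e-9, y  # Lemma 14: f <= 0 whenever 1 = y_1 >= ... >= y_l >= 0
```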
We see that a connected component in \(G_{\lceil k^{\alpha }\rceil }\) cannot contain two active clusters, yielding the following upper bound.
Lemma 15
At time \(t_{\le \lceil k^{\alpha }\rceil }\) the number of active clusters is less than or equal to the number of active vertices. In other words, \(|{\mathscr {H}}_{\lceil k^{\alpha }\rceil } {\setminus } {\mathscr {I}}_{\lceil k^{\alpha }\rceil }| \le |V_{\lceil k^{\alpha }\rceil }^a|\).
Proof
By Lemma 11 we know that every cluster \(C \in {\mathscr {H}}_{\lceil k^{\alpha }\rceil } {\setminus } {\mathscr {I}}_{\lceil k^{\alpha }\rceil }\) is fully contained in a connected component \(Z_C\) from \(G_{\lceil k^{\alpha }\rceil }\). We show that mapping any such C to an arbitrary vertex in \(Z_C\) yields an injective map \(\varphi : {\mathscr {H}}_{\lceil k^{\alpha }\rceil } {\setminus } {\mathscr {I}}_{\lceil k^{\alpha }\rceil } \hookrightarrow V_{\lceil k^{\alpha }\rceil }^a\). First, notice that \(\varphi\) is well-defined: If \(Z_C\) contains an inactive vertex, then all its vertices are inactive (Lemma 10), contradicting the choice of C as active.
Suppose now that there are two different clusters \(C, C' \in {\mathscr {H}}_{\lceil k^{\alpha }\rceil } {\setminus } {\mathscr {I}}_{\lceil k^{\alpha }\rceil }\) that are mapped to the same vertex \(\varphi (C) = \varphi (C')\). Then the connected components \(Z_C\) and \(Z_{C'}\), in which they are embedded, already have to coincide (\(Z_C = Z_{C'}\)). But we have just shown (Lemma 13) that \({{\,\textrm{cost}\,}}(Z_C) \le |V(Z_C)|^{\alpha } \le \lceil k^{\alpha }\rceil\) and so C and \(C'\) would have already been merged in \({\mathscr {H}}_{\lceil k^{\alpha }\rceil }\). As such the images of both cannot coincide and the map is injective. \(\square\)
Together with the bound for the number of inactive clusters we are now able to prove the theorem.
Proof
(Theorem 4) Using Lemmas 12 and 15 we obtain \(|{\mathscr {H}}_{\lceil k^{\alpha }\rceil }| = |{\mathscr {H}}_{\lceil k^{\alpha }\rceil } {\setminus } {\mathscr {I}}_{\lceil k^{\alpha }\rceil }| + |{\mathscr {I}}_{\lceil k^{\alpha }\rceil }| \le |V_{\lceil k^{\alpha }\rceil }^a| + |V_{\lceil k^{\alpha }\rceil }^i| = k,\) yielding \({{\,\textrm{cost}\,}}({\mathscr {C}}_{k})\le {{\,\textrm{cost}\,}}({\mathscr {H}}_{\lceil k^{\alpha }\rceil })\le \lceil k^{\alpha }\rceil {{\,\textrm{cost}\,}}({\mathscr {O}}_k)\). \(\square\)
6 The average approximation factor
We have seen previously that the approximation guarantee of complete linkage for the radius is in \(\Theta (k)\) and that the same holds for single linkage. This is rather surprising since complete linkage, unlike single linkage, always performs the merge that minimizes the objective function. Notice that on the worst-case instance \((V(P_k),{{\,\textrm{dist}\,}})\) presented in Sect. 4, complete linkage produces reasonably good clusterings for most cluster sizes \(\ell \ne k\). Depending on the application we may not need a strong approximation guarantee for all cluster sizes; instead it may be sufficient to find a hierarchical clustering which is a good approximation to the optimal cost for most of the cluster sizes. We try to incorporate this by considering the average approximation factor. The advantage of this new definition is emphasized by the fact that complete linkage for k-center performs asymptotically better than single linkage with respect to it. This fits our intuition that complete linkage is more suited for the task of constructing hierarchical clusterings for the k-center objective than single linkage, even though both compute a \(\Theta (k)\)-approximation with respect to the standard definition of an approximation factor.
Definition 4
Let \(({\mathscr {C}}_{k})_{k=1}^n\) be an arbitrary hierarchical clustering on \((P,{{\,\textrm{dist}\,}})\) and let \(({\mathscr {O}}_k)_{k=1}^n\) be optimal solutions for the radius or diameter. We denote by
\[\frac{1}{n}\sum _{k=1}^{n}\frac{{{\,\textrm{cost}\,}}({\mathscr {C}}_{k})}{{{\,\textrm{cost}\,}}({\mathscr {O}}_{k})}\]
the average approximation factor of \(({\mathscr {C}}_{k})_{k=1}^n\).
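In code the quantity is simply the mean of the per-k cost ratios; a direct transcription (assuming the two cost sequences have been computed already, and skipping clusterings of cost 0 such as the all-singletons clustering to avoid dividing by zero) might look as follows:

```python
def average_approximation_factor(hier_costs, opt_costs):
    """Mean over k of cost(C_k) / cost(O_k); both lists are indexed by k - 1.
    Entries with optimal cost 0 (e.g. the n-clustering) are skipped."""
    ratios = [c / o for c, o in zip(hier_costs, opt_costs) if o > 0]
    return sum(ratios) / len(ratios)
```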
The following corollary is an immediate consequence of Proposition 1.
Corollary 4
Let \(({\mathscr {C}}_{k})_{k=1}^n\) be the hierarchical clustering computed by complete linkage for the radius. We have
\[\frac{1}{n}\sum _{k=1}^{n}\frac{{{\,\textrm{cost}\,}}({\mathscr {C}}_{k})}{{{\,\textrm{cost}\,}}({\mathscr {O}}_{k})}\le \lceil \log (n)\rceil .\]
However, the upper bound of \(\lceil \log (n)\rceil\) seems too pessimistic. It would be interesting to know whether this bound is tight or whether complete linkage in fact computes a constant-factor approximation on average, and whether similar results hold for the diameter.
Next we give a lower bound on the average approximation factor for single linkage. Let \(n=2^s\) and consider the instance \(P=\{x_1,\ldots ,x_n\}\subset {\mathbb {R}}\) with \(x_i=i\). We can assume that in the k-th step single linkage merges the two clusters containing \(x_{n-k}\) and \(x_{n-k+1}\), as the distance between these clusters is 1. The k-clustering computed by single linkage on \((P,\Vert \,\cdot \,\Vert _1)\) then equals
\[{\mathscr {C}}_k=\bigl \{\{x_1\},\ldots ,\{x_{k-1}\},\{x_k,\ldots ,x_n\}\bigr \}\]
and has diameter \(n-k\).
On the other hand for \(0\le t\le s\) the optimal \(2^t\)-clustering has diameter \(2^{s-t}-1\) and consists of clusters with \(2^{s-t}\) consecutive points in P:
\[{\mathscr {O}}_{2^t}=\bigl \{\{x_{j2^{s-t}+1},\ldots ,x_{(j+1)2^{s-t}}\}\bigm | j=0,\ldots ,2^t-1\bigr \}.\]
Thus we obtain for the diameter
\[\frac{1}{n}\sum _{k=1}^{n}\frac{{{\,\textrm{cost}\,}}({\mathscr {C}}_{k})}{{{\,\textrm{cost}\,}}({\mathscr {O}}_{k})}\ge \frac{1}{n}\sum _{t=0}^{s-1}\,\sum _{k=2^t}^{2^{t+1}-1}\frac{{{\,\textrm{cost}\,}}({\mathscr {C}}_{2^{t+1}})}{{{\,\textrm{cost}\,}}({\mathscr {O}}_{2^{t}})}=\frac{1}{n}\sum _{t=0}^{s-1}2^{t}\cdot \frac{2^{s}-2^{t+1}}{2^{s-t}-1}\ge \frac{n}{21}-1,\]
where the first inequality uses that \({{\,\textrm{cost}\,}}({\mathscr {C}}_{k})\) and \({{\,\textrm{cost}\,}}({\mathscr {O}}_{k})\) are both non-increasing in k.
The same computation can be done for the radius, as the radius of \({\mathscr {C}}_{2^{t+1}}\) equals \(\frac{2^s-2^{t+1}}{2}\) and the radius of \({\mathscr {O}}_{2^{t}}\) equals \(\frac{2^{s-t}}{2}\).
Corollary 5
The average approximation factor achieved by single linkage on \((P,\Vert \,\cdot \,\Vert _1)\) for both radius and diameter is at least \(\frac{n}{21}-1\).
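The bound of Corollary 5 is also easy to check numerically. The sketch below uses the exact optimal diameter \(\lceil n/k\rceil -1\) on the line for every k (which only strengthens the check) and the tie-breaking assumed above, under which \({{\,\textrm{cost}\,}}({\mathscr {C}}_k)=n-k\):

```python
def single_linkage_average_factor(s):
    """Average diameter ratio of single linkage on P = {1, ..., 2^s}:
    cost(C_k) = n - k under the tie-breaking assumed in the text, and an
    optimal k-clustering of n equally spaced points has diameter ceil(n/k) - 1."""
    n = 2 ** s
    total = 0.0
    for k in range(1, n + 1):
        opt = -(-n // k) - 1  # ceil(n / k) - 1
        total += (n - k) / opt if opt > 0 else 0.0  # k = n contributes 0/0 := 0
    return total / n

for s in range(2, 11):
    assert single_linkage_average_factor(s) >= 2 ** s / 21 - 1
```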
We conclude that the average approximation factor achieved by complete linkage for k-center is asymptotically better than the average approximation factor achieved by single linkage. In general it may be of interest to consider other ways to measure the quality of hierarchical clusterings computed by complete linkage and single linkage.
References
Ackermann, M. R., Blömer, J., Kuntze, D., & Sohler, C. (2014). Analysis of agglomerative clustering. Algorithmica, 69(1), 184–215. https://doi.org/10.1007/s00453-012-9717-4
Ahmadian, S., Norouzi-Fard, A., Svensson, O., & Ward, J. (2020). Better guarantees for k-means and Euclidean k-median by primal–dual algorithms. SIAM Journal on Computing. https://doi.org/10.1137/18M1171321
Arutyunova, A., & Röglin, H. (2022). The price of hierarchical clustering. In 30th Annual European symposium on algorithms, ESA 2022 (Vol. 244, pp. 10:1–10:14). https://doi.org/10.4230/LIPIcs.ESA.2022.10.
Arutyunova, A., Großwendt, A., Röglin, H., Schmidt, M., & Wargalla, J. (2021). Upper and lower bounds for complete linkage in general metric spaces. In Approximation, randomization, and combinatorial optimization. Algorithms and techniques, APPROX/RANDOM 2021 (Vol. 207, pp. 18:1–18:22). https://doi.org/10.4230/LIPIcs.APPROX/RANDOM.2021.18.
Bock, F. (2022). Hierarchy cost of hierarchical clusterings. Journal of Combinatorial Optimization. https://doi.org/10.1007/s10878-022-00851-4
Byrka, J., Pensyl, T. W., Rybicki, B., Srinivasan, A., & Trinh, K. (2017). An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms, 13(2), 23:1-23:31. https://doi.org/10.1145/2981561
Charikar, M., Chekuri, C., Feder, T., & Motwani, R. (2004). Incremental clustering and dynamic information retrieval. SIAM Journal on Computing, 33(6), 1417–1440. https://doi.org/10.1137/S0097539702418498
Dasgupta, S., & Long, P. M. (2005). Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4), 555–569. https://doi.org/10.1016/j.jcss.2004.10.006
Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 293–306. https://doi.org/10.1016/0304-3975(85)90224-5
Großwendt, A., & Röglin, H. (2017). Improved analysis of complete-linkage clustering. Algorithmica, 78(4), 1131–1150. https://doi.org/10.1007/s00453-017-0284-6
Großwendt, A., Röglin, H., & Schmidt, M. (2019). Analysis of Ward's method. In Chan, T. M. (Ed.), Proceedings of the thirtieth annual ACM-SIAM symposium on discrete algorithms, SODA (pp. 2939–2957). SIAM. https://doi.org/10.1137/1.9781611975482.182.
Großwendt, A. K. (2020). Theoretical analysis of hierarchical clustering and the shadow vertex algorithm. Ph.D. thesis, University of Bonn. http://hdl.handle.net/20.500.11811/8348
Hershkowitz, D. E., & Kehne, G. (2020). Reverse greedy is bad for k-center. Information Processing Letters, 158, 105941. https://doi.org/10.1016/j.ipl.2020.105941
Hochbaum, D. S. (1984). When are NP-hard location problems easy? Annals of Operations Research, 1(3), 201–214. https://doi.org/10.1007/BF01874389
Hochbaum, D. S., & Shmoys, D. B. (1985). A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2), 180–184. https://doi.org/10.1287/moor.10.2.180
Hsu, W., & Nemhauser, G. L. (1979). Easy and hard bottleneck location problems. Discrete Applied Mathematics, 1(3), 209–215. https://doi.org/10.1016/0166-218X(79)90044-1
Lin, G., Nagarajan, C., Rajaraman, R., & Williamson, D. P. (2010). A general approach for incremental approximation and hierarchical clustering. SIAM Journal on Computing, 39(8), 3633–3669. https://doi.org/10.1137/070698257
Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236–244. https://doi.org/10.1080/01621459.1963.10500845
Acknowledgements
This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 416767905, 390685813. The authors would also like to thank an anonymous reviewer for suggesting an improvement for the calculations in Lemma 13 which led to an improvement of the upper bound for k-diameter from \(O(k^2)\) to \(O(k^{\ln (3)/\ln (2)})\).
Funding
Open Access funding enabled and organized by Projekt DEAL. This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 416767905, 390685813.
Contributions
All authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
An earlier version of this work previously appeared at APPROX/RANDOM 2021 (Arutyunova et al., 2021).
A Lower bound for complete linkage without bad ties
In this section we focus on modifying the instance \((V(P_k),{{\,\textrm{dist}\,}})\) such that merging two \(\ell\)-graphs \(G_{\ell }, G'_{\ell }\) which are part of the same \((\ell +1)\)-graph is slightly cheaper than performing any other merge in a clustering consisting of all \(\ell\)-graphs.
A.1 Diameter-based cost
We explain how to adjust the construction of the k-components for the diameter. Let \(\epsilon \in (0,\frac{1}{2})\). The definition of \(K_1\) stays the same. As before a k-component is constructed from two copies \(K_{k-1}^{(0)},K_{k-1}^{(1)}\) of the \((k-1)\)-component by taking the disjoint union of the corresponding graphs and increasing the level of each point in \(K_{k-1}^{(1)}\) by one. Here we do not add an edge of weight \(k-1\) between the unique point \(s\in V(G_{k-1}^{(0)})\) with level 1 and \(t\in V(G_{k-1}^{(1)})\) with level k. Instead we complete \(G_k\) by adding edges of weight \((k-1)(1-\epsilon )\) between \(x\in V(G_{k-1}^{(0)})\) and \(y\in V(G_{k-1}^{(1)})\) if they are not on the same level, i.e., \(\phi _k(x)\ne \phi _k(y)\).
The instance \(P_k\) is then constructed from k-copies \(K_k^{(1)}, \ldots , K_k^{(k)}\) of the k-component \(K_k\). We take the disjoint union of the corresponding k-graphs \(G_k^{(1)}, \ldots , G_k^{(k)}\) and connect them by adding edges \(\{x, y\}\) of weight 1 for every two points \(x \in V(G_k^{(i)})\) and \(y \in V(G_k^{(j)})\) with \(\phi _k^{(i)}(x) = \phi _k^{(j)}(y)\).
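For small k the recursive construction can be carried out explicitly. The following sketch (our own naming; it assumes the networkx package and represents the metric by shortest-path distances) builds the k-components and the instance \(P_k\):

```python
import networkx as nx

EPS = 0.25  # any epsilon in (0, 1/2) works

def component(k):
    """The k-component for the diameter: a weighted graph and a level map phi."""
    if k == 1:
        g = nx.Graph()
        g.add_node(0)
        return g, {0: 1}
    g0, ph0 = component(k - 1)
    g1, ph1 = component(k - 1)
    off = g0.number_of_nodes()
    g1 = nx.relabel_nodes(g1, {v: v + off for v in list(g1.nodes)})
    phi = dict(ph0)
    for v, level in ph1.items():
        phi[v + off] = level + 1  # the second copy is shifted up by one level
    g = nx.union(g0, g1)
    for x in g0.nodes:  # connect the two copies across different levels
        for y in g1.nodes:
            if phi[x] != phi[y]:
                g.add_edge(x, y, weight=(k - 1) * (1 - EPS))
    return g, phi

def instance(k):
    """P_k: k copies of the k-component, equal levels joined by weight-1 edges."""
    G, phi, copies = nx.Graph(), {}, []
    for _ in range(k):
        g, ph = component(k)
        off = G.number_of_nodes()
        g = nx.relabel_nodes(g, {v: v + off for v in list(g.nodes)})
        copies.append(list(g.nodes))
        G = nx.union(G, g)
        for v, level in ph.items():
            phi[v + off] = level
    for i in range(k):
        for j in range(i + 1, k):
            for x in copies[i]:
                for y in copies[j]:
                    if phi[x] == phi[y]:
                        G.add_edge(x, y, weight=1)
    dist = dict(nx.all_pairs_dijkstra_path_length(G))  # the metric on V(P_k)
    return G, phi, dist
```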
We show that the clustering computed by complete linkage on \((V(P_k),{{\,\textrm{dist}\,}})\) at time \(t_{\le \ell (1-\epsilon )}\) consists exactly of the \((\ell +1)\)-graphs that make up the instance.
Lemma 16
The distance between two points \(x, y \in V(P_k)\) is at least \(|\phi _k(x) - \phi _k(y)|(1-\epsilon ).\)
Proof
By the inductive construction of the components, an edge whose endpoints lie w levels apart has weight at least \(w(1-\epsilon )\). Summing this bound along the edges of any path connecting x and y shows that the distance between x and y is at least \(|\phi _k(x) - \phi _k(y)|(1-\epsilon )\). \(\square\)
As before we use the previous lemma to show that the diameter of any \(\ell\)-graph in \(P_k\) is \((\ell - 1)(1-\epsilon )\).
Lemma 17
Let \(G_\ell\) be an \(\ell\)-graph contained in \(P_k\). We have \({{\,\textrm{cost}\,}}(G_\ell )=(\ell - 1)(1-\epsilon )\).
Proof
We prove the upper bound \({{\,\textrm{cost}\,}}(G_\ell ) \le (\ell - 1)(1-\epsilon )\) by induction. The 1-graphs are points and so the claim follows trivially for \(\ell = 1\). Assume now that we have shown the claim for \(\ell -1\). Let \(G_{\ell }\) be an \(\ell\)-graph and \(s,t \in V(G_\ell )\) points such that \({{\,\textrm{cost}\,}}(G_\ell )={{\,\textrm{dist}\,}}(s,t)\). If these points lie in the same graph, say \(G_{\ell -1}^{(0)}\), of the two \((\ell -1)\)-graphs \(G_{\ell -1}^{(0)}\) and \(G_{\ell -1}^{(1)}\) that make up \(G_\ell\), then
\[{{\,\textrm{dist}\,}}(s,t)\le {{\,\textrm{cost}\,}}(G_{\ell -1}^{(0)})\le (\ell -2)(1-\epsilon )\le (\ell -1)(1-\epsilon )\]
by induction and we are done. Otherwise we may assume that \(s \in V(G_{\ell -1}^{(0)})\) and \(t \in V(G_{\ell -1}^{(1)})\). We distinguish two cases. If \(\phi _k(s)=\phi _k(t)\) these points are connected by an edge of weight one by construction. Notice that an \(\ell\)-graph does not contain points with the same level if \(\ell \le 2\). Using \(\epsilon \le \frac{1}{2}\) and \(\ell \ge 3\) we obtain
\[{{\,\textrm{dist}\,}}(s,t)\le 1\le (\ell -1)(1-\epsilon ).\]
If s and t are on different levels there is an edge of weight \((\ell -1)(1-\epsilon )\) between s and t by construction. Thus we obtain in all cases
\[{{\,\textrm{cost}\,}}(G_\ell )={{\,\textrm{dist}\,}}(s,t)\le (\ell -1)(1-\epsilon ).\]
To see the lower bound \({{\,\textrm{cost}\,}}(G_{\ell }) \ge (\ell - 1)(1-\epsilon )\), we apply Lemma 16 to the unique point \(s \in V(G_\ell )\) with \(\phi _\ell (s) = 1\) and the unique point \(t \in V(G_\ell )\) with \(\phi _\ell (t) = \ell\). This shows that \({{\,\textrm{cost}\,}}(G_{\ell }) \ge {{\,\textrm{dist}\,}}(s,t) \ge (\ell - 1)(1-\epsilon )\). \(\square\)
We show that complete linkage must reconstruct these components as clusters.
Lemma 18
Complete linkage must merge clusters on \((V(P_k),{{\,\textrm{dist}\,}})\) in such a way that for all \(\ell <k\), the clustering \({\mathscr {H}}_{\ell (1 - \epsilon )}\) consists exactly of the \((\ell +1)\)-graphs that make up \(P_k\).
Proof
We prove the claim by induction. Complete linkage always starts with every point in a separate cluster. Since those are exactly the 1-graphs and any merge of two points costs at least \((1-\epsilon )\), the claim follows for \(\ell = 0\). Suppose now that \({\mathscr {H}}_{(\ell - 1)(1-\epsilon )}\) consists exactly of the \(\ell\)-graphs of the instance. Consider two \(\ell\)-graphs \(G_{\ell } \ne G_{\ell }'\) contained in the current clustering. We compute the cost of merging \(G_{\ell }\) with \(G'_{\ell }\). For this purpose we distinguish whether they are contained in the same \((\ell +1)\)-graph or not.
Case 1: If they are contained in the same \((\ell +1)\)-graph \(G_{\ell +1}\), merging \(G_{\ell }\) with \(G'_{\ell }\) results in \(G_{\ell +1}\). We obtain by Lemma 17 that \({{\,\textrm{cost}\,}}(G_{\ell +1})=\ell (1-\epsilon )\).
Case 2: If they are not contained in the same \((\ell +1)\)-graph, we show that merging \(G_{\ell }\) with \(G'_{\ell }\) costs more than \(\ell (1-\epsilon )\). We make the following observations.
1. The edges connecting \(x\in V(G_{\ell })\) and \(y\in V(G'_{\ell })\) with \(\phi _k(x)\ne \phi _k(y)\) are of weight at least \((\ell +1)(1-\epsilon )\).
2. There exist \(s\in V(G_{\ell })\) and \(t\in V(G'_{\ell })\) with \(|\phi _k(s)-\phi _k(t)|\ge \ell -1\).
The last observation follows from the fact that each of the graphs contains two points whose difference in levels is exactly \(\ell -1\).
We prove that \({{\,\textrm{dist}\,}}(s,t)>\ell (1-\epsilon )\) and therefore merging \(G_{\ell }\) with \(G'_{\ell }\) costs more than \(\ell (1-\epsilon )\). Any shortest path connecting s and t in \(P_k\) must contain an edge \(\{u,w\}\) between a point \(u\in V(G_{\ell })\) and a point \(w\in V(G'_{\ell })\). By the above observation this edge is either of weight at least \((\ell +1)(1-\epsilon )\), or u and w are on the same level and the edge is of weight 1. In the first case we conclude
\[{{\,\textrm{dist}\,}}(s,t)\ge (\ell +1)(1-\epsilon )>\ell (1-\epsilon ).\]
In the second case we obtain by Lemma 16 that
\[{{\,\textrm{dist}\,}}(s,t)\ge {{\,\textrm{dist}\,}}(s,u)+1+{{\,\textrm{dist}\,}}(w,t)\ge \bigl (|\phi _k(s)-\phi _k(u)|+|\phi _k(w)-\phi _k(t)|\bigr )(1-\epsilon )+1\ge (\ell -1)(1-\epsilon )+1>\ell (1-\epsilon ),\]
where the last strict inequality uses \(1>1-\epsilon\). We see that \({\mathscr {H}}_{\ell (1-\epsilon )}\) must consist exactly of the \((\ell +1)\)-graphs of \(P_k\). \(\square\)
Lemma 18 shows that \({\mathscr {H}}_{(k-1)(1-\epsilon )}\) consists of all the k-graphs that make up \(P_k\). There are exactly k of them, thus the k-clustering produced by complete linkage costs \((k-1)(1-\epsilon )\).
Corollary 6
No matter how ties are broken, complete linkage computes a k-clustering on \((V(P_k),{{\,\textrm{dist}\,}})\) with diameter \((k-1)(1-\epsilon )\), while the optimal k-clustering has diameter 1.
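Combining the instance sketch above with the complete linkage sketch given after Theorem 4, Corollary 6 can be observed empirically for small k. Assuming both sketches faithfully mirror the construction, with \(\epsilon =0.25\) and \(k=4\) the printed diameter should be \((k-1)(1-\epsilon )=2.25\), while the optimal 4-clustering has diameter 1:

```python
from itertools import combinations

G, phi, dist = instance(4)
nodes = list(G.nodes)
hierarchy = complete_linkage(nodes, lambda p, q: dist[p][q])
C4 = hierarchy[len(nodes) - 4]  # the 4-clustering in the hierarchy
diam = max(max((dist[p][q] for p, q in combinations(c, 2)), default=0) for c in C4)
print(diam)
```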
A.2 Radius-based cost
We explain how to adjust the construction of the k-components for the radius. Let \(\epsilon \in (0,\frac{1}{2})\). The definition of \(K_1\) does not change. As before a k-component is constructed from two copies \(K_{k-1}^{(0)},K_{k-1}^{(1)}\) of the \((k-1)\)-component by taking the disjoint union of the corresponding graphs and increasing the level of each point in \(K_{k-1}^{(1)}\) by one. We complete \(G_k\) by adding edges between \(x\in V(G_{k-1}^{(0)})\) and \(y\in V(G_{k-1}^{(1)})\) if \(\phi _k(x)\ne \phi _k(y)\) and we assign this edge a weight of \(\lceil \frac{k}{2}\rceil (1-\epsilon )\) if \(|\phi _k(x)-\phi _k(y)|\le \lceil \frac{k}{2}\rceil -1\) and otherwise a weight of \(|\phi _k(x)-\phi _k(y)|(1-\epsilon )\).
As before the instance \(P_k\) is constructed from k-copies \(K_k^{(1)}, \ldots , K_k^{(k)}\) of the k-component \(K_k\). We take the disjoint union of the corresponding k-graphs \(G_k^{(1)}, \ldots , G_k^{(k)}\) and connect them by adding edges \(\{x, y\}\) of weight 1 for every two points \(x \in V(G_k^{(i)})\) and \(y \in V(G_k^{(j)})\) with \(\phi _k^{(i)}(x) = \phi _k^{(j)}(y)\). We observe that Lemma 16 still holds on the adjusted instance. Also notice that the diameter of an \(\ell\)-graph is still upper bounded by \((\ell -1)(1-\epsilon )\).
Lemma 19
Let \(G_{2\ell }\) be any of the \(2\ell\)-graphs that constitute \(P_k\) for \(1\le \ell \le \frac{k}{2}\). It holds that \({{\,\textrm{cost}\,}}(G_{2\ell })=\ell (1-\epsilon )\). Furthermore let \(G'_{2\ell }\) be a second \(2\ell\)-graph which is not contained in the same \(2(\ell +1)\)-graph as \(G_{2\ell }\). Any cluster containing \(G_{2\ell }\) and \(G'_{2\ell }\) costs at least \(\ell (1-\epsilon )+1\).
Proof
We know that \(G_{2\ell }\) contains points s and t with \(|\phi _k(s)-\phi _k(t)|=2\ell -1\). Thus for any \(x\in V(P_k)\) we have \(\max \{|\phi _k(s)-\phi _k(x)|,|\phi _k(t)-\phi _k(x)|\}\ge \ell\). By Lemma 16 we know that \(\max \{{{\,\textrm{dist}\,}}(s,x),{{\,\textrm{dist}\,}}(t,x)\}\ge \ell (1-\epsilon )\) and therefore \({{\,\textrm{cost}\,}}(G_{2\ell })\ge \ell (1-\epsilon )\).
To prove the upper bound suppose that \(G_{2\ell }\) covers the levels \(\lambda\) up to \(\lambda + 2\ell - 1\) in \(P_k\). Consider the unique \((\ell +1)\)-graph \(H_{\ell +1}\) contained in \(G_{2\ell }\) covering the levels \(\lambda +\ell -1\) to \(\lambda +2\ell -1\). Let c be the unique point in \(H_{\ell +1}\) with level \(\lambda +\ell -1\). Remember that the diameter of \(H_{\ell +1}\) is at most \(\ell (1-\epsilon )\), so any point in \(H_{\ell +1}\) is at distance at most \(\ell (1-\epsilon )\) to c. Consider now a point \(x\in V(G_{2\ell })\backslash V(H_{\ell +1})\). We know that \(\phi _k(x)<\lambda +2\ell -1\) and thus \(|\phi _k(x)-\phi _k(c)|\le \ell -1\). By construction there exists an edge of weight at most \(\ell (1-\epsilon )\) between x and c and thus \({{\,\textrm{dist}\,}}(x,c)\le \ell (1-\epsilon )\). Hence c is a center of radius \(\ell (1-\epsilon )\) for \(G_{2\ell }\), which shows \({{\,\textrm{cost}\,}}(G_{2\ell })\le \ell (1-\epsilon )\).
It is left to show that any cluster containing \(G_{2\ell }\) and \(G'_{2\ell }\) costs at least \(\ell (1-\epsilon )+1\). Let \(y\in V(P_k)\) and let \(H_{2(\ell +1)}\) be the \(2(\ell +1)\)-graph containing y. Assume without loss of generality that \(G_{2\ell }\) is not contained in \(H_{2(\ell +1)}\). Let \(x\in V(G_{2\ell })\) be a point with \(|\phi _k(x)-\phi _k(y)|\ge \ell\). We claim that \({{\,\textrm{dist}\,}}(x,y)\ge \ell (1-\epsilon )+1\), so that no point y can serve as a center of smaller radius for a cluster containing both graphs. A shortest path connecting x and y must contain an edge \(\{u,w\}\) with \(u\in V(P_k)\backslash V(H_{2(\ell +1)})\) and \(w\in V(H_{2(\ell +1)})\). We know by construction that either \(\phi _k(u)=\phi _k(w)\) and the edge has weight 1, or the edge has weight at least \((\ell +2)(1-\epsilon )\). In the first case we use Lemma 16 and obtain
\[{{\,\textrm{dist}\,}}(x,y)\ge {{\,\textrm{dist}\,}}(x,u)+1+{{\,\textrm{dist}\,}}(w,y)\ge \bigl (|\phi _k(x)-\phi _k(u)|+|\phi _k(w)-\phi _k(y)|\bigr )(1-\epsilon )+1\ge \ell (1-\epsilon )+1,\]
and in the second case we obtain
\[{{\,\textrm{dist}\,}}(x,y)\ge (\ell +2)(1-\epsilon )=\ell (1-\epsilon )+2(1-\epsilon )\ge \ell (1-\epsilon )+1,\]
since \(\epsilon \le \frac{1}{2}\).
\(\square\)
This immediately leads to the following results.
Corollary 7
Complete linkage must merge clusters on \((V(P_k),{{\,\textrm{dist}\,}})\) in such a way that for all \(1\le \ell \le \frac{k}{2}\), the clustering \({\mathscr {H}}_{\ell (1 - \epsilon )}\) consists exactly of the \(2\ell\)-graphs that make up \(P_k\).
Corollary 8
No matter how ties are broken, complete linkage computes a k-clustering on \((V(P_k),{{\,\textrm{dist}\,}})\) with radius \(\frac{k}{2}(1-\epsilon )\), while the optimal k-clustering has radius 1.