Hierarchical Clustering Algorithms
Introduction
Agglomerative hierarchical clustering has been the dominant approach to constructing embedded classification schemes. It is our aim to direct the reader's attention to practical algorithms and methods, both efficient (from the computational and storage points of view) and effective (from the application point of view). It is often helpful to distinguish between a method, involving a compactness criterion and the target structure of a 2-way tree representing the partial order on subsets of the power set, and an implementation, which relates to the detail of the algorithm used.
As with many other multivariate techniques, the objects to be classified have
numerical measurements on a set of variables or attributes. Hence, the analysis
is carried out on the rows of an array or matrix. If we do not have a matrix of
numerical values to begin with, then it may be necessary to skilfully construct
such a matrix. The objects, or rows of the matrix, can be viewed as vectors in
a multidimensional space (the dimensionality of this space being the number of
variables or columns). A geometric framework of this type is not the only one
which can be used to formulate clustering algorithms. Suitable alternative forms of storage of a rectangular array of values are not inconsistent with viewing the problem in geometric terms (and in matrix terms, for example expressing the adjacency relations in a graph).
Motivation for clustering in general, covering hierarchical clustering and applications, includes the following: analysis of data; interactive user interfaces;
storage and retrieval; and pattern recognition.
Surveys of clustering with coverage also of hierarchical clustering include
Gordon (1981), March (1983), Jain and Dubes (1988), Gordon (1987), Mirkin
(1996), Jain, Murty and Flynn (1999), and Xu and Wunsch (2005). Lerman
(1981) and Janowitz (2010) present overarching reviews of clustering, including through the use of lattices that generalize trees. The case for the central role of hierarchical clustering in information retrieval was made by van Rijsbergen (1979)
and continued in the work of Willett (e.g. Griffiths et al., 1984) and others.
Various mathematical views of hierarchy, all expressing symmetry in one way
or another, are explored in Murtagh (2009).
This article is organized as follows.
In section 2 we look at the issue of normalization of data, prior to inducing
a hierarchy on the data.
In section 3 some historical remarks and motivation are provided for hierarchical agglomerative clustering.
In section 4, we discuss the Lance-Williams formulation of a wide range
of algorithms, and how these algorithms can be expressed in graph theoretic
terms and in geometric terms. In section 5, we describe the principles of the
reciprocal nearest neighbor and nearest neighbor chain algorithms, which support building a hierarchical clustering in a more efficient manner than the general Lance-Williams or geometric approaches.
In section 6 we overview the hierarchical Kohonen self-organizing feature
map, and also hierarchical model-based clustering. We conclude this section
with some reflections on divisive hierarchical clustering, in general.
Section 7 surveys developments in grid- and density-based clustering. The
following section, section 8, presents a recent algorithm of this type, which is
particularly suitable for the hierarchical clustering of massive data sets.
A hierarchical clustering, displayed as a dendrogram, answers such questions as: how many useful groups are in this data? and what are the salient interrelationships present? It can be noted, however, that a dendrogram can feasibly provide differing answers to most of these questions, depending on the application.
A wide range of agglomerative hierarchical clustering algorithms have been proposed at one time or another. Such hierarchical algorithms may be conveniently
broken down into two groups of methods. The first group is that of linkage methods: the single, complete, weighted and unweighted average linkage methods. These are methods for which a graph representation can be used. Sneath
and Sokal (1973) may be consulted for many other graph representations of the
stages in the construction of hierarchical clusterings.
The second group of hierarchical clustering methods are methods which allow
the cluster centers to be specified (as an average or a weighted average of the
member vectors of the cluster). These methods include the centroid, median
and minimum variance methods.
The latter may be specified either in terms of dissimilarities alone, or alternatively in terms of cluster center coordinates and dissimilarities. A very convenient formulation, in dissimilarity terms, which embraces all the hierarchical methods mentioned so far, is the Lance-Williams dissimilarity update formula. If points (objects) i and j are agglomerated into cluster i ∪ j, then we must simply specify the new dissimilarity between the cluster and all other points (objects or clusters). The formula is:
d(i ∪ j, k) = α_i d(i, k) + α_j d(j, k) + β d(i, j) + γ |d(i, k) - d(j, k)|

where α_i, α_j, β, and γ define the agglomerative criterion. Values of these are listed in the second column of Table 1. In the case of the single link method, using α_i = α_j = 1/2, β = 0, and γ = -1/2 gives us
d(i ∪ j, k) = (1/2) d(i, k) + (1/2) d(j, k) - (1/2) |d(i, k) - d(j, k)|

which, it may be verified, can be rewritten as

d(i ∪ j, k) = min{d(i, k), d(j, k)}.
Hierarchical clustering method (and aliases); Lance-Williams coefficients; coordinates of the center of the cluster which agglomerates clusters i and j; and dissimilarity between cluster centers g_i and g_j:

Single link (nearest neighbor):
  α_i = 0.5; β = 0; γ = -0.5
  (More simply: min{d_ik, d_jk})

Complete link (diameter):
  α_i = 0.5; β = 0; γ = 0.5
  (More simply: max{d_ik, d_jk})

Group average (average link, UPGMA):
  α_i = |i| / (|i| + |j|); β = 0; γ = 0

McQuitty's method (WPGMA):
  α_i = 0.5; β = 0; γ = 0

Median method (Gower's, WPGMC):
  α_i = 0.5; β = -0.25; γ = 0
  Center of the agglomerated cluster: g = (g_i + g_j) / 2
  Dissimilarity between cluster centers: ‖g_i - g_j‖²

Centroid (UPGMC):
  α_i = |i| / (|i| + |j|); β = -|i||j| / (|i| + |j|)²; γ = 0
  Center of the agglomerated cluster: g = (|i| g_i + |j| g_j) / (|i| + |j|)
  Dissimilarity between cluster centers: ‖g_i - g_j‖²

Ward's method (minimum variance, error sum of squares):
  α_i = (|i| + |k|) / (|i| + |j| + |k|); β = -|k| / (|i| + |j| + |k|); γ = 0
  Center of the agglomerated cluster: g = (|i| g_i + |j| g_j) / (|i| + |j|)
  Dissimilarity between cluster centers: (|i||j| / (|i| + |j|)) ‖g_i - g_j‖²
Notes: |i| is the number of objects in cluster i. g_i is a vector in m-space (m is the number of attributes), either an initial point or a cluster center. ‖.‖ is the norm in the Euclidean metric. The names UPGMA, etc. are due to Sneath and Sokal (1973). Coefficient α_j, with index j, is defined identically to coefficient α_i with index i. Finally, the Lance and Williams recurrence formula is (with |.| expressing absolute value):

d_{i∪j,k} = α_i d_{ik} + α_j d_{jk} + β d_{ij} + γ |d_{ik} - d_{jk}|.

Table 1: Specifications of seven hierarchical clustering methods.
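To make the Lance-Williams scheme concrete, the following minimal Python sketch applies the update formula with the Table 1 coefficients for four of the listed methods; the function and variable names are ours, for illustration only. The final assertion checks that the single link coefficients do reduce to the minimum rule derived above.

```python
# Minimal sketch of the Lance-Williams dissimilarity update (coefficients as in Table 1).
# Names (coefficients, lance_williams_update) are illustrative, not taken from the text.

def coefficients(method, ni, nj, nk):
    """Return (alpha_i, alpha_j, beta, gamma) for cluster sizes ni, nj, nk."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":          # group average, UPGMA
        return ni / (ni + nj), nj / (ni + nj), 0.0, 0.0
    if method == "ward":             # minimum variance
        n = ni + nj + nk
        return (ni + nk) / n, (nj + nk) / n, -nk / n, 0.0
    raise ValueError(method)

def lance_williams_update(d_ik, d_jk, d_ij, method, ni, nj, nk):
    """Dissimilarity between the newly formed cluster (i union j) and cluster k."""
    ai, aj, b, g = coefficients(method, ni, nj, nk)
    return ai * d_ik + aj * d_jk + b * d_ij + g * abs(d_ik - d_jk)

# For the single link coefficients the update reduces to min{d(i,k), d(j,k)}:
assert lance_williams_update(3.0, 5.0, 2.0, "single", 1, 1, 1) == min(3.0, 5.0)
```

The weighted variants (WPGMA, median) and the centroid method differ only in their coefficient definitions and could be added in the same way.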
The dissimilarity-based and the cluster center based specifications turn out to be equivalent. In the case of the median method, for instance, we have the following (cf. Table 1).
Let a and b be two points (i.e. m-dimensional vectors: these are objects
or cluster centers) which have been agglomerated, and let c be another point.
From the Lance-Williams dissimilarity update formula, using squared Euclidean
distances, we have:
d²(a ∪ b, c) = d²(a, c)/2 + d²(b, c)/2 - d²(a, b)/4
             = ‖a - c‖²/2 + ‖b - c‖²/2 - ‖a - b‖²/4.        (3)
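Equation (3) can be checked numerically: with squared Euclidean distances, the right-hand side equals the squared distance from c to the midpoint (a + b)/2, which is precisely the center that the median method assigns to the agglomerated pair. A small illustrative sketch (our own code, not from the source):

```python
# Numeric check of equation (3): the Lance-Williams median update, applied to
# squared Euclidean distances, equals the squared distance from c to the
# midpoint (a + b)/2, i.e. to the new cluster center of the median method.
import numpy as np

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 4))            # three points in 4-dimensional space

d2 = lambda u, v: float(np.sum((u - v) ** 2))
lw_median = 0.5 * d2(a, c) + 0.5 * d2(b, c) - 0.25 * d2(a, b)

assert np.isclose(lw_median, d2((a + b) / 2, c))
```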
The stored data algorithm proceeds as follows. Step 1: examine all interpoint dissimilarities, and agglomerate the closest pair of points. Step 2: replace the two agglomerated points by a representative point (for example their center of gravity) or by a cluster fragment. Step 3: return to step 1, treating clusters as well as remaining objects, until only one cluster remains. In steps 1 and 2, "point" refers either to objects or clusters, both of which are defined as vectors in the case of cluster center methods. This algorithm is justified by storage considerations, since we have O(n) storage required for the n initial objects and O(n) storage for the n - 1 (at most) clusters. In the case of linkage methods, the term "fragment" in step 2 refers (in the terminology of graph theory) to a connected component in the case of the single link method and to a clique or complete subgraph in the case of the complete link method. Without consideration of any special algorithmic speed-ups, the overall complexity of the above algorithm is O(n³), due to the repeated calculation of dissimilarities in step 1, coupled with O(n) iterations through steps 1, 2 and 3. While the stored data algorithm is instructive, it does not lend itself to efficient implementations.
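As an illustration of the stored data approach, here is a rough Python sketch for the centroid criterion, in which a merged pair is replaced by its size-weighted center of gravity; all names are ours, and the code is meant only to exhibit the O(n) storage and O(n³) time behaviour, not to be an efficient implementation.

```python
# Sketch of the "stored data" agglomerative approach for the centroid method:
# only points/centers are kept (O(n) storage); dissimilarities are recomputed
# at every iteration, giving O(n^3) time overall.
import numpy as np

def stored_data_centroid_clustering(points):
    """points: (n, m) array-like. Returns the agglomerations as
    (members_of_first, members_of_second, dissimilarity) triples."""
    centers = [np.asarray(p, dtype=float) for p in points]
    members = [[i] for i in range(len(centers))]   # object indices per cluster
    sizes = [1] * len(centers)
    merges = []
    while len(centers) > 1:
        # Step 1: recompute all pairwise dissimilarities; find the closest pair.
        best = None
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                d = float(np.sum((centers[i] - centers[j]) ** 2))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 2: replace the pair by its center of gravity (weighted by size).
        new_center = (sizes[i] * centers[i] + sizes[j] * centers[j]) / (sizes[i] + sizes[j])
        merges.append((members[i], members[j], d))
        new_members, new_size = members[i] + members[j], sizes[i] + sizes[j]
        for lst in (centers, members, sizes):
            del lst[j], lst[i]                     # j > i, so delete j first
        centers.append(new_center)
        members.append(new_members)
        sizes.append(new_size)
        # Step 3: repeat until a single cluster remains.
    return merges
```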
In the section to follow, we look at the reciprocal nearest neighbor and mutual nearest neighbor approaches, which lend themselves to more efficient implementations.
Early, efficient algorithms for hierarchical clustering are due to Sibson (1973),
Rohlf (1973) and Defays (1977). Their O(n²) implementations of the single link
method and of a (non-unique) complete link method, respectively, have been
widely cited.
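The nearest neighbor chain idea referred to in the section outline can be sketched as follows. This is our own illustrative code, assuming Ward's minimum variance criterion (which satisfies the reducibility property that makes agglomerating reciprocal nearest neighbors safe) and a stored dissimilarity table updated with the Lance-Williams formula; none of the names come from the source.

```python
import numpy as np

def nn_chain_ward(points):
    """Ward agglomeration via the nearest-neighbor chain.
    Returns merge triples (label_a, label_b, dissimilarity); labels >= n
    refer to clusters created by earlier agglomerations."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    size = {i: 1 for i in range(n)}
    # Ward dissimilarity between singletons: (1*1/(1+1)) * squared Euclidean distance.
    d = {(i, j): 0.5 * float(np.sum((X[i] - X[j]) ** 2))
         for i in range(n) for j in range(i + 1, n)}
    dd = lambda a, b: d[(a, b)] if a < b else d[(b, a)]
    active, chain, merges, next_label = set(range(n)), [], [], n
    while len(active) > 1:
        if not chain:
            chain = [next(iter(active))]
        while True:
            tail = chain[-1]
            # Nearest active neighbor of the chain's tail.
            nn = min((c for c in active if c != tail), key=lambda c: dd(tail, c))
            if len(chain) > 1 and dd(tail, chain[-2]) <= dd(tail, nn):
                break                      # tail and its predecessor are reciprocal NNs
            chain.append(nn)
        i, j = chain.pop(), chain.pop()    # agglomerate the reciprocal pair
        merges.append((i, j, dd(i, j)))
        new = next_label
        next_label += 1
        # Lance-Williams update for Ward's criterion.
        for k in active:
            if k not in (i, j):
                ni, nj, nk = size[i], size[j], size[k]
                d[(k, new)] = ((ni + nk) * dd(i, k) + (nj + nk) * dd(j, k)
                               - nk * dd(i, j)) / (ni + nj + nk)
        size[new] = size[i] + size[j]
        active -= {i, j}
        active.add(new)
    return merges
```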
It is quite impressive how 2D (2-dimensional or, for that matter, 3D) image
signals can handle with ease the scalability limitations of clustering and many
other data processing operations. The contiguity imposed on adjacent pixels or
grid cells bypasses the need for nearest neighbor finding. It is very interesting
therefore to consider the feasibility of taking problems of clustering massive data
sets into the 2D image domain. The Kohonen self-organizing feature map exemplifies this well. In its basic variant (Kohonen, 1984, 2001) it can be formulated in terms of k-means clustering subject to a set of interrelationships between the cluster centers (Murtagh and Fernández-Pajares, 1995).
Kohonen maps lend themselves well to hierarchical representation. Lampinen and Oja (1992), Dittenbach et al. (2002) and Endo et al. (2002) elaborate on the Kohonen map in this way. An example application in character recognition is Miikkulainen (1990).
A short, informative review of hierarchical self-organizing maps is provided by Vicente and Vellido (2004). These authors also review what they term probabilistic hierarchical models. This includes putting into a hierarchical framework the following: Gaussian mixture models, and a probabilistic Bayesian alternative to the Kohonen self-organizing map, termed Generative Topographic Mapping (GTM).
GTM can be traced to the Kohonen self-organizing map in the following way.
Firstly, we consider the hierarchical map as brought about through a growing
process, i.e. the target map is allowed to grow in terms of layers, and of grid
points within those layers. Secondly, we impose an explicit probability density
model on the data. Tino and Nabney (2002) discuss how the local models are organized in a hierarchical way.
In Wang et al. (2000), an alternation of Gaussian mixture modeling and principal component analysis is described, in this way furnishing a hierarchy of model-based clusters. AIC, the Akaike information criterion, is used for selection of the best cluster model overall.
Murtagh et al. (2005) use a top level Gaussian mixture modeling with the (spatially aware) PLIC, the pseudo-likelihood information criterion, used for cluster selection and identifiability. Then, at the next level, and potentially also for further divisive hierarchical levels, the Gaussian mixture modeling is continued, but now using the marginal distributions within each cluster, and using the analogous Bayesian clustering identifiability criterion, which is the Bayesian information criterion, BIC. The resulting output is referred to as a model-based cluster tree.
The model-based cluster tree algorithm of Murtagh et al. (2005) is a divisive
hierarchical algorithm. Earlier in this article, we considered agglomerative algorithms. However it is often feasible to implement a divisive algorithm instead,
especially when a graph cut (for example) is important for the application concerned. Mirkin (1996, chapter 7) describes divisive Ward, i.e. minimum variance, hierarchical clustering, which is also closely related to bisecting k-means.
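A generic, non-spatial sketch of the divisive, model-based idea, using Gaussian mixtures selected by BIC at every level, might look as follows. This is a simplification of the Murtagh et al. (2005) scheme (which uses the spatially aware PLIC at the top level); scikit-learn's GaussianMixture is assumed, and all thresholds and names are illustrative.

```python
# Rough sketch of a divisive, model-based cluster tree: at each node fit
# Gaussian mixtures with 1..k_max components, keep the BIC-best model, and
# recurse into each resulting cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

def model_based_tree(X, k_max=4, min_size=20, depth=0, max_depth=3):
    X = np.asarray(X, dtype=float)
    if len(X) < min_size or depth >= max_depth:
        return {"size": len(X), "children": []}
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, min(k_max, len(X)) + 1)]
    best = min(models, key=lambda m: m.bic(X))
    if best.n_components == 1:            # no further split supported by BIC
        return {"size": len(X), "children": []}
    labels = best.predict(X)
    children = [model_based_tree(X[labels == c], k_max, min_size, depth + 1, max_depth)
                for c in range(best.n_components)]
    return {"size": len(X), "children": children}
```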
A class of methods under the name of spectral clustering uses eigenvalue/eigenvector
reduction on the (graph) adjacency matrix. As von Luxburg (2007) points out
in reviewing this field of spectral clustering, such methods have been discovered, re-discovered, and extended many times in different communities. Far from viewing this great deal of work on clustering in a pessimistic way, we see the perennial and pervasive interest in clustering as testifying to
the continual renewal and innovation in algorithm developments, faced with
application needs.
It is indeed interesting to note how the clusters in a hierarchical clustering
may be defined by the eigenvectors of a dissimilarity matrix, but subject to
carrying out the eigenvector reduction in a particular algebraic structure, a semiring with additive and multiplicative operations given by min and max,
respectively (Gondran, 1976).
In the next section, section 7, the themes of mapping, and of divisive algorithms, are taken in a somewhat different direction. As always, the application at issue is highly relevant for the choice of the hierarchical clustering algorithm.
Many modern clustering techniques focus on large data sets. In Xu and Wunsch
(2008, p. 215) these are classified as follows:
- Random sampling
- Data condensation
- Density-based approaches
- Grid-based approaches
- Divide and conquer
- Incremental learning
From the point of view of this article, we select density- and grid-based approaches, i.e., methods that either look for data densities or split the data
space into cells when looking for groups. In this section we take a look at these
two families of methods.
The main idea is to use a grid-like structure to split the information space, separating the dense grid regions from the less dense ones to form groups.
In general, a typical approach within this category will consist of the following steps, as presented by Grabusts and Borisov (2002); a minimal code sketch follows the list:
1. Creating a grid structure, i.e. partitioning the data space into a finite
number of non-overlapping cells.
2. Calculating the cell density for each cell.
3. Sorting of the cells according to their densities.
4. Identifying cluster centers.
5. Traversal of neighbor cells.
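As a minimal sketch of the five steps above (our own code, with illustrative parameter choices; cell densities are thresholded rather than sorted, and the neighbor traversal merges axis-adjacent dense cells):

```python
# Minimal sketch of the generic grid-based scheme: build a grid, compute cell
# densities, keep the dense cells, and merge neighboring dense cells into clusters.
import numpy as np
from collections import deque

def grid_clusters(X, n_bins=10, density_threshold=5):
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Step 1: partition the data space into non-overlapping cells.
    cells = np.minimum(((X - lo) / (hi - lo + 1e-12) * n_bins).astype(int), n_bins - 1)
    # Step 2: cell density = number of points per occupied cell.
    density = {}
    for cell in map(tuple, cells):
        density[cell] = density.get(cell, 0) + 1
    # Steps 3-4: keep the dense cells as candidate cluster centers.
    dense = {c for c, cnt in density.items() if cnt >= density_threshold}
    # Step 5: traverse neighboring dense cells, merging them into clusters.
    labels, current = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            c = queue.popleft()
            for dim in range(X.shape[1]):
                for step in (-1, 1):
                    nb = c[:dim] + (c[dim] + step,) + c[dim + 1:]
                    if nb in dense and nb not in labels:
                        labels[nb] = current
                        queue.append(nb)
        current += 1
    # Map each point to its cell's cluster (-1 for points in sparse cells).
    return np.array([labels.get(tuple(c), -1) for c in cells])
```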
Some of the more important algorithms within this category are the following:
STING: STatistical INformation Grid-based clustering was proposed by
Wang et al. (1997) who divide the spatial area into rectangular cells represented by a hierarchical structure. The root is at hierarchical level 1, its
children at level 2, and so on. This algorithm has a computational complexity of O(K), where K is the number of cells in the bottom layer. This
implies that scaling this method to higher dimensional spaces is difficult
(Hinneburg and Keim, 1999). For example, if in high dimensional data space each cell has four children, then the number of cells in the second level will be 2^m, where m is the dimensionality of the database.
In the last section, section 7, we have seen a number of clustering methods that
split the data space into cells, cubes, or dense regions to locate high density
areas that can be further studied to find clusters.
For large data sets, clustering via an m-adic expansion (m integer, which if a prime is usually denoted as p) is possible, with the advantage that the clustering algorithm based on this expansion runs in linear time. The usual base 10 system for numbers is none other than the case of m = 10, and the base 2 or binary system can be referred to as 2-adic, where p = 2. Let us consider the following distance, relating to the case of vectors x and y with 1 attribute, hence unidimensional:
d_B(x, y) = 1            if x_1 ≠ y_1
d_B(x, y) = inf m^(-k)   if x_k = y_k, 1 ≤ k ≤ |K|        (5)
This distance defines the longest common prefix of strings. A space of strings,
with this distance, is a Baire space. Thus we call this the Baire distance: here the
longer the common prefix, the closer a pair of sequences. What is of interest to
us here is this longest common prefix metric, which is an ultrametric (Murtagh
et al., 2008).
For example, let us consider two such values, x and y. We take x and y to be bounded by 0 and 1. Each is of some precision, and we take the integer |K| to be the maximum precision.
Thus we consider ordered sets x_k and y_k for k ∈ K. So, k = 1 is the index of the first decimal place of precision; k = 2 is the index of the second decimal place; ...; k = |K| is the index of the |K|th decimal place. The cardinality of the set K is the precision with which a number, x, is measured.
Consider as an example x = 0.478 and y = 0.472. In this case, |K| = 3. Start from the first decimal position. For k = 1, we have x_k = y_k = 4. For k = 2, x_k = y_k. But for k = 3, x_k ≠ y_k. Hence their Baire distance is 10^(-2) for base m = 10.
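A small sketch of the Baire distance computation, assuming numbers in [0, 1] represented to a fixed number of decimal digits (the function name and defaults are ours); it reproduces the worked example above:

```python
# Sketch of the Baire (longest common prefix) distance for numbers in [0, 1],
# base m = 10, compared digit by digit up to |K| = precision decimal places.
def baire_distance(x, y, m=10, precision=3):
    xs = f"{x:.{precision}f}".split(".")[1]      # digits after the decimal point
    ys = f"{y:.{precision}f}".split(".")[1]
    if xs[0] != ys[0]:
        return 1.0
    k = 0
    while k < precision and xs[k] == ys[k]:
        k += 1
    return m ** (-k)

assert baire_distance(0.478, 0.472) == 10 ** (-2)   # longest common prefix "47"
```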
It is seen that this distance splits a unidimensional string of decimal values
into a 10-way hierarchy, in which each leaf can be seen as a grid cell. From
equation (5) we can read off the distance between points assigned to the same
grid cell. All pairwise distances of points assigned to the same cell are the same.
Clustering using this Baire distance has been successfully applied to areas
such as chemoinformatics (Murtagh et al., 2008), astronomy (Contreras and
Murtagh, 2009) and text retrieval (Contreras, 2010).
Conclusions
Hierarchical clustering methods, with roots going back to the 1960s and 1970s,
are continually replenished with new challenges. As a family of algorithms they
are central to the addressing of many important problems. Their deployment in
many application domains testifies to how hierarchical clustering methods will
remain crucial for a long time to come.
We have looked at both traditional agglomerative hierarchical clustering, and more recent developments in grid- or cell-based approaches. We have discussed various algorithmic aspects, including well-definedness (e.g. inversions) and computational properties. We have also touched on a number of application domains, again in areas that reach back over some decades (chemoinformatics) or many decades (information retrieval, which motivated much early work in clustering, including hierarchical clustering), as well as more recent application domains (such as hierarchical model-based clustering approaches).
References
16. Gan G, Ma C and Wu J, Data Clustering: Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics (SIAM), 2007.
17. Gillet VJ, Wild DJ, Willett P and Bradshaw J, Similarity and dissimilarity methods for processing chemical structure databases, Computer Journal 1998, 41: 547-558.
18. Gondran M, Valeurs propres et vecteurs propres en classification hiérarchique, RAIRO Informatique Théorique 1976, 10(3): 39-46.
19. Gordon AD, Classification, Chapman and Hall, London, 1981.
20. Gordon AD, A review of hierarchical classification, Journal of the Royal Statistical Society A 1987, 150: 119-137.
21. Grabusts P and Borisov A, Using grid-clustering methods in data classification, in PARELEC 02: Proceedings of the International Conference on Parallel Computing in Electrical Engineering. Washington, DC: IEEE Computer Society, 2002.
22. Graham RL and Hell P, On the history of the minimum spanning tree problem, Annals of the History of Computing 1985, 7: 43-57.
23. Griffiths A, Robinson LA and Willett P, Hierarchic agglomerative clustering methods for automatic document classification, Journal of Documentation 1984, 40: 175-205.
24. Hinneburg A and Keim DA, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining. New York: AAAI Press, 1998, pp. 58-68.
25. Hinneburg A and Keim D, Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering, in VLDB 99: Proceedings of the 25th International Conference on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann Publishers Inc., 1999, pp. 506-517.
26. Jain AK and Dubes RC, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, 1988.
27. Jain AK, Murty MN and Flynn PJ, Data clustering: a review, ACM Computing Surveys 1999, 31: 264-323.
28. Janowitz MF, Ordinal and Relational Clustering, World Scientific, Singapore, 2010.
29. Juan J, Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques, Les Cahiers de l'Analyse des Données 1982, VII: 219-225.
60. Wang Y, Freedman MI and Kung S-Y, Probabilistic principal component subspaces: a hierarchical finite mixture model for data visualization, IEEE Transactions on Neural Networks 2000, 11(3): 625-636.
61. White HD and McCain KW, Visualization of literatures, in ME Williams, Ed., Annual Review of Information Science and Technology (ARIST) 1997, 32: 99-168.
62. Vicente D and Vellido A, Review of hierarchical models for data clustering and visualization, in R. Giráldez, J.C. Riquelme and J.S. Aguilar-Ruiz, Eds., Tendencias de la Minería de Datos en España. Red Española de Minería de Datos, 2004.
63. Willett P, Efficiency of hierarchic agglomerative clustering using the ICL distributed array processor, Journal of Documentation 1989, 45: 1-45.
64. Wishart D, Mode analysis: a generalization of nearest neighbour which reduces chaining effects, in A.J. Cole, Ed., Numerical Taxonomy, Academic Press, New York, 282-311, 1969.
65. Xu R and Wunsch D, Survey of clustering algorithms, IEEE Transactions on Neural Networks 2005, 16: 645-678.
66. Xu R and Wunsch DC, Clustering, IEEE Computer Society Press, 2008.
67. Xu X, Ester M, Kriegel H-P and Sander J, A distribution-based clustering algorithm for mining in large spatial databases, in ICDE 98: Proceedings of the Fourteenth International Conference on Data Engineering. Washington, DC: IEEE Computer Society, 1998, pp. 324-331.
68. Xu X, Jäger J and Kriegel H-P, A fast parallel clustering algorithm for large spatial databases, Data Mining and Knowledge Discovery 1999, 3(3): 263-290.
69. Zaïane OR and Lee C-H, Clustering spatial data in the presence of obstacles: a density-based approach, in IDEAS 02: Proceedings of the 2002 International Symposium on Database Engineering and Applications. Washington, DC: IEEE Computer Society, 2002, pp. 214-223.