
Supervised k-Means Clustering

Thomas Finley
Department of Computer Science, Cornell University, Ithaca, NY, USA
tomf@cs.cornell.edu

Thorsten Joachims
Department of Computer Science, Cornell University, Ithaca, NY, USA
tj@cs.cornell.edu

ABSTRACT

The k-means clustering algorithm is one of the most widely used, effective, and best understood clustering methods. However, successful use of k-means requires a carefully chosen distance measure that reflects the properties of the clustering task. Since designing this distance measure by hand is often difficult, we provide methods for training k-means using supervised data. Given training data in the form of sets of items with their desired partitioning, we provide a structural SVM method that learns a distance measure so that k-means produces the desired clusterings. We propose two variants of the method – one based on a spectral relaxation and one based on the traditional k-means algorithm – that are both computationally efficient. For each variant, we provide a theoretical characterization of its accuracy in solving the training problem. We also provide an empirical clustering quality and runtime analysis of these learning methods on varied high-dimensional datasets.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning—induction, parameter learning; I.5.3 [Pattern Recognition]: Clustering—algorithms, similarity measures

Keywords

Support Vector Machines (SVM), Training Algorithms, Clustering

1. INTRODUCTION

Clustering is an important data mining task employed in dataset exploration and in other settings where one wishes to partition sets into related groups. Among the algorithms typically used for clustering, k-means is arguably one of the most widely used and effective clustering methods. Successful use of k-means, however, requires a carefully chosen similarity measure that must be constructed to fit the task at hand. For example, in Noun-Phrase Co-Reference Resolution (see e.g., [16]), one must select a similarity measure so that, for a given set of noun phrases occurring in a document, those that refer to the same entity in the world are indeed clustered into the same cluster. Unfortunately, hand-tuning the similarity measure is difficult, since it is unclear how changes in the similarity measure relate to the behavior of the k-means algorithm.

In this paper we propose a supervised learning approach to finding a similarity measure so that k-means provides the desired clusterings for the task at hand. Given training examples of item sets with their correct clusterings, the goal is to learn a similarity measure so that future sets of items are clustered in a similar fashion. In particular, we provide a structural support vector machine (SSVM) algorithm for this supervised k-means learning problem, capable of directly optimizing a parameterized similarity measure to maximize cluster accuracy. We show theoretically and empirically that the algorithm is efficient, and that it provides improved clustering accuracy compared to non-learning methods, as well as compared to more naive approaches to this supervised clustering problem.

2. RELATED WORK

Supervised clustering is the task of automatically adapting a clustering algorithm with the aid of a training set consisting of item sets and complete partitionings of these item sets. Past applications of supervised clustering include image segmentation [1], news article clustering, noun-phrase co-reference [10], and streaming email batch clustering [11]. These examples are similar to this work insofar as they learn a parameterized item-pair similarity from complete partitions of item sets. However, there are important differences. The methods of [10, 11] provide structural SVM based supervised clustering, but the underlying method is correlation clustering [2] rather than k-means. The method of [1] learns similarity measures for spectral clustering; differences in formulations aside, this method requires a special optimization procedure and is tightly coupled to a relaxed version of spectral clustering, rather than being able to optimize both relaxed and discrete k-means clusterers.

A related field is semi-supervised clustering, where it is common to also learn a parameterized similarity measure [3, 4, 6, 15]. However, this learning problem is markedly different from supervised clustering. In semi-supervised clustering, the user has a single large dataset to cluster, with incomplete information about clustering, usually in the form of pairwise constraints about cluster membership. This difference leads to very different algorithms in the two settings.
3. PARAMETERIZED K-MEANS

In this section we shall introduce the k-means clustering algorithm, and then describe increasingly complex parameterizations of k-means that allow us to adjust the clusterings k-means produces through supervised learning.

The k-means clustering algorithm is classically described as taking an input set x of m items, x1, x2, ..., xm, where each item xi has some corresponding vector ψi ∈ R^N (see Footnote 1). A clustering algorithm computes some clustering y of x with k clusters so as to minimize intracluster Euclidean distance over these ψi, i.e.,

    argmin_y \sum_{c \in y} \sum_{x_i \in c} \Big\| \psi_i - \frac{1}{|c|} \sum_{x_j \in c} \psi_j \Big\|_2^2 .    (1)

Footnote 1: To avoid confusion, note that by k-means we refer to the general problem of trying to minimize (1), and emphatically not to any one particular instantiation of a search procedure that attempts to solve this problem, e.g., batch k-means, point-iterative k-means, or the spectral clustering algorithms.

Algebraic manipulation reveals this minimization is equivalent to finding y to maximize

    argmax_y \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \langle \psi_i, \psi_j \rangle    (2)

in a form often called kernel k-means [8].

How can we parameterize this objective function (2) to provide a family of similarity measures for learning? A simple but powerful parameterization is to provide some linear weighting w ∈ R^N to distort the ψi dimensions:

    argmax_y \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \psi_i^T \, \mathrm{diag}(w) \, \psi_j .    (3)

We can alternately phrase (3) as

    argmax_y \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \langle w, \psi_i \circ \psi_j \rangle .    (4)

Here, ◦ is the componentwise vector product. By changing weights in w, we affect what clustering y of x is optimal under this parameterized k-means objective (4).
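To make the parameterization concrete, the short Python sketch below (not from the paper's implementation; the function name and toy data are illustrative) evaluates objective (4) for a few candidate clusterings under different weight vectors w, which is exactly the quantity a clustering algorithm would try to maximize.

    import numpy as np

    def kmeans_objective(w, psi, clustering):
        """Parameterized kernel k-means objective (4):
        sum over clusters c of (1/|c|) * sum_{i,j in c} <w, psi_i o psi_j>."""
        total = 0.0
        for c in clustering:
            s = 0.0
            for i in c:
                for j in c:
                    s += np.dot(w, psi[i] * psi[j])   # <w, psi_i o psi_j>
            total += s / len(c)
        return total

    # Toy data: four items in R^2, two candidate 2-clusterings.
    psi = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    candidates = [[[0, 1], [2, 3]], [[0, 2], [1, 3]]]

    for w in (np.array([1.0, 1.0]), np.array([0.0, 1.0])):
        scores = [kmeans_objective(w, psi, y) for y in candidates]
        print(w, scores, "best:", candidates[int(np.argmax(scores))])

Different choices of w generally score the candidate clusterings differently, which is what the supervised learning procedure in Section 4 will exploit.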
3.1 Kernel Learning Parameterizations

Though formulation (4) is simple, it is a somewhat limited parameterization insofar as it requires that points explicitly exist in a vector space. To begin to generalize this, suppose instead of ψi ◦ ψj, that any pair xi, xj in x has a corresponding pairwise vector ψij ∈ R^N:

    argmax_y \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \langle w, \psi_{ij} \rangle .    (5)

If we then define a matrix K ∈ R^{m×m} with entries

    K_{ij} = \langle w, \psi_{ij} \rangle    (6)

we can view (5) as

    argmax_y \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} K_{ij} .    (7)

In this work, we assume for any K that the associated x and w are obvious in context.

Work in kernel k-means clustering often specifies that K is symmetric positive semi-definite, i.e., K ⪰ 0 [8]. Why? The items in the set x have representations in some (implicit) vector space if and only if K ⪰ 0 [15]. This is relevant to our setting, since the proof of convergence for batch k-means clustering depends on the existence of this space, and it may not converge without it [15].

How can we ensure K ⪰ 0? Consider an alternate definition of K. For a given x, let K^(ℓ) ∈ R^{m×m} be the matrix of the ℓth pairwise feature in the pairwise ψij, i.e., K^(ℓ)_{ij} = ⟨e_ℓ, ψij⟩. We may then define K as K = \sum_{\ell=1}^{N} w_\ell K^{(\ell)}. Restricting w ≥ 0 and all K^(ℓ) ⪰ 0 will imply K ⪰ 0, since non-negative linear combinations of symmetric positive semi-definite (SPSD) matrices are likewise SPSD. This style of parameterization has strong connections to the field of kernel learning [15].

Enforcing w ≥ 0 is the responsibility of the training procedure, but the constraint on the features in the pairwise ψij is the responsibility of the practitioner providing these vectors. Fortunately, this is usually not difficult to satisfy. For example, the very common case with pairwise vectors ψij = ψi ◦ ψj seen in (5) satisfies the constraint. More generally, any features in ψij whose values come from kernel function evaluation over items xi, xj ∈ x satisfy the constraint.

3.2 Similarity Learning Parameterizations

The restrictions to enforce K ⪰ 0 pose practical disadvantages. First, for the user providing ψij pairwise feature vectors, ensuring that every single feature is a kernel may be difficult in some settings. Second, enforcing positivity constraints on w is bothersome insofar as it may complicate the parameter learning procedure, and it can even be unhelpful: it is plausible that some pairwise features are negatively correlated with common cluster membership. To take a canonical example, if one is clustering web pages, certain link relationships among pages are often strong indicators that pages are of different types [14]. With some effort, tricks may be employed to overcome some of these difficulties (for example, doubling features with positive and negative versions of the features to allow negative correlations, and diagonal offsets large enough to ensure K ⪰ 0), but this is troublesome and often confusing.

To avoid these problems, the alternative to Section 3.1's restrictions is to simply lift them, i.e., accept any ψij pairwise vectors and parameterization w. The cost of this greater simplicity and flexibility is that the resulting K is often no longer SPSD (see Footnote 2). This is not a major problem, but it does restrict us to clustering algorithms robust to K ⋡ 0.

Footnote 2: Though "kernel k-means" becomes a bit of a misnomer in this case, we retain its use, as the name for the representation is an established term.
3.3 Nonlinear Parameterizations

The preceding discussion has considered w to be a real vector w ∈ R^N, but it may also be considered a non-linear parameterization vector. We may view w as a linear combination of the pairwise vectors seen in training, i.e., w = \sum_{\hat{i},\hat{j}} \alpha_{\hat{i}\hat{j}} \psi_{\hat{i}\hat{j}}. In this case, our parameterized pairwise similarity score becomes

    K_{ij} = \langle w, \psi_{ij} \rangle = \sum_{\hat{i},\hat{j}} \alpha_{\hat{i}\hat{j}} \langle \psi_{\hat{i}\hat{j}}, \psi_{ij} \rangle    (8)

and we may replace the inner product ⟨ψ_{îĵ}, ψij⟩ with some kernel function κ(ψ_{îĵ}, ψij). This allows parameterizations to capture complex non-linear interrelationships among pairwise features.

4. SUPERVISED K-MEANS WITH SSVMS

With k-means parameterization defined as above, how do we actually learn a parameterization? We provide a supervised approach based on structural support vector machines, taking as input a training set

    S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.

Each xi ∈ X is a set of items and yi ∈ Y a complete partitioning of that set. For example, S could have xi as the noun-phrases in a document and yi as their partitioning into co-referent sets, or xi as images with yi as similar segments within the image, etc. The output of the learning algorithm is a w-parameterized hypothesis h : X → Y, where the clustering algorithm in h uses the w-parameterized similarity measure when clustering inputs x. Intuitively, the goal is to learn some w so that each h(xi) is close to yi on the training set, and so that h predicts the desired clustering also for unseen sets of items x.

4.1 Structural SVMs

Structural SVMs are a general method for learning hypotheses with complex structured output spaces [20]. From a training set S = ((x_1, y_1), ..., (x_n, y_n)), a structural SVM learns a hypothesis h : X → Y mapping inputs x ∈ X to outputs y ∈ Y, trading off model complexity and empirical risk. A hypothesis takes the form

    h(x) = argmax_{y \in Y} f(x, y),    (9)

maximizing a discriminant function f : X × Y → R with

    f(x, y) = \langle w, \Psi(x, y) \rangle .    (10)

The combined feature vector function Ψ relates inputs and outputs, and w is the model parameterization learned from S. The quality of hypotheses is evaluated using a loss function ∆ : Y × Y → R describing the extent to which two outputs differ. The Ψ and ∆ functions are task dependent.

Structural SVMs find a w that balances model complexity and empirical risk R_S(h) = \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, h(x_i)) by solving this quadratic program (QP) [20]:

Optimization Problem 1. (Structural SVM)

    \min_{w, \xi \ge 0} \; \frac{1}{2} \|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i    (11)

    \forall i, \forall y \in Y \setminus y_i : \; \langle w, \Psi(x_i, y_i) \rangle \ge \langle w, \Psi(x_i, y) \rangle + \Delta(y_i, y) - \xi_i .    (12)

Introducing constraints for all possible outputs is typically intractable. However, it has been shown that the cutting plane technique in Algorithm 1 can be used to efficiently solve OP 1 to arbitrary precision ε.

Algorithm 1 Cutting plane algorithm to solve OP 1.
 1: Input: (x_1, y_1), ..., (x_n, y_n), C, ε
 2: S_i ← ∅ for all i = 1, ..., n
 3: repeat
 4:   for i = 1, ..., n do
 5:     H(y) ≡ ∆(y_i, y) + ⟨w, Ψ(x_i, y)⟩ − ⟨w, Ψ(x_i, y_i)⟩
 6:     compute ŷ = argmax_{y ∈ Y} H(y)
 7:     compute ξ_i = max{0, max_{y ∈ S_i} H(y)}
 8:     if H(ŷ) > ξ_i + ε then
 9:       S_i ← S_i ∪ {ŷ}
10:       w ← optimize primal over ∪_i S_i
11:     end if
12:   end for
13: until no S_i has changed during iteration

This algorithm iteratively finds the most violated constraint with a separation oracle (line 6), adds it to a working set ∪_i S_i if it is violated by more than the desired precision ε (line 9), and re-solves the QP to find a new parameterization w (line 10). Algorithm 1 terminates when no new constraint is found, that is, when all constraints in OP 1 are satisfied within ε. We will discuss the computational complexity and accuracy of this algorithm for supervised k-means learning in Section 5.

To use structural SVMs to learn parameterizations for k-means clustering, we must (1) state our clustering procedure h(x) in terms of h(x) = argmax_y ⟨w, Ψ(x, y)⟩, (2) provide a loss function ∆(y, ŷ), and (3) provide the separation oracle argmax_{y∈Y} ⟨w, Ψ(x_i, y)⟩ + ∆(y_i, y). These are explained in the following three sections.
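The following Python sketch mirrors the control flow of Algorithm 1. It is only a schematic, not the paper's implementation (which uses the SVMpython package): Ψ, ∆, and the separation oracle are passed in as functions, and a plain subgradient method stands in for the exact QP solve over the working set.

    import numpy as np

    def solve_working_set_qp(constraints, C, n, dim, steps=2000, lr=0.01):
        """Approximately solve OP 1 restricted to the working-set constraints.

        Each constraint is a tuple (i, dpsi, loss) encoding
        <w, dpsi> >= loss - xi_i, where dpsi = Psi(x_i, y_i) - Psi(x_i, yhat).
        A simple subgradient method is used here in place of an exact QP solver.
        """
        w = np.zeros(dim)
        for _ in range(steps):
            grad = w.copy()                      # gradient of (1/2)||w||^2
            # Hinge terms: (C/n) * the most violated constraint of each example.
            by_example = {}
            for i, dpsi, loss in constraints:
                viol = loss - w @ dpsi
                if viol > by_example.get(i, (0.0, None))[0]:
                    by_example[i] = (viol, dpsi)
            for viol, dpsi in by_example.values():
                if viol > 0:
                    grad -= (C / n) * dpsi
            w -= lr * grad
        return w

    def cutting_plane(train, psi, delta, separation_oracle, dim, C=1.0, eps=0.1):
        """Skeleton of Algorithm 1. `train` is a list of (x_i, y_i) pairs."""
        n = len(train)
        working = [[] for _ in range(n)]   # the working sets S_i
        flat = []                          # all constraints, tagged with example index
        w = np.zeros(dim)
        changed = True
        while changed:
            changed = False
            for i, (x, y) in enumerate(train):
                yhat = separation_oracle(w, x, y)                 # argmax_y H(y)
                H = delta(y, yhat) + w @ psi(x, yhat) - w @ psi(x, y)
                xi = max([0.0] + [delta(y, yb) + w @ psi(x, yb) - w @ psi(x, y)
                                  for yb in working[i]])
                if H > xi + eps:
                    working[i].append(yhat)
                    flat.append((i, psi(x, y) - psi(x, yhat), delta(y, yhat)))
                    w = solve_working_set_qp(flat, C, n, dim)
                    changed = True
        return w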
4.2 Combined Feature Function Ψ

We must express h(x) as h(x) = argmax_y ⟨w, Ψ(x, y)⟩. Working from (7) and (6),

    h(x) = argmax_{y \in Y} \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} K_{ij}
         ≡ argmax_{y \in Y} \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \langle w, \psi_{ij} \rangle
         ≡ argmax_{y \in Y} \Big\langle w, \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \psi_{ij} \Big\rangle .

So, Ψ(x, y) is

    \Psi(x, y) = \sum_{c \in y} \frac{1}{|c|} \sum_{i,j \in c} \psi_{ij}    (13)

for the most general parameterization of k-means.

In this work, we also want to represent and learn from "relaxed" clusterings, such as those that appear in methods like spectral clustering. More specifically, we shall provide a matrix representation of clusterings. Consider this alternate representation of clusterings y: for each partitioning y of m items into k clusters, let Y ∈ R^{m×k} be an equivalent alternate matrix representation of the clustering. Each column in Y corresponds to some cluster c ∈ y, where each element i in the column is |c|^{-0.5} if i ∈ c, and is 0 otherwise. For example, the following two clustering representations are equivalent:
    y = \{\{1, 3\}, \{2, 4, 5\}\}, \qquad
    Y = \begin{bmatrix}
        \frac{1}{\sqrt{2}} & 0 \\
        0 & \frac{1}{\sqrt{3}} \\
        \frac{1}{\sqrt{2}} & 0 \\
        0 & \frac{1}{\sqrt{3}} \\
        0 & \frac{1}{\sqrt{3}}
    \end{bmatrix} .

More formally, any matrix Y corresponding to a discrete clustering y will obey three basic constraints. First is column orthonormality: for any columns Y_{:,i} and Y_{:,j} of Y, ‖Y_{:,i}‖_2 = 1 and Y_{:,i}^T Y_{:,j} = 0, i.e., Y^T Y = I. Second is the requirement that a column's nonzero entries are equal: for any pair of entries Y_{j,i} ≠ 0 and Y_{ℓ,i} ≠ 0 of a column Y_{:,i}, Y_{j,i} = Y_{ℓ,i}. Third is that there are no negative entries: any entry Y_{j,i} ≥ 0.

With this new representation Y, we may rephrase (7) as

    argmax_Y \; \mathrm{trace}(Y^T K Y).    (14)

We can phrase the objective in terms of (10) to get Ψ(x, Y):

    h(x) = argmax_Y \; \mathrm{trace}(Y^T K Y)
         ≡ argmax_Y \; \Big\langle w, \sum_{i=1}^{m} \sum_{j=1}^{m} \big(Y_{i,:} Y_{j,:}^T\big) \psi_{ij} \Big\rangle .

So, Ψ(x, Y) is

    \Psi(x, Y) = \sum_{i=1}^{m} \sum_{j=1}^{i-1} \big(Y_{i,:} Y_{j,:}^T\big) \psi_{ij} .    (15)

Note that (15) generalizes (13) insofar as the two are equal for any Y corresponding to y, and (15) is defined for any spectral output Y.

As an aside, that Ψ(x, Y) is quadratic in the entries of Y brings up a subtle but important distinction about the generality of structural SVMs versus alternative formulations of OP 1, like max-margin Markov nets (M3N) [19] and associative Markov nets and their variants [18]. These alternatives require that "inference" (in this case, k-means clustering) be phrased as either a Markov random field or a linear program, respectively. One could begin to express the quadratic nature of Y as pairwise cliques in an MRF for M3N, or linearize clustering by optimizing Z = YY^T for associative networks. However, these methods would be incapable of feasibly capturing that Y must have orthonormal columns, or the rank(Z) = k constraint on Z. In contrast, the restriction of the structure and number of columns of Y, the nonlinearity of Y in Ψ, and the nonlinearity of the clustering procedure are all incidental and naturally expressed in structural SVMs, since the structure of Ψ(x, y) is unrestricted.

4.3 Loss Function ∆

The ∆ loss function for the dissimilarity between two clusterings that we use in this work is

    \Delta(Y, \hat{Y}) = 100 \cdot \Big(1 - \frac{1}{k} \mathrm{trace}(Y^T \hat{Y} \hat{Y}^T Y)\Big)    (16)
                       = 100 \cdot \Big(1 - \frac{1}{k} \|Y^T \hat{Y}\|_F^2\Big).    (17)

For Y corresponding to a discrete partitioning y, (16) equals

    \Delta(y, \hat{y}) = 100 \cdot \Big(1 - \frac{1}{k} \sum_{c \in y} \sum_{\hat{c} \in \hat{y}} \frac{|c \cap \hat{c}|^2}{|c| \cdot |\hat{c}|}\Big).    (18)

This loss ∆ has attractive qualities. It is symmetric and invariant to column rearrangements. Also, as seen in (18), ∆ essentially counts agreement among pairs of items in clusters, normalized by the size of the clusters in question. This is favorable relative to alternate loss functions based on the Rand index [17] used in previous supervised clustering work [10]: where this normalization is absent, the loss becomes heavily biased against mistakes in larger clusters. Finally, though any judgment about the appropriateness of a loss function must necessarily be subjective, this ∆ appears to give qualitatively sensible judgments about the similarity of two clusterings.
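As an illustration (not taken from the paper's code), the sketch below builds the indicator matrix Y of Section 4.2 from a discrete clustering and evaluates ∆ both through the matrix form (17) and the pairwise form (18), which agree on discrete clusterings.

    import numpy as np

    def indicator_matrix(clustering, m):
        """Y in R^{m x k}: one column per cluster, entries |c|^{-1/2} for members, 0 otherwise."""
        Y = np.zeros((m, len(clustering)))
        for col, c in enumerate(clustering):
            Y[list(c), col] = 1.0 / np.sqrt(len(c))
        return Y

    def loss_matrix(Y, Yhat, k):
        """Delta via (17): 100 * (1 - ||Y^T Yhat||_F^2 / k)."""
        return 100.0 * (1.0 - np.linalg.norm(Y.T @ Yhat, "fro") ** 2 / k)

    def loss_pairwise(y, yhat, k):
        """Delta via (18), directly on the discrete partitionings."""
        s = sum(len(set(c) & set(ch)) ** 2 / (len(c) * len(ch)) for c in y for ch in yhat)
        return 100.0 * (1.0 - s / k)

    y    = [[0, 2], [1, 3, 4]]          # the example y = {{1, 3}, {2, 4, 5}}, 0-indexed
    yhat = [[0, 1], [2, 3, 4]]
    k, m = 2, 5
    Y, Yhat = indicator_matrix(y, m), indicator_matrix(yhat, m)
    print(loss_matrix(Y, Yhat, k), loss_pairwise(y, yhat, k))   # the two forms agree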
4.4 Separation Oracle and Prediction

For the separation oracle argmax_{y∈Y} ⟨w, Ψ(x_i, y)⟩ + ∆(y_i, y), the form of ∆ is well suited to constructing the separation oracle: one can employ a clustering algorithm as the separation oracle and cluster over the matrix (K − (1/k) Y Y^T) in place of K in the objective (7).

However, finding the actual clustering y that globally maximizes (7), either for prediction or for computing the most violated constraint, is an NP-hard problem. This has led to the adoption of many varied approximate algorithms to maximize this objective function. The survey in [8] characterizes many of the popular clustering algorithms that approximate the maximization of the discriminant function (7). We use three methods from that paper that are all robust to K ⋡ 0. We denote these differing methods as Iterative, Spectral, and Discrete. In prediction, one could use other clustering methods if one conformed to the SPSD restrictions on K as defined in Section 3.1, including batch k-means, normalized cut algorithms, etc. In the separation oracle, however, we must use these robust methods: even with K ⪰ 0, it is quite possible that (K − (1/k) Y Y^T) ⋡ 0.

4.4.1 Iterative Point-Incremental Clustering

Iterative is point-incremental k-means [7]. We use point-incremental (i.e., recomputing cluster centers with each point reassignment) and not standard batch (i.e., recomputing cluster centers after a pass over all points) k-means, since K easily becomes non-SPSD without positivity constraints on w's elements, breaking batch k-means' convergence guarantees.

The algorithm works by randomly assigning all m items to k clusters, and then iterating over all points, reassigning them to the cluster with the "closest" cluster center. Unlike typical batch k-means clustering, which waits until a pass is completed before updating cluster centers, point-iterative k-means updates the centers upon each point reassignment.
Compared to batch k-means, point-iterative k-means does not depend upon K ⪰ 0 and tends to produce clusterings with lower intracluster distance [7].

4.4.2 Spectral Clustering

Spectral is a straightforward eigenanalysis of K to produce a "relaxed" clustering in the matrix representation Y described in Section 4.2. If we relax all of Section 4.2's constraints on Y except for having orthonormal columns, then the optimization problem

    argmax_Y \; \mathrm{trace}(Y^T K Y)

over this multi-vector Rayleigh quotient may be maximized by assigning Y's columns as the eigenvectors corresponding to the k largest eigenvalues of K. This eigenvector matrix is a relaxed "clustering" in that we have relaxed the requirements for the special structure of Y listed in Section 4.2 that ensured it corresponded to some discrete clustering y.

4.4.3 Discretized Spectral Clustering

Discrete is a discretized spectral method via Bach and Jordan post-processing [1], and is a combination of the previous methods: once we have our eigenvector matrix Ȳ, we cluster K̄ = Ȳ Ȳ^T with point-incremental k-means to find a discrete y.
straints in OP 1 may be violated, slack no longer bounds
5. THEORETICAL ANALYSIS empirical risk, thus eroding one of the basic principles of
Structural SVMs have three major important theoretical SVM learning. On the other hand, with overconstrained
characteristics, including polynomial time termination in the learning, Algorithm 1 solves a problem which accounts for
number of iterations of Algorithm 1, P correctness insofar as outputs that would never arise from a discrete clustering al-
Algorithm 1 solves OP 1, and that n1 n i=1 ξi upper bounds
gorithm, thus unnecessarily ruling out parameterizations w
empirical risk [20]. We will now discuss how far they hold which may yield superior performance. It is unclear theoret-
for supervised k-means algorithms. ically whether either way is better, so our experiments shall
provide an empirical evaluation of both underconstrained
There is one subtle but important point that arises from and overconstrained learning.
using approximations in the separation oracle: the known
performance guarantees for Algorithm 1 are known to apply 6. EMPIRICAL ANALYSIS
only to the case where the separation oracle argmaxy∈Y H(y)
We implemented supervised k-means clustering with the
is calculated exactly [20]. In Section 4.4 we constructed our
SVMpython structural SVM package [9]. The module’s code,
separation oracle from a clustering algorithm, but because
instructions and examples of use, as well as the datasets that
clustering algorithms are approximations, this may not find
we used in our experiments, are accessible from:
the globally optimal y. What can we still guarantee about
http://www.cs.cornell.edu/~tomf/projects/supervisedkmeans/.
our supervised k-means algorithms?
To empirically analyze our methods, we compare it to naively
Consider the space of possible clusterings Y for training
trained and untrained clusterers, and also provide compar-
example (xi , yi ). During training, the ideal clusterer sep-
isons of our methods using underconstrained and overcon-
aration oracle would find the true maximizing clustering
strained learning on real and synthetic datasets. Parame-
y∗ = argmaxy∈Y hw, Ψ(xi , y)i+∆(yi , y). (To reiterate, un-
terizations w and pairwise vectors ψij are unconstrained as
der this ideal case, Algorithm 1 is guaranteed to solve OP 1.)
outlined in Section 3.2, i.e., not requiring K º 0.
However, this ideal is unrealizable. So what happens when
we use one of our approximations?
In all experiments, pairwise feature vectors ψij are composed
from “node” features vectors ψ̄i , ψ̄j ∈ RNn and an explicitly
Let us first consider polynomial time termination. The poly-
provided pairwise feature vector ψ̄ij ∈ RNp such that
nomial time termination guarantee still holds, since the proof
» –
does not depend on the quality of the solution, but rather on ψ̄i ◦ ψ̄j
the idea that any constraint violated by more than ² must ψij = .
ψ̄ij
increase the objective by some minimum amount [20].
Pairwise feature vectors ψij are in RN where N = Nn +
Correctness and empirical risk are less easy to deal with. Np , and correspondingly we have w ∈ RN . Some datasets
The separation oracles can be divided into two broad cate- evaluated have no node or explicit pairwise features, i.e.,
gories according to what they do solve, depending on whether sometimes Nn = 0 or Np = 0.
Table 1: Dataset statistics, including number of example clusterings n, number of clusters k in each example clustering, average number of points m in the clusterings, node features Nn, and pairwise features Np. (The SSVM learns N = Nn + Np weights in w.)

    Dataset     n   k    Avg. m   Nn      Np
    WebKB-L     4   6    1041     50397   100796
    WebKB-N     4   6    1041     41131   0
    News 8-1    7   10   150      0       30
    News 8-2    7   10   150      0       30
    News 8-4    7   10   150      0       30
    Synth       5   5    100      0       750

6.1 Datasets

We used three general "families" of datasets in our empirical analysis, from which we drew one or more specific evaluation datasets. The datasets are listed in Table 1.

6.1.1 WebKB Dataset

WebKB consists of web pages retrieved from the computer science departments of four universities, labeled as being a course web page, faculty page, student page, etc. [5]. It is often used in classification and multiclass classification tasks that seek to exploit the link structure among the web documents. In our experiment, we effectively turned this into two closely related datasets.

One of these datasets (WebKB-N) contains only node features, as TFIDF-scaled unigram word count vectors. There are no pairwise features.

The other dataset (WebKB-L) contains these word count features and additional features relating to the relationships among these documents, and also, critically, a pairwise feature vector with two regions, one corresponding to documents where one document links to another, and another where both are linked from the same document (co-citation). If documents are linked or co-cited, the respective region in the pairwise feature vector will contain the componentwise product of the node features, plus a single 1 indicator feature. If they are not linked or co-cited, the corresponding region is zeroed.

6.1.2 News Dataset

News is a dataset related to the news article clustering dataset of [10]. The sets of items and partitionings were collected by trawling Google News for one day and extracting the text of news articles from the linked news sites. Google News has seven major areas (Business, Entertainment, Health, Nation, Sports, Technology, World). Each area serves as a clustering, with each individual news story comprising the individual clusters within the area, and the individual articles within each story being the items within the cluster. The data for all points is expressed as a pairwise feature vector, where each feature is the cosine similarity of TFIDF weighted token vectors; these token vectors are unigrams, bigrams, and trigrams of text in the title, article text, and quoted sections of the article text, in both original and Porter stemmed versions of the features, for 30 features in all. We sampled from three days (August 1, 2, and 4 of 2004) to get three datasets (News 8-1, News 8-2, and News 8-4).

6.1.3 Synth Dataset

Synth is a synthetic dataset meant to emphasize the importance of some features being harmful and others helpful, in the face of significant noise. It was generated in this way: there are 5 clusters, each with 20 points. Between every pair of the 100 points is a pairwise feature vector. This pairwise feature vector is comprised of 15 "regions" (one for each possible cluster pair), each region with 50 features (so 750 pairwise features total). For a pair of points in clusters i and j, the feature "region" corresponding to i, j will have 5 of the 50 features active. Also, noise is introduced for each pairwise feature vector (see Footnote 3): instead of consistently indexing the region (i, j), it will 20% of the time replace i with a random cluster (so 16% of the time it will differ from i), and the same for j. So, only about 70.5% of pairwise vectors have the "correct" index. Only one dataset was generated.

Footnote 3: Without noise, learned clusterers produced perfect clusterings. While useful as a sanity check, it makes for uninteresting comparisons.
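A generator in the spirit of this description might look as follows. The text does not specify which 5 of a region's 50 features are activated, so the choice below is arbitrary; everything else follows the description above.

    import itertools
    import numpy as np

    def synth_pairwise_vector(ci, cj, rng, n_clusters=5, region_size=50, n_active=5, flip=0.2):
        """Pairwise feature vector for a point pair whose true clusters are ci and cj.

        15 regions, one per unordered cluster pair (including same-cluster pairs),
        50 features each (750 total); the region indexed by the (possibly
        noise-corrupted) cluster pair gets 5 active features.  Which 5 of the 50
        are activated is not specified in the text, so the first 5 are used here.
        """
        # 20% of the time, independently replace each index with a random cluster.
        if rng.random() < flip:
            ci = rng.integers(n_clusters)
        if rng.random() < flip:
            cj = rng.integers(n_clusters)
        regions = list(itertools.combinations_with_replacement(range(n_clusters), 2))
        r = regions.index(tuple(sorted((int(ci), int(cj)))))
        vec = np.zeros(len(regions) * region_size)
        vec[r * region_size : r * region_size + n_active] = 1.0
        return vec

    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(5), 20)            # 5 clusters of 20 points each
    pairs = [(i, j) for i in range(100) for j in range(i)]
    psi = {(i, j): synth_pairwise_vector(labels[i], labels[j], rng) for i, j in pairs}
    print(len(psi), next(iter(psi.values())).shape)   # 4950 pairwise vectors in R^750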
Table 2: Range of C values tested during the LOO search for training hyperparameters. All powers of ten between and including these endpoints were considered.

    Dataset    Low C     High C        Dataset   Low C     High C
    WebKB-L    1·10^-1   1·10^4        News      1·10^0    1·10^5
    WebKB-N    1·10^0    1·10^5        Synth     1·10^-2   1·10^3

6.2 Experimental Setup

To evaluate performance, we trained k-means parameterizations on our datasets. For each dataset of n clustering examples, we ran n experiments, where each clustering was taken in turn as the single-example "test set" with the n − 1 remaining clusterings as the training set. For each experiment, LOO cross validation was used on the n − 1 size training set to choose the two training hyperparameters: C (values drawn from a sample of powers of 10 seen in Table 2), and which clusterer to use as the final predictor (Iterative, Spectral, or Discrete).

The parameterizations were trained with the Iterative and Spectral separation oracle supervised k-means trainers. In addition to these supervised k-means clustering methods, we have two baselines.

Pair is a model training method based on binary classifiers: it takes all pairwise feature vectors, considers whether the associated pair is in the same cluster, and treats this as a binary classification problem trained for accuracy. During classification, entries in the similarity matrix K are outputs of the learned binary classifier. This style of supervised clustering using binary classifiers has been successfully used in work on noun-phrase coreference resolution [16]. The resulting training method differs from supervised k-means clustering insofar as the clustering procedure and desired ∆ are not considered in training, but it will still try to increase or decrease the similarity of pairs in or out of the same cluster, respectively. Hyperparameters (C and clusterer in prediction) were selected in an identical fashion to supervised k-means clustering.
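A sketch of the Pair baseline is below. The text does not specify which binary classifier was used, so a linear SVM from scikit-learn stands in here as an assumption; K is filled with the classifier's decision values.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_pair_baseline(pairwise_vectors, same_cluster_labels):
        """Fit a binary classifier on pairwise vectors (1 = same cluster, 0 = different).

        The paper's exact classifier and its settings are not given in the text;
        a linear SVM is used here as a stand-in.
        """
        clf = LinearSVC()
        clf.fit(np.asarray(pairwise_vectors), np.asarray(same_cluster_labels))
        return clf

    def similarity_matrix(clf, psi, m):
        """Fill K with decision values for every item pair.

        `psi` maps an ordered pair (i, j) with i > j to its pairwise vector."""
        K = np.zeros((m, m))
        for i in range(m):
            for j in range(m):
                if i != j:
                    vec = psi[(max(i, j), min(i, j))].reshape(1, -1)
                    K[i, j] = clf.decision_function(vec)[0]
        return K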
None is a second baseline, which consists of Iterative clustering with all equal weights, that is, no training at all.

6.3 Clustering Accuracy

Table 3 details the loss figures resulting from training the clusterer with the Iterative and Spectral separation oracles (columns Iterative and Spect), training the clusterer with the pairwise binary classifier (column Pair), and with no training (column None). While loss ∆ values can reach 100, a more reasonable upper bound is (k−1)/k · 100, the loss resulting from putting all points together in 1 cluster; a short derivation appears after Table 3.

Table 3: Loss ∆ on various datasets (lower is better). The left columns identify the dataset and the particular clustering used as the test dataset in the corresponding row.

    Dataset    Test Clustering   Iterative   Spect   Pair   None
    WebKB-L    Cornell           45.3        53.3    79.7   74.7
               Texas             59.8        56.7    78.9   72.8
               Washington        53.1        46.6    60.6   76.2
               Wisconsin         47.3        60.2    81.1   77.5
    WebKB-N    Cornell           63.0        61.4    74.8   78.6
               Texas             69.9        56.8    75.5   78.7
               Washington        68.8        58.2    74.9   78.3
               Wisconsin         72.6        66.2    77.0   78.6
    News 8-1   Business          23.7        20.6    45.2   49.5
               Entertainment     12.7        22.2    53.0   25.9
               Health            28.1        28.7    57.4   38.8
               Nation            3.8         3.8     40.2   14.6
               Sports            15.2        14.3    47.6   59.9
               Technology        35.9        30.4    51.7   37.3
               World             3.7         2.4     41.7   62.1
    News 8-2   Business          3.6         4.6     34.1   63.8
               Entertainment     22.7        9.5     40.1   22.8
               Health            20.4        20.4    48.4   43.9
               Nation            24.6        23.7    47.4   60.6
               Sports            20.2        15.8    59.3   57.0
               Technology        16.1        13.8    48.3   41.3
               World             12.2        11.9    50.5   70.4
    News 8-4   Business          19.7        14.9    42.7   33.5
               Entertainment     4.6         6.3     46.8   32.4
               Health            15.0        16.2    51.7   32.1
               Nation            19.4        20.3    41.2   30.0
               Sports            19.0        19.0    55.6   54.7
               Technology        5.8         11.6    46.4   37.6
               World             4.8         5.8     39.6   39.3
    Synth      1                 43.3        55.6    48.1   74.7
               2                 53.4        58.7    54.7   74.7
               3                 56.0        56.7    55.2   74.7
               4                 39.3        59.5    43.9   74.7
               5                 40.3        63.4    49.1   74.7
items i, j ∈ x with low similarity Kij can still be in the
same cluster owing to the effect of other items in x. In
contrast, the baseline pairwise classifier treats all judgments
on pairwise φij independently. However, since all φij are
generated independently in the synthetic dataset and there The WebKB-L dataset differs from WebKB-N in that it con-
is no long range dependency structure to exploit, pairwise tains pairwise features relevant to the hyperlink structure
classification for training w works fine. in the corpus, whereas WebKB-N are straightforward docu-
ment vectors. Each of the 8 supervised k-means clustering
The untrained model does quite poorly in Synth, but this is WebKB-L trained models outperform their corresponding
expected since the dataset was generated specifically to con- WebKB-N trained model. While 8 wins to 0 losses is statis-
tain large numbers of pairwise features correlated negatively tically significant under a sign test, these loss ∆ figures are
with co-cluster membership. not independent; nevertheless, the magnitude of the differ-
ences, always over 10 in the case of Iterative trained models,
suggests a substantial gain. As the usefulness of exploiting
6.3.2 Discrete Iterative vs. Relaxed Spectral hyperlink structure in WebKB is a feature of most papers
How does discrete Iterative compare against the relaxed featuring this dataset, it is important that our methods are
Spectral when used as a separation oracle during training? able to handle definitions of these general pairwise features.

We use non-parametric tests like Fisher sign or Wilcoxon


signed-rank tests. Whie the loss figures are not independent 6.4 Computation Time
since they result from shared training sets, we accept these Clustering performance aside, how does training time de-
non-parametric tests as an imperfect measure that never- pend on characteristics of the dataset? To answer this ques-
theless gives some indication of difference. tion empirically, we took the basic Synth dataset described
in Section 6.1. The basic dataset has 5 clustering exam-
Results of the comparison are seen in Table 4. These results ples, 5 clusters, 750 features, and 100 points. To test the
reflect the feeling one might get glancing at Table 3: there algorithms in a controlled way, we varied each of these char-
is no clear winner in WebKB or News. The exception is the acteristics (examples, clusters, features, points), and trained
Synth synthetic data set, where the Iterative trained model over 20 training sets to test the time it took to train a model.
appears to yield superior performance. Results are reported for both Iterative and Spectral cluster-
ing. The regularization parameter C = 104 was constant in
6.3.3 WebKB-N versus WebKB-L all training methods.
Table 4: Counts of the times within Table 3 that the Iterative trained model won, tied, or lost versus the Spectral trained model, respectively.

    Dataset    Win   Tie   Lose   W    ns/r   P1-tail
    WebKB-L    2     0     2      4    4      >0.05
    WebKB-N    0     0     4      4    4      >0.05
    News       8     3     10     30   18     0.2611
    Synth      5     0     0      15   5      0.05

[Figure 1: Training time versus number of example clusterings in the training set.]

As we increase the number of training example clusterings in our training set, Figure 1 reveals a relationship that is linear for Spectral and approximately linear for Iterative. That training time is linear in the number of training examples is expected [12, 13].

[Figure 2: Training time versus number of clusters in each example.]

Figure 2 shows that increasing the number of clusters while holding other statistics constant leads to a steady decrease in training time for Spectral trained methods. This appears to be a symptom of the difficulty of learning this dataset: the number of points and dimensions is constant, but spread over an increasing number of clusters in each example. Consequently the best hypothesis that can be reasonably extracted from the provided data becomes weaker, and fewer iterations are required to converge. The Iterative method, on the other hand, often takes longer. Logs reveal this is due to one or two iterations where Iterative as separation oracle took a very long time to converge, explaining the unstable nature of the curve.

[Figure 3: Training time versus number of features.]

Figure 3 shows a linear relationship of number of features versus training time. This linear time relationship is unsurprising given that computing similarities and Ψ is linear in the number of features.

[Figure 4: Training time versus number of points.]

Figure 4 shows Spectral time complexity as a straightforward polynomially increasing curve (due to the LAPACK DSYEVR eigenpair procedure working on steadily larger matrices). Training time for the Iterative trained method also tends to increase with the number of points, with a hump at lower numbers of points arising from Iterative clustering often requiring more time for the clusterer to converge on smaller datasets, a tendency reversed as more points presumably smooth the search space.
One theme seen throughout these experiments is that the timing behavior of relaxed spectral training is very predictable relative to the discrete k-means training. Considering the somewhat unpredictable nature of local search versus largely deterministic matrix computations, it is unsurprising to see the latter's relative stability carry over into model training time.

7. CONCLUSIONS

We provided a means to parameterize the popular canonical k-means clustering algorithm based on learning a similarity measure between item pairs, and then provided a supervised k-means clustering method to learn these parameterizations using a structural SVM. The supervised k-means clustering method learns this similarity measure based on a training set of item sets and complete partitionings over those sets, choosing parameterizations optimized for good performance over the training set.

We then theoretically characterized the learning algorithm, drawing a distinction between the iterative local search k-means clustering method and the relaxed spectral method, as leading to underconstrained and overconstrained supervised k-means clustering learners, respectively. Empirically, the supervised k-means clustering algorithms exhibited superior performance compared to naive pairwise learning or unsupervised k-means. The underconstrained and overconstrained supervised k-means clustering learners exhibited different performance compared to each other, though neither was clearly and consistently superior to the other. We also characterized the runtime behavior of both supervised k-means clustering learners through an empirical analysis on datasets with varying numbers of examples, clusters, features, and items to cluster. We find that training time is linear or better in the number of example clusterings, clusters per example, and number of features.

8. ACKNOWLEDGMENTS

This work was supported under NSF Award IIS-0713483 "Learning Structure to Structure Mapping," and through a gift from Yahoo! Inc.

9. REFERENCES

[1] F. R. Bach and M. I. Jordan. Learning spectral clustering. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, NIPS. MIT Press, 2003.
[2] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2002.
[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In ACM SIGKDD 2004, pages 59–68, August 2004.
[4] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, New York, NY, USA, 2004. ACM Press.
[5] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI '98/IAAI '98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 509–516, Menlo Park, CA, USA, 1998. American Association for Artificial Intelligence.
[6] T. De Bie, M. Momma, and N. Cristianini. Efficiently learning the metric using side-information. In ALT 2003, volume 2842, pages 175–189. Springer, 2003.
[7] I. S. Dhillon, Y. Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In ICDM '02: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), page 131, Washington, DC, USA, 2002. IEEE Computer Society.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas Dept. of Computer Science, 2005.
[9] T. Finley. SVMpython, 2007. Software at http://www.cs.cornell.edu/~tomf/svmpython2/.
[10] T. Finley and T. Joachims. Supervised clustering with support vector machines. In ICML, 2005.
[11] P. Haider, U. Brefeld, and T. Scheffer. Supervised clustering of streaming data for email batch detection. In ICML, pages 345–352, New York, NY, USA, 2007. ACM.
[12] T. Joachims. Training linear SVMs in linear time. In KDD, pages 217–226, New York, NY, USA, 2006. ACM.
[13] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. In Under Submission, 2007. Temporarily at www.cs.cornell.edu/~tomf/publications/linearstruct07.pdf.
[14] J. M. Kleinberg. Hubs, authorities, and communities. ACM Comput. Surv., page 5.
[15] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. In ICML '02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 323–330, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
[16] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution. In ACL-02, pages 104–111, 2002.
[17] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850, 1971.
[18] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, page 102, New York, NY, USA, 2004. ACM.
[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 16, 2003.
[20] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
