Figure 1: Training time versus number of example clusterings in the training set.

Figure 2: Training time versus number of clusters in each example.

As we increase the number of training example clusterings in our training set, Figure 1 reveals a linear relationship for Spectral and an approximately linear one for Iterative. That training time is linear in the number of training examples is expected [12, 13].

Figure 2 shows that increasing the number of clusters while holding other statistics constant leads to a steady decrease in training time for Spectral-trained methods. This appears to be a symptom of the difficulty of learning this dataset: the number of points and dimensions is constant, but spread over an increasing number of clusters in each example. Consequently, the best hypothesis that can reasonably be extracted from the provided data becomes weaker, and fewer iterations are required to converge. The Iterative method, on the other hand, often takes longer. Logs reveal this is due to one or two iterations in which Iterative, acting as the separation oracle, took a very long time to converge, which explains the unstable shape of the curve.

Figure 3: Training time versus number of features.

Figure 3 shows a linear relationship between the number of features and training time. This linear relationship is unsurprising given that computing similarities and Ψ is linear in the number of features.

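To make this concrete, the minimal sketch below assumes an elementwise-product pairwise feature map φ(x_i, x_j) = x_i * x_j and an unnormalized Ψ that sums φ over same-cluster pairs (the paper's exact definitions may differ); under these assumptions, both the learned similarity w·φ(x_i, x_j) and the accumulation of Ψ cost time linear in the number of features per pair.

import numpy as np

def pairwise_phi(xi, xj):
    # Assumed pairwise feature map: elementwise product, O(d) in the
    # number of features d. The paper's exact map may differ.
    return xi * xj

def similarity(w, xi, xj):
    # Learned similarity between two items: a dot product, again O(d).
    return np.dot(w, pairwise_phi(xi, xj))

def psi(X, clusters):
    # Joint feature vector: sum of pairwise features over same-cluster
    # pairs (unnormalized here for simplicity). Each pair costs O(d),
    # so the whole computation is linear in the number of features.
    n, d = X.shape
    out = np.zeros(d)
    for i in range(n):
        for j in range(i + 1, n):
            if clusters[i] == clusters[j]:
                out += pairwise_phi(X[i], X[j])
    return out
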
Figure 4: Training time versus number of points.

Figure 4 shows Spectral time complexity as a straightforward polynomially increasing curve (due to the LAPACK DSYEVR eigenpair procedure working on steadily larger matrices). The Iterative-trained classifier also tends to increase with the number of points, with a hump at lower numbers of points arising from Iterative clustering often requiring more time for the clusterer to converge on smaller datasets, a tendency reversed as more points presumably smooth the search space.

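The eigensolver cost can be illustrated in isolation with the sketch below, which times scipy.linalg.eigh with driver='evr' (the SciPy interface to the LAPACK DSYEVR routine named above) on symmetric matrices of increasing size. The random matrices here are stand-ins for similarity matrices, not the experimental data.

import time
import numpy as np
from scipy.linalg import eigh

for n in (50, 100, 200, 400, 800):
    A = np.random.rand(n, n)
    A = (A + A.T) / 2.0  # symmetrize: a stand-in for an n x n similarity matrix
    start = time.perf_counter()
    eigh(A, driver='evr')  # driver='evr' dispatches to LAPACK's ?SYEVR
    print(n, round(time.perf_counter() - start, 4))
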
One theme seen throughout these experiments is that the timing behavior of relaxed spectral training is very predictable relative to the discrete k-means training. Considering the somewhat unpredictable nature of local search versus largely deterministic matrix computations, it is unsurprising to see the latter's relative stability carry over into model training time.

7. CONCLUSIONS

We provided a means to parameterize the popular canonical k-means clustering algorithm based on learning a similarity measure between item pairs, and then provided a supervised k-means clustering method to learn these parameterizations using a structural SVM. The supervised k-means clustering method learns this similarity measure based on a training set of item sets and complete partitionings over those sets, choosing parameterizations optimized for good performance over the training set.

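As a minimal illustration of what clustering with a learned pairwise similarity looks like, the sketch below builds a similarity matrix from a weight vector w (again assuming, for the sketch only, an elementwise-product pairwise feature map) and partitions items with a generic kernel k-means assignment step; it is not the exact procedure described in this paper.

import numpy as np

def learned_similarity_matrix(X, w):
    # K[i, j] = w . (x_i * x_j) under the assumed elementwise-product
    # feature map; w is what supervised training would provide.
    return (X * w) @ X.T

def kernel_kmeans(K, k, iters=20, seed=0):
    # Generic kernel k-means on a precomputed similarity matrix K.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(iters):
        dist = np.empty((n, k))
        for c in range(k):
            members = labels == c
            m = members.sum()
            if m == 0:
                dist[:, c] = np.inf
                continue
            # Squared distance to the cluster mean in feature space,
            # omitting the K[i, i] term that is constant across clusters.
            dist[:, c] = (-2.0 * K[:, members].sum(axis=1) / m
                          + K[np.ix_(members, members)].sum() / m ** 2)
        labels = dist.argmin(axis=1)
    return labels

Given item features X and learned weights w, kernel_kmeans(learned_similarity_matrix(X, w), k) returns a hard partitioning into k clusters.
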
We then theoretically characterized the learning algorithm, drawing a distinction between the iterative local search k-means clustering method and the relaxed spectral method, which lead to underconstrained and overconstrained supervised k-means clustering learners, respectively. Empirically, the supervised k-means clustering algorithms exhibited superior performance compared to naive pairwise learning or unsupervised k-means. Compared to each other, the underconstrained and overconstrained supervised k-means clustering learners exhibited different performance, though neither was clearly and consistently superior. We also characterized the runtime behavior of both supervised k-means clustering learners through an empirical analysis on datasets with varying numbers of examples, clusters, features, and items to cluster. We find training time that is linear or better in the number of example clusterings, clusters per example, and number of features.

8. ACKNOWLEDGMENTS
This work was supported under NSF Award IIS-0713483 “Learning Structure to Structure Mapping,” and through a gift from Yahoo! Inc.

9. REFERENCES
[1] F. R. Bach and M. I. Jordan. Learning spectral clustering. In S. Thrun, L. K. Saul, and B. Schölkopf, editors, NIPS. MIT Press, 2003.
[2] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56(1-3):89–113, 2002.
[3] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In ACM SIGKDD-2004, pages 59–68, August 2004.
[4] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML, New York, NY, USA, 2004. ACM Press.
[5] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI ’98/IAAI ’98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence, pages 509–516, Menlo Park, CA, USA, 1998. American Association for Artificial Intelligence.
[6] T. De Bie, M. Momma, and N. Cristianini. Efficiently learning the metric using side-information. In ALT 2003, volume 2842, pages 175–189. Springer, 2003.
[7] I. S. Dhillon, Y. Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In ICDM ’02: Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM’02), page 131, Washington, DC, USA, 2002. IEEE Computer Society.
[8] I. S. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas Dept. of Computer Science, 2005.
[9] T. Finley. SVMpython, 2007. Software at http://www.cs.cornell.edu/~tomf/svmpython2/.
[10] T. Finley and T. Joachims. Supervised clustering with support vector machines. In ICML, 2005.
[11] P. Haider, U. Brefeld, and T. Scheffer. Supervised clustering of streaming data for email batch detection. In ICML, pages 345–352, New York, NY, USA, 2007. ACM.
[12] T. Joachims. Training linear SVMs in linear time. In KDD, pages 217–226, New York, NY, USA, 2006. ACM.
[13] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Under submission, 2007. Temporarily at www.cs.cornell.edu/~tomf/publications/linearstruct07.pdf.
[14] J. M. Kleinberg. Hubs, authorities, and communities. ACM Comput. Surv., page 5.
[15] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 323–330, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
[16] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution. In ACL-02, pages 104–111, 2002.
[17] W. M. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(366):846–850, 1971.
[18] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, page 102, New York, NY, USA, 2004. ACM.
[19] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS 16. 2003.
[20] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.