Abstract
Many clustering algorithms, such as k-means, EM, and CLOPE, require the user to set parameters that directly or indirectly control the number of clusters returned. Because data characteristics and analysis contexts vary, users often find it difficult to estimate the number of clusters in a data set. This is especially true of collections such as Web documents, images, or biological data. The fundamental question this paper addresses is: “How can we effectively estimate the natural number of clusters in a given text collection?” We propose spectral analysis, which examines the eigenvalues (not the eigenvectors) of the collection, as the answer. We first present the relationship between a text collection and its underlying spectra. We then show how an estimate of the number of clusters enhances the clustering process. Finally, we conclude with empirical results and related work.
spectroscopy n. the study of spectra or spectral analysis.
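The idea of reading a cluster count off the eigenvalues can be illustrated with a small sketch. The abstract does not spell out the procedure, so everything below is an assumption made for illustration: documents are represented as term-count vectors, similarity is cosine similarity, the matrix is degree-normalized, and the estimate is taken at the largest gap between consecutive eigenvalues (the common "eigengap" heuristic, which may differ from the paper's actual criterion). The function names and the toy data are hypothetical.

```python
import math

def cosine_similarity_matrix(docs):
    """Pairwise cosine similarity of non-zero term-count vectors."""
    norms = [math.sqrt(sum(x * x for x in d)) for d in docs]
    return [[sum(x * y for x, y in zip(a, b)) / (na * nb)
             for b, nb in zip(docs, norms)]
            for a, na in zip(docs, norms)]

def degree_normalize(sim):
    """Scale S to D^(-1/2) S D^(-1/2), where D holds the row sums of S."""
    n = len(sim)
    deg = [sum(row) for row in sim]
    return [[sim[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def symmetric_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def estimate_num_clusters(docs):
    """Return (k, eigenvalues), with k placed at the largest eigengap."""
    spectrum = symmetric_eigenvalues(
        degree_normalize(cosine_similarity_matrix(docs)))
    gaps = [spectrum[i] - spectrum[i + 1] for i in range(len(spectrum) - 1)]
    k = max(range(len(gaps)), key=gaps.__getitem__) + 1
    return k, spectrum

# Hypothetical toy collection: six documents over a six-term vocabulary,
# written so that consecutive pairs of documents share vocabulary.
toy_docs = [
    [3, 1, 0, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [0, 0, 4, 1, 0, 0],
    [0, 0, 1, 3, 0, 0],
    [0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 1, 1],
]

k, spectrum = estimate_num_clusters(toy_docs)
print("estimated number of clusters:", k)
print("eigenvalues:", [round(v, 3) for v in spectrum])
```

Only the eigenvalues are used here, never the eigenvectors, which mirrors the emphasis in the abstract. The hand-written Jacobi routine keeps the sketch dependency-free; for a real collection a library solver such as `numpy.linalg.eigvalsh`, or a Lanczos-type method for large sparse matrices, would be the more practical choice.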
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, W., Ng, WK., Ong, KL., Lim, EP. (2004). A Spectroscopy of Texts for Effective Clustering. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. PKDD 2004. Lecture Notes in Computer Science(), vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_29
DOI: https://doi.org/10.1007/978-3-540-30116-5_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5