A Spectroscopy of Texts for Effective Clustering

Li, Wenyuan; Ng, Wee-Keong; Ong, Kok-Leong; Lim, Ee-Peng

doi:10.1007/978-3-540-30116-5_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3202)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Many clustering algorithms, such as k-means, EM, and CLOPE, require the user to set parameters that directly or indirectly control the number of clusters returned. Because data characteristics and analysis contexts vary, it is often difficult for the user to estimate the number of clusters in a data set. This is especially true in text collections such as Web documents, images or biological data. The fundamental question this paper addresses is: “How can we effectively estimate the natural number of clusters in a given text collection?” We propose spectral analysis, which examines the eigenvalues (not the eigenvectors) of the collection, as a solution. We first present the relationship between a text collection and its underlying spectra. We then show how the answer to this question enhances the clustering process. Finally, we conclude with empirical results and related work.

spectroscopy n. the study of spectra or spectral analysis.
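
The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of one common eigenvalue-based heuristic (the eigengap), not the authors' method. It assumes a document-term matrix of nonnegative weights, cosine similarity between documents, and a symmetric degree normalization; the function name `estimate_num_clusters` and the parameter `max_k` are hypothetical. It illustrates the general idea that the eigenvalues alone, without the eigenvectors, can suggest a cluster count.

```python
import numpy as np


def estimate_num_clusters(doc_term, max_k=10):
    """Guess the number of clusters from the largest eigengap.

    doc_term: (n_docs, n_terms) array of nonnegative term weights,
    with at least two documents (assumed input format).
    Returns (k, eigenvalues sorted in descending order).
    """
    X = np.asarray(doc_term, dtype=float)

    # Unit-normalize rows so that X @ X.T is the cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave empty documents as zero vectors
    X = X / norms
    S = X @ X.T

    # Symmetric normalization D^{-1/2} S D^{-1/2}, D = diag(row sums).
    d = S.sum(axis=1)
    d[d == 0] = 1.0
    scale = 1.0 / np.sqrt(d)
    N = S * scale[:, None] * scale[None, :]

    # Only eigenvalues are needed; eigvalsh returns them ascending.
    eigvals = np.linalg.eigvalsh(N)[::-1]

    # Largest drop among the leading max_k + 1 eigenvalues.
    top = eigvals[: max_k + 1]
    gaps = top[:-1] - top[1:]
    return int(np.argmax(gaps)) + 1, eigvals


if __name__ == "__main__":
    # Toy collection: three groups of four documents with disjoint vocabularies.
    pattern = np.array([[2, 1, 1], [1, 2, 1], [1, 1, 2], [1, 1, 1]])
    docs = np.zeros((12, 9))
    for g in range(3):
        docs[4 * g : 4 * g + 4, 3 * g : 3 * g + 3] = pattern
    k, spectrum = estimate_num_clusters(docs)
    print("estimated clusters:", k)
    print("leading eigenvalues:", np.round(spectrum[:5], 3))
```

In this sketch, well-separated groups produce several eigenvalues close to 1 followed by a sharp drop, and the position of that drop is taken as the estimate. Real text collections are noisier, so the gap is usually less pronounced than in the toy data.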

Author information

Authors and Affiliations

Centre for Advanced Information Systems, Nanyang Technological University, Nanyang Avenue, N4-B3C-14, 639798, Singapore
Wenyuan Li, Wee-Keong Ng & Ee-Peng Lim
School of Information Technology, Deakin University, Waurn Ponds, Victoria, 3217, Australia
Kok-Leong Ong

Editor information

Editors and Affiliations

INSA-Lyon, LIRIS CNRS UMR5205, F-69621, Villeurbanne, France
Jean-François Boulicaut
Dipartimento di Informatica, Università degli Studi di Bari
Floriana Esposito
Pisa KDD Laboratory, ISTI - CNR, Area della Ricerca di Pisa, Via Giuseppe Moruzzi 1, Pisa, Italy
Fosca Giannotti
Dipartimento di Informatica, Via F. Buonarroti 2, 56127, Pisa, Italy
Dino Pedreschi

About this paper

Cite this paper

Li, W., Ng, WK., Ong, KL., Lim, EP. (2004). A Spectroscopy of Texts for Effective Clustering. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Knowledge Discovery in Databases: PKDD 2004. Lecture Notes in Computer Science, vol 3202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30116-5_29

DOI: https://doi.org/10.1007/978-3-540-30116-5_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23108-0
Online ISBN: 978-3-540-30116-5
eBook Packages: Springer Book Archive
