Document Clustering Using Linear Partitioning Hyperplanes and Reallocation

Kruengkrai, Canasai; Sornlertlamvanich, Virach; Isahara, Hitoshi

doi:10.1007/978-3-540-31871-2_4


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Included in the following conference series: Asia Information Retrieval Symposium


Abstract

This paper presents a novel algorithm for document clustering that combines the Principal Direction Divisive Partitioning (PDDP) algorithm [1] with a simplified version of the EM algorithm called the spherical Gaussian EM (sGEM) algorithm. The idea of the PDDP algorithm is to recursively split data samples into two sub-clusters using the hyperplane normal to the principal direction derived from the covariance matrix. However, the PDDP algorithm can yield poor results, especially when clusters are not well separated from one another. To improve the quality of the clustering results, we address this problem by reallocating cluster memberships using the sGEM algorithm with different settings. Furthermore, based on the theoretical background of the sGEM algorithm, we can naturally extend the framework to estimate the number of clusters using the Bayesian Information Criterion. Experimental results on two different corpora are given to show the effectiveness of our algorithm.
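The PDDP splitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `principal_direction` and `pddp_split`, the use of power iteration to approximate the principal direction, and the rule that a zero projection goes to the left side are all assumptions made for illustration. Documents are assumed to be dense term-weight vectors of equal length.

```python
import math


def principal_direction(rows, iters=200):
    """Approximate the leading eigenvector of the covariance of centered rows.

    Hypothetical helper: uses power iteration on X^T X without forming the
    covariance matrix. A start vector that happens to be orthogonal to the
    principal direction would not converge; this sketch ignores that case.
    """
    dim = len(rows[0])
    start = [1.0 + j for j in range(dim)]  # asymmetric deterministic start
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(iters):
        # proj = X v, then w = X^T proj
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate data (no variance along v); keep current v
        v = [x / norm for x in w]
    return v


def pddp_split(docs):
    """Split documents into two index lists using the hyperplane that passes
    through the centroid and is normal to the principal direction."""
    n, dim = len(docs), len(docs[0])
    centroid = [sum(d[j] for d in docs) / n for j in range(dim)]
    centered = [[d[j] - centroid[j] for j in range(dim)] for d in docs]
    u = principal_direction(centered)
    left, right = [], []
    for i, row in enumerate(centered):
        # Signed distance of the document from the splitting hyperplane
        score = sum(row[j] * u[j] for j in range(dim))
        (left if score <= 0 else right).append(i)
    return left, right


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(pddp_split(docs))
```

On the four toy vectors above, documents 0 and 1 end up on one side of the hyperplane and documents 2 and 3 on the other; which side is labelled left depends on the arbitrary sign of the computed direction. A full PDDP run would apply this split recursively, and the reallocation step with sGEM described in the abstract would then revise the resulting memberships.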


References

1. Boley, D.: Principal direction divisive partitioning. Data Mining and Knowledge Discovery 2(4), 325–344 (1998)
2. Boley, D., Borst, V.: Unsupervised clustering: A fast scalable method for large datasets. CSE Report TR-99-029, University of Minnesota (1999)
3. Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 91–99 (1998)
4. Chickering, D., Heckerman, D., Meek, C.: A Bayesian approach to learning Bayesian networks with local structure. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 80–89. Morgan Kaufmann, San Francisco (1997)
5. Dasgupta, S., Schulman, L.J.: A two-round variant of EM for Gaussian mixtures. In: Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI (2000)
6. Golub, G., Van Loan, C.: Matrix Computations. The Johns Hopkins University Press, Baltimore (1989)
7. Hamerly, G., Elkan, C.: Learning the k in k-means. In: Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems, NIPS (2003)
8. He, J., Tan, A.-H., Tan, C.-L., Sung, S.-Y.: On Quantitative Evaluation of Clustering Systems. In: Wu, W., Xiong, H. (eds.) Information Retrieval and Clustering. Kluwer Academic Publishers, Dordrecht (2003)
9. Kass, R.E., Raftery, A.E.: Bayes factors. Journal of the American Statistical Association 90, 773–795 (1995)
10. Lang, K.: Newsweeder: Learning to filter netnews. In: Proceedings of the Twelfth International Conference on Machine Learning, pp. 331–339 (1995)
11. McCallum, A.K.: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow
12. Rasmussen, E.: Clustering algorithms. In: Frakes, W., Baeza-Yates, R. (eds.) Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs (1992)
13. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal 24(5), 513–523 (1988)
14. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (1999)
15. Strehl, A., Ghosh, J., Mooney, R.J.: Impact of similarity measures on web-page clustering. In: Proceedings of AAAI Workshop on AI for Web Search, pp. 58–64 (2000)
16. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
17. Zhong, S., Ghosh, J.: A comparative study of generative models for document clustering. In: SDM Workshop on Clustering High Dimensional Data and Its Applications (2003)


Author information

Authors and Affiliations

Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, 112 Paholyothin Road, Klong 1, Klong Luang, Pathumthani, 12120, Thailand
Canasai Kruengkrai, Virach Sornlertlamvanich & Hitoshi Isahara


Editor information

Editors and Affiliations

School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng
The Key Laboratory of Power System Protection and Dynamic Security Monitoring and Control under Ministry of Education, North China Electric Power University, Zhuxinzhuang Dewai, 102206, Beijing, China
Ming Zhou
Department of Systems Engineering and Engineering Management, Shatin, The Chinese University of Hong Kong, Hong Kong, N.T.
Kam-Fai Wong
5F, Beijing Sigma Center, Microsoft Research Asia, No. 49 Zhichun Road Haidian District, 100080, Beijing, China
Hong-Jiang Zhang


Cite this paper

Kruengkrai, C., Sornlertlamvanich, V., Isahara, H. (2005). Document Clustering Using Linear Partitioning Hyperplanes and Reallocation. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_4


DOI: https://doi.org/10.1007/978-3-540-31871-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)
