Abstract
This paper presents a novel algorithm for document clustering that combines the Principal Direction Divisive Partitioning (PDDP) algorithm [1] with a simplified version of the EM algorithm called the spherical Gaussian EM (sGEM) algorithm. PDDP recursively splits the data samples into two sub-clusters using the hyperplane normal to the principal direction derived from the covariance matrix. However, PDDP can yield poor results, especially when clusters are not well separated from one another. To improve the quality of the clustering results, we address this problem by re-allocating cluster membership using the sGEM algorithm with different settings. Furthermore, based on the theoretical background of the sGEM algorithm, we naturally extend the framework to estimate the number of clusters using the Bayesian Information Criterion. Experimental results on two different corpora demonstrate the effectiveness of our algorithm.
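As a rough illustration of the single PDDP split described in the abstract, the sketch below centers the samples, estimates the principal direction, and separates the samples by the sign of their projection onto it. It is a minimal sketch and not the authors' implementation: the function name `pddp_split`, its parameters, and the toy 2-D points are hypothetical, and power iteration is used here as a dependency-free stand-in for whatever eigenvector or SVD routine the paper relies on. The sGEM re-allocation and BIC steps are not shown.

```python
import math
import random


def pddp_split(points, iterations=200, seed=0):
    """Split samples into two groups with the hyperplane that passes
    through the centroid and is normal to the principal direction.

    Hypothetical helper for illustration only.
    Returns (left_indices, right_indices, direction).
    """
    n, d = len(points), len(points[0])

    # Centroid of the samples, then center every sample on it.
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]

    # Power iteration on X^T X, which is proportional to the covariance
    # matrix. The product is computed as X^T (X v) so the d-by-d matrix
    # is never formed explicitly.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # all samples coincide, so there is no direction to split on
        v = [c / norm for c in w]

    # Assign each sample to a side according to the sign of its projection.
    left, right = [], []
    for idx, x in enumerate(centered):
        score = sum(x[j] * v[j] for j in range(d))
        (left if score <= 0.0 else right).append(idx)
    return left, right, v


# Toy data (an assumption): two small, well-separated groups in 2-D.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
left, right, direction = pddp_split(toy_points)
```

The sign of the principal direction is arbitrary, so which group ends up as `left` or `right` can vary; only the partition itself is meaningful. In a full PDDP run the same split would be applied recursively to the resulting sub-clusters.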
References
Boley, D.: Principal direction divisive partitioning. Data Mining and Knowledge Discovery 2(4), 325–344 (1998)
Boley, D., Borst, V.: Unsupervised clustering: A fast scalable method for large datasets. CSE Report TR-99-029, University of Minnesota (1999)
Bradley, P.S., Fayyad, U.M.: Refining initial points for k-means clustering. In: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 91–99 (1998)
Chickering, D., Heckerman, D., Meek, C.: A Bayesian approach to learning Bayesian networks with local structure. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pp. 80–89. Morgan Kaufmann, San Francisco (1997)
Dasgupta, S., Schulman, L.J.: A two-round variant of EM for Gaussian mixtures. In: Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI (2000)
Golub, G., Van Loan, C.: Matrix Computations. The Johns Hopkins University Press, Baltimore (1989)
Hamerly, G., Elkan, C.: Learning the k in k-means. In: Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems, NIPS (2003)
He, J., Tan, A.-H., Tan, C.-L., Sung, S.-Y.: On Quantitative Evaluation of Clustering Systems. In: Wu, W., Xiong, H. (eds.) Information Retrieval and Clustering, Kluwer Academic Publishers, Dordrecht (2003)
Kass, R.E., Raftery, A.E.: Bayes factors. Journal of the American Statistical Association 90, 773–795 (1995)
Lang, K.: NewsWeeder: Learning to filter netnews. In: Proceedings of the Twelfth International Conference on Machine Learning, pp. 331–339 (1995)
McCallum, A.K.: Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering, http://www.cs.cmu.edu/~mccallum/bow
Rasmussen, E.: Clustering algorithms. In: Frakes, W., Baeza-Yates, R. (eds.) Information retrieval: data structures and algorithms. Prentice-Hall, Englewood Cliffs (1992)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal 24(5), 513–523 (1988)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (1999)
Strehl, A., Ghosh, J., Mooney, R.J.: Impact of similarity measures on web-page clustering. In: Proceedings of AAAI Workshop on AI for Web Search, pp. 58–64 (2000)
Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
Zhong, S., Ghosh, J.: A comparative study of generative models for document clustering. In: SDM Workshop on Clustering High Dimensional Data and Its Applications (2003)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kruengkrai, C., Sornlertlamvanich, V., Isahara, H. (2005). Document Clustering Using Linear Partitioning Hyperplanes and Reallocation. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_4
DOI: https://doi.org/10.1007/978-3-540-31871-2_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)